Data cleansing (cleaning) is not a very structured practice, since data comes in all sizes and formats. In general, these are the steps to follow:
1. Import data
First, make sure you have the right tool to clean your data. Excel and Google Sheets work for small datasets in a spreadsheet, but they don’t work for databases or larger datasets. If you have a database or a larger dataset, you’ll probably need cloud-based software to clean your data, and perhaps to build a pipeline as well.
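If you prefer a script to a spreadsheet, a minimal import sketch in Python might look like the following. The column names (`id`, `name`, `amount`) and the inline sample text are made up for illustration; a real import would read from your own file or database.

```python
import csv
import io

# Stand-in for a real file (hypothetical sample data).
# In practice you would use open("your_file.csv", newline="") instead.
raw_text = "id,name,amount\n1,Ada,25.0\n2,Grace,\n"

with io.StringIO(raw_text) as handle:
    rows = list(csv.DictReader(handle))

# Every value arrives as a string, so convert types explicitly
# and store blank cells as None so they are easy to find later.
for row in rows:
    row["id"] = int(row["id"])
    row["amount"] = float(row["amount"]) if row["amount"] else None
```

Converting types at import time means missing values show up as `None` right away instead of hiding as empty strings.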
2. Build data (join/merge)
This is the fun part. Once you have imported all your data into one place, you can start combining the datasets by joining or merging them. Since databases and big data are usually stored in an organized, often relational, manner, you need to combine them “logically”. For example, to join two tables you’ll need a key they share (typically the primary key of one table, referenced by a column in the other), and when merging multiple tables you should make sure their columns and schemas match.
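As a rough illustration of joining on a key, here is a small sketch using Python's built-in `sqlite3` module. The `customers` and `orders` tables and their contents are hypothetical; the point is only to show a join on a shared key and how unmatched rows surface.

```python
import sqlite3

# Hypothetical tables: customers (primary key: id) and
# orders (customer_id refers to customers.id).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 3, 15.0);
""")

# LEFT JOIN keeps every order, even one whose customer_id has no match,
# so broken keys stay visible instead of being silently dropped.
joined = conn.execute("""
    SELECT o.order_id, c.name, o.amount
    FROM orders AS o
    LEFT JOIN customers AS c ON c.id = o.customer_id
    ORDER BY o.order_id
""").fetchall()
conn.close()

# Orders whose customer could not be found (name is NULL / None).
unmatched = [row for row in joined if row[1] is None]
```

An inner join would have dropped order 12 without warning; choosing a left join here is a deliberate way to find keys that do not line up before you trust the merged table.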
3. Deduplicate, extract, handle missing data, and rebuild
This is the complicated part. Your data can be messy. You may want to filter out irrelevant records, rename columns, handle missing data, delete duplicates, and so on. All of this needs to be done based on your own domain knowledge and sometimes a bit of math or statistics. If your data is too hard to handle, consider finding an expert to do it, because it may take them much less time than it would take you.
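The operations above can be sketched in plain Python as follows. The records, the `"Cust Name"` column, the `"ZZ"` test-country rule, and the choice to fill missing ages with the median are all assumptions made for this example; the right rules depend on your own data and domain.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
raw = [
    {"Cust Name": "Ada", "age": 36, "country": "UK"},
    {"Cust Name": "Ada", "age": 36, "country": "UK"},      # exact duplicate
    {"Cust Name": "Grace", "age": None, "country": "US"},  # missing age
    {"Cust Name": "Linus", "age": 28, "country": "FI"},
    {"Cust Name": "Test", "age": 0, "country": "ZZ"},      # irrelevant test record
]

# 1. Rename the awkward column.
renamed = [{"name": r["Cust Name"], "age": r["age"], "country": r["country"]} for r in raw]

# 2. Filter out irrelevant records (assumed rule: country "ZZ" marks test data).
relevant = [r for r in renamed if r["country"] != "ZZ"]

# 3. Drop exact duplicates while keeping the original order.
seen = set()
deduped = []
for r in relevant:
    key = (r["name"], r["age"], r["country"])
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# 4. Fill missing ages with the median of the known ages (one option among many).
known_ages = [r["age"] for r in deduped if r["age"] is not None]
fill_value = median(known_ages)
cleaned = [{**r, "age": fill_value if r["age"] is None else r["age"]} for r in deduped]
```

Filling with the median is only one way to handle missing data. Dropping the row or leaving the value empty may be more honest, depending on how the data will be used.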
4. Validate and QA
Next, you need to make sure that the data is good enough for your application. Whether you’re building an app, chart, dashboard, table, or archive, make sure the data fits your production needs. One of the best ways to do this is to build a quick chart yourself and see whether the data makes sense.
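Alongside a quick chart, a few automated checks can catch obvious problems. The sketch below is one possible approach, not a standard tool: the `validate` helper, the required columns, and the 0 to 120 age range are all hypothetical choices for illustration.

```python
def validate(rows, required, ranges):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for i, row in enumerate(rows):
        # Completeness: required columns must not be missing or blank.
        for col in required:
            if row.get(col) in (None, ""):
                problems.append(f"row {i}: missing {col}")
        # Plausibility: numeric columns must fall inside the expected range.
        for col, (low, high) in ranges.items():
            value = row.get(col)
            if value is not None and not (low <= value <= high):
                problems.append(f"row {i}: {col}={value} outside [{low}, {high}]")
    return problems


# Hypothetical sample data with one blank name and one implausible age.
rows = [
    {"name": "Ada", "age": 36},
    {"name": "", "age": 28},
    {"name": "Grace", "age": 212},
]
problems = validate(rows, required=["name", "age"], ranges={"age": (0, 120)})
```

Returning a list of problems rather than stopping at the first one lets you review everything that is wrong in a single pass.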
Once your data is good to go, you can start transporting or publishing it.
Once you’ve finished cleaning your data, it should be valid, accurate, complete, consistent, and uniform. Since the consumers or audience of your data probably don’t know it as well as you do, you need to make it well formatted and easy to understand. This way, your data will power your or your team’s decision making much more efficiently.
If these steps are still too vague, you can use data cleaning software to import, merge, find and replace, deduplicate, and filter your data easily.