What is Data Munging?

Data munging is a set of concepts and a methodology for turning unusable or error-ridden data into the structure and quality that modern analytics processes and consumers require.

How does Data Munging work?

  1. Data exploration: Munging begins with data exploration. Whether an analyst is taking a first look at completely new data in initial data analysis (IDA), or a data scientist is searching for novel associations in existing records in exploratory data analysis (EDA), the process starts with some degree of data discovery.

  2. Data transformation: Once a sense of the raw data’s contents and structure has been established, the data must be transformed into new formats appropriate for downstream processing. This step involves the pure restructuring of data, for example, un-nesting hierarchical JSON data, denormalizing disparate tables so relevant information can be accessed from one place, or reshaping and aggregating time series data to the dimensions and spans of interest.

  3. Data enrichment: Optionally, once data is ready for consumption, data managers might choose to perform additional enrichment steps. This involves finding external sources of information to expand the scope or content of existing records. For example, an ice-cream shop might use an open-source weather data set to add daily temperature to its sales figures.

  4. Data validation: The final, and perhaps most important, munging step is validation. At this point, the data is ready to be used, but certain common-sense or sanity checks are critical if one wishes to trust the processed data. This step allows users to discover typos, incorrect mappings, problems with transformation steps, and even the rare corruption caused by computational failure or error.
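
The four steps above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not a prescribed implementation: the sample records, the weather lookup, and the function names (explore, transform, enrich, validate) are all hypothetical and chosen to mirror the ice-cream shop example.

```python
import json

# Hypothetical raw input: nested JSON sales records for an ice-cream shop.
RAW = '''[
  {"date": "2024-07-01", "sale": {"flavor": "vanilla", "amount": "12.50"}},
  {"date": "2024-07-01", "sale": {"flavor": "mint", "amount": "7.00"}},
  {"date": "2024-07-02", "sale": {"flavor": "vanilla", "amount": "9.25"}}
]'''

# Hypothetical external data: daily temperature in degrees Celsius.
WEATHER = {"2024-07-01": 29.0, "2024-07-02": 24.5}


def explore(records):
    """Exploration: count how often each top-level key occurs."""
    counts = {}
    for rec in records:
        for key in rec:
            counts[key] = counts.get(key, 0) + 1
    return counts


def transform(records):
    """Transformation: un-nest each record and aggregate sales per day."""
    totals = {}
    for rec in records:
        amount = float(rec["sale"]["amount"])  # amounts arrive as strings
        totals[rec["date"]] = totals.get(rec["date"], 0.0) + amount
    return [{"date": d, "total": t} for d, t in sorted(totals.items())]


def enrich(rows, weather):
    """Enrichment: attach the day's temperature from the external source."""
    return [{**row, "temp_c": weather.get(row["date"])} for row in rows]


def validate(rows):
    """Validation: return a list of problems found by simple sanity checks."""
    problems = []
    for row in rows:
        if row["total"] < 0:
            problems.append(f"{row['date']}: negative total")
        if row["temp_c"] is None:
            problems.append(f"{row['date']}: missing temperature")
    return problems


records = json.loads(RAW)
print(explore(records))                      # which fields exist, and how often
daily = enrich(transform(records), WEATHER)  # one enriched row per day
print(daily)
print(validate(daily))                       # an empty list means no problems found
```

Returning the problems as a list, rather than raising on the first failure, lets a reviewer see every issue in one pass. Real pipelines would typically add many more checks, such as type, range, and duplicate checks.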


