Once the raw data has been collected and stored in a dataset that is accessible to data analysts/data scientists, the focus should shift to data cleaning and processing. This requires testing for soundness and fixing errors, designing and implementing strategies to deal with missing values and outlying/influential observations, as well as low-level exploratory data analysis and visualisation to determine what data transformations and dimension reduction approaches will be needed in the final analysis. Analysts should be prepared to spend up to 80% of their time processing and cleaning the data.
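As a rough illustration of two of the steps mentioned above (handling missing values and flagging outlying observations), here is a minimal Python sketch. It is not taken from the report: the function names `impute_median` and `flag_outliers_iqr`, the sample values, and the choice of median imputation with the 1.5 × IQR rule are all assumptions made for this example. Other strategies may suit a given dataset better.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values.

    Hypothetical helper: median imputation is only one of many possible
    strategies for dealing with missing values.
    """
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]


def flag_outliers_iqr(values, k=1.5):
    """Mark values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper: the IQR rule with k=1.5 is a common convention,
    not a universal definition of an outlier.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v < lower or v > upper for v in values]


# Made-up measurements with two missing entries and one suspicious value.
raw = [12.0, 15.0, None, 14.0, 13.0, 250.0, 16.0, None, 11.0]

clean = impute_median(raw)
flags = flag_outliers_iqr(clean)

for value, is_outlier in zip(clean, flags):
    print(value, "possible outlier" if is_outlier else "")
```

Flagged values are only candidates for review; whether to correct, keep, or remove them depends on the context of the analysis.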

Data Science Report Series #7: Data Preparation (Draft), by Patrick Boily, Jennifer Schellinck, and Shintaro Hagiwara.

Post Author: Patrick Boily

Patrick is interested in the applications of mathematics and statistics to evidence-based decision support. He has worked on 25+ such projects since 2008. He has extensive experience in data science, machine learning, A.I. and predictive analytics, data cleaning and data visualization.