Data mining is the collection of processes by which we can extract useful insights from data. Inherent in this definition is the idea of data reduction: useful insights (whether in the form of summaries, sentiment analyses, etc.) ought to be “smaller” and “more organized” than the original raw data. The challenges presented by high data dimensionality (the so-called curse of dimensionality) must be addressed in order to achieve insightful and interpretable analytical results. In this report, we introduce the basic principles of dimensionality reduction and a number of feature selection methods (filter, wrapper, regularization), discuss some advanced topics (SVD, spectral feature selection, UMAP), and provide examples (with code).
Data Science Report Series #8: Feature Selection and Dimension Reduction, by Patrick Boily, Olivier Leduc, Andrew Macfie, Aditya Maheshwari, and Maia Pelletier.