This is my course project for the Seminar Data Mining (IN0014, IN4927) at the Technical University of Munich. My topic was described as: "Give a quick overview over the data mining pipeline and then focus on methods to preprocess data such that it can be successfully tackled by data mining methods." The final submission is a 6-page paper with a 20-minute presentation. In this blog post, I share the abstract and introduction of my paper. The full text of this work can be found on arXiv.
Data mining is about obtaining new knowledge from existing datasets. However, the data in existing datasets can be scattered, noisy, and even incomplete. Although a lot of effort is spent on developing or fine-tuning data mining models to make them more robust to noise in the input data, their quality still depends strongly on the quality of that data. The article starts with an overview of the data mining pipeline, in which the procedures of a data mining task are briefly introduced. It then gives an overview of data preprocessing techniques, which are categorized as data cleaning, data transformation, and data reduction. Detailed preprocessing methods, as well as their influence on data mining models, are covered in this article.
Data mining is a knowledge acquisition process: it takes data from various sources and transforms it into knowledge, thereby providing insight into its application field. A data mining pipeline is a typical example of an end-to-end data mining system: it integrates all data mining procedures and delivers knowledge directly from the data source to a human.
The purpose of data preprocessing is to make the data easier for data mining models to tackle. The quality of the data can have a significant influence on data mining models. It is commonly held that the data and features set the upper bound on the knowledge that can be obtained, and that data mining models merely approximate this upper bound. Various preprocessing techniques have been devised to make the data meet the input requirements of the model, improve its relevance to the prediction target, and make the optimization step of the model easier.
Raw data obtained from the real world is often badly shaped. The problems include missing values (e.g., a patient did not go through all the tests), duplications (e.g., annual income and monthly income), outlier values (e.g., age is -1), and contradictions (e.g., gender is male and the patient is recorded as pregnant) in the dataset. Although existing preprocessing techniques cannot guarantee to solve all these problems, they can at least correct some of them and improve the performance of the models.
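The cleaning steps named above can be sketched with nothing but the Python standard library. This is a minimal illustration, not the method used in the paper: the records, the `clean` helper, the valid age range of 0 to 120, and the choice of median imputation are all assumptions made for the example, and "duplication" is simplified here to exact duplicate rows.

```python
from statistics import median

# Hypothetical patient records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "gender": "female", "pregnant": True},
    {"id": 2, "age": -1, "gender": "male", "pregnant": False},      # impossible age
    {"id": 3, "age": None, "gender": "female", "pregnant": False},  # missing age
    {"id": 4, "age": 52, "gender": "male", "pregnant": True},       # contradiction
    {"id": 1, "age": 34, "gender": "female", "pregnant": True},     # duplicate row
    {"id": 5, "age": 46, "gender": "female", "pregnant": False},
]


def clean(rows, min_age=0, max_age=120):
    # 1. Drop exact duplicate rows, keeping the first occurrence.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))  # copy so the input is not modified

    # 2. Drop rows that contradict themselves.
    unique = [r for r in unique if not (r["gender"] == "male" and r["pregnant"])]

    # 3. Treat out-of-range ages as missing.
    for row in unique:
        if row["age"] is not None and not (min_age <= row["age"] <= max_age):
            row["age"] = None

    # 4. Fill missing ages with the median of the remaining valid ages.
    fill = median(r["age"] for r in unique if r["age"] is not None)
    for row in unique:
        if row["age"] is None:
            row["age"] = fill
    return unique


cleaned = clean(records)
for row in cleaned:
    print(row)
```

Whether a contradictory row should be dropped, corrected, or flagged for manual review depends on the application; dropping it is only the simplest option.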
The data type and distribution of the data are usually transformed before the data is sent to data mining models. The purposes of data transformation include making the data meet the input requirements of the models, removing noise from the data, and making the distribution of the data more suitable for the optimization algorithms used in the model training step.
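As a rough illustration of what such transformations look like, the sketch below implements four common ones in plain Python: z-score standardization, min-max scaling, a log transform for a skewed attribute, and one-hot encoding of a categorical attribute. The function names and the sample values are assumptions chosen for the example, and the helpers assume the input is not constant.

```python
import math
from statistics import mean, pstdev


def standardize(values):
    # Rescale to zero mean and unit (population) standard deviation.
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max(values):
    # Rescale linearly so that the smallest value is 0 and the largest is 1.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(labels):
    # Turn each categorical label into a 0/1 vector, one slot per category.
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]


# Hypothetical, heavily skewed income attribute.
incomes = [20_000, 35_000, 50_000, 1_000_000]

z_scores = standardize(incomes)
scaled = min_max(incomes)
log_incomes = [math.log10(v) for v in incomes]  # compresses the long tail
colors = one_hot(["red", "green", "red"])       # categories: green, red

print(scaled)
print(log_incomes)
print(colors)
```

Standardization and min-max scaling change the scale of an attribute, while the log transform changes the shape of its distribution; one-hot encoding changes the data type so that models expecting numeric input can consume categorical values.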
The input to data mining models can be huge: it may have too many dimensions or too many instances, which makes it difficult to train the data mining model or causes trouble when transferring and storing the data. Data reduction techniques mitigate this problem by reducing the number of dimensions (known as dimensionality reduction) or the amount of data (known as instance selection and sampling).
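The two directions of reduction can be illustrated with a small standard-library sketch. Note the simplification: instead of a projection method, the column-wise reduction here simply drops attributes whose variance falls below a threshold, which is a basic form of feature selection and only a stand-in for dimensionality reduction. The row-wise reduction is plain random sampling without replacement. The helper names, the threshold, and the data are assumptions made for the example.

```python
import random
from statistics import pvariance


def drop_low_variance(rows, threshold):
    # Keep only the columns whose population variance exceeds the threshold.
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    return [[row[i] for i in keep] for row in rows], keep


def sample_rows(rows, k, seed=0):
    # Draw k rows uniformly at random without replacement (seeded for repeatability).
    rng = random.Random(seed)
    return rng.sample(rows, k)


# Hypothetical dataset: column 1 is constant and column 2 barely varies.
data = [
    [1.0, 0.0, 10.0],
    [2.0, 0.0, 10.1],
    [3.0, 0.0, 9.9],
    [4.0, 0.0, 10.0],
]

reduced, kept = drop_low_variance(data, threshold=0.1)
subset = sample_rows(reduced, k=2)

print(kept)     # indices of the columns that survived
print(reduced)
print(subset)
```

Variance is scale-dependent, so in practice a threshold like this is usually applied after the attributes have been brought to a comparable scale.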
Python and R are among the most popular tools for implementing data preprocessing. With the large number of available packages, such as scikit-learn and PreProcess, most of the preprocessing algorithms covered in this paper can be applied without having to consider their implementation details.
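To give an impression of how little code such packages require, here is a minimal sketch using scikit-learn, assuming NumPy and scikit-learn are installed. It chains mean imputation of missing values with standardization; the small array `X` and the choice of these two steps are assumptions made for illustration, not a recommendation from the paper.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with one missing value (np.nan) in the second column.
X = np.array([
    [1.0, 200.0],
    [2.0, np.nan],
    [3.0, 600.0],
])

# Fill missing values with the column mean, then scale each column
# to zero mean and unit variance.
preprocess = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
X_clean = preprocess.fit_transform(X)

print(X_clean)
```

Wrapping the steps in a pipeline means the same fitted imputation and scaling parameters can later be applied to new data with `preprocess.transform`.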
In the following section, the data mining pipeline and its primary procedures will be introduced. The remaining sections focus on the steps of data preprocessing: Section 3 introduces the techniques used in data cleaning, while Section 4 covers the data transformation techniques. In the last section, data reduction techniques are discussed.