Abstract:
The aim of this master's thesis was to develop a semi-automated data cleansing tool that performs data cleansing procedures on noisy datasets with minimal user input, producing clean, high-quality datasets ready for modelling. As part of this project, I developed a desktop application with a simple and informative graphical user interface that takes the initial dataset as input from the user and produces the cleaned dataset as output, along with a log file listing the actions taken during processing. In addition to the application, I developed an automated rule-based algorithm that decides which data cleaning and pre-processing techniques to apply in order to improve the quality of the produced dataset.
The application was developed using the PyQt package for Python, while the program logic relied on other machine learning and visualization packages such as sklearn and matplotlib. The performance of the application was evaluated in terms of accuracy, quality, and reproducibility. For this purpose, noise in the form of missing values and duplicates was added to the datasets at four different ratios, and the datasets were then cleaned and pre-processed using the semi-automated data cleansing tool. The cleaned datasets were then fed into a random forest classifier to evaluate how the accuracy of the model varied. The performance of the program was also examined in terms of processing time as a function of the noise ratio in the dataset. Finally, I evaluated the automatic mode of the program by comparing the accuracy of the machine learning model across the two cleaning modes, Auto and Manual.
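The noise-injection step of this evaluation can be illustrated with a minimal sketch. It is not the code used in the thesis: the function name `inject_noise`, the toy dataset, and the four ratio values are assumptions chosen for illustration, and only the Python standard library is used.

```python
import random


def inject_noise(rows, missing_ratio, duplicate_ratio, seed=0):
    """Return a noisy copy of `rows` (a list of equal-length lists).

    missing_ratio: fraction of all cells replaced with None.
    duplicate_ratio: number of duplicated rows appended, expressed as a
    fraction of the original row count.
    Both definitions are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    noisy = [list(row) for row in rows]  # copy so the input stays untouched
    n_rows = len(noisy)
    n_cols = len(noisy[0]) if noisy else 0

    # Blank out a fixed number of distinct cells chosen uniformly at random.
    cells = [(r, c) for r in range(n_rows) for c in range(n_cols)]
    n_missing = round(missing_ratio * len(cells))
    for r, c in rng.sample(cells, n_missing):
        noisy[r][c] = None

    # Append exact copies of randomly chosen (already noisy) rows.
    n_dupes = round(duplicate_ratio * n_rows)
    for _ in range(n_dupes):
        noisy.append(list(noisy[rng.randrange(n_rows)]))
    return noisy


# Toy dataset and hypothetical noise ratios, for illustration only.
clean = [[i, i * 2, i % 3] for i in range(20)]
for ratio in (0.05, 0.10, 0.20, 0.40):
    noisy = inject_noise(clean, ratio, ratio, seed=42)
    n_none = sum(cell is None for row in noisy for cell in row)
    print(f"ratio={ratio:.2f}: {len(noisy)} rows, {n_none} missing cells")
```

Fixing the seed keeps each noisy dataset reproducible, so that differences in model accuracy can be attributed to the cleaning step rather than to the random noise pattern.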
Overall, the application successfully performed automatic data cleansing and pre-processing techniques such as duplicate handling, outlier handling, missing data imputation, normalization, and standardization without user intervention. Additionally, in most cases the automatic mode produced better data quality than the manual mode. Finally, the application not only produced a cleaned dataset but also visualized the noise detected in the dataset and logged it, together with the modification applied in each case, for later reference by the user.
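To make the listed techniques concrete, the following sketch chains duplicate handling, missing data imputation, outlier handling, and normalization on purely numeric data, and records each action in a log. It does not reproduce the rule-based algorithm of the thesis: the function name `clean_numeric_rows`, the order of the steps, and the specific methods (mean imputation, clipping to 1.5 × IQR fences, min-max scaling) are assumptions chosen for illustration.

```python
import statistics


def clean_numeric_rows(rows):
    """Clean equal-length numeric rows; None marks a missing cell.

    Returns (cleaned_rows, log). Step order and methods are illustrative
    assumptions, not the rule set described in the thesis.
    """
    log = []

    # 1. Duplicate handling: keep the first occurrence of each row.
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(list(row))
    log.append(f"removed {len(rows) - len(unique)} duplicate row(s)")

    n_cols = len(unique[0]) if unique else 0
    for c in range(n_cols):
        observed = [row[c] for row in unique if row[c] is not None]
        if len(observed) < 2:
            log.append(f"column {c}: too few observed values, skipped")
            continue

        # 2. Missing data imputation: fill gaps with the column mean.
        mean = statistics.fmean(observed)
        n_filled = 0
        for row in unique:
            if row[c] is None:
                row[c] = mean
                n_filled += 1

        # 3. Outlier handling: clip values to the 1.5 * IQR fences.
        q1, _, q3 = statistics.quantiles(observed, n=4)
        low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
        n_clipped = 0
        for row in unique:
            clipped = min(max(row[c], low), high)
            n_clipped += clipped != row[c]
            row[c] = clipped

        # 4. Normalization: min-max scale the column to [0, 1].
        lo, hi = min(r[c] for r in unique), max(r[c] for r in unique)
        if hi > lo:
            for row in unique:
                row[c] = (row[c] - lo) / (hi - lo)
        log.append(f"column {c}: imputed {n_filled}, clipped {n_clipped}")
    return unique, log


# Toy input with one duplicate row and one missing cell.
raw = [[1.0, 10.0], [1.0, 10.0], [2.0, None], [3.0, 30.0], [100.0, 20.0]]
cleaned, actions = clean_numeric_rows(raw)
print(cleaned)
print("\n".join(actions))
```

Standardization (zero mean, unit variance) could take the place of the min-max step; a fixed choice is made here only to keep the sketch short.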