Proceedings
Unsupervised Machine Learning in Data Cleaning Process and Clustering of Indonesia Dengue Fever Dataset 2016
Unsupervised machine learning is a branch of machine learning in which an algorithm identifies structure in the available data without guidance or supervision, typically in the form of clustering. Clustering analysis depends on well-prepared data: missing values, zero values, inappropriate data input, and outliers can all affect the overall analysis of a dataset. In this study, data cleaning tools are used to prepare the Indonesia dengue fever 2016 dataset. KNNImputer is used to impute missing values, and the Mahalanobis distance is used to detect outliers. A correlation matrix and K-means clustering are then used to analyze and cluster the significant variables. KNNImputer and the Mahalanobis distance were found to be suitable for imputation and outlier detection, respectively, on the dengue fever 2016 dataset. The correlation matrix showed a positive linear relationship between total cases and total deaths from dengue fever in 2016. K-means clustering of the dataset with CFR gave higher accuracy, with 15 out of 25 areas accurately clustered, whereas clustering of the dataset without significant variables gave lower accuracy, with 12 out of 25 areas accurately clustered.
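The steps named in the abstract (imputation, outlier detection, correlation, clustering) can be sketched as follows. This is a minimal illustration only, not the authors' implementation: it assumes scikit-learn's `KNNImputer` and `KMeans`, and the toy array, the choice of two neighbors, the distance cutoff of 2.0, and the choice of two clusters are all hypothetical and do not come from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one row per area, columns = (total cases, total deaths).
# np.nan marks a missing entry. None of these numbers come from the paper.
X = np.array([
    [120.0, 2.0],
    [150.0, 3.0],
    [np.nan, 1.0],
    [900.0, 15.0],
    [1000.0, np.nan],
    [130.0, 2.0],
    [950.0, 14.0],
    [140.0, 60.0],  # deliberately unusual row: few cases, many deaths
])

# Step 1: fill each missing value with the mean of its k nearest rows
# (k=2 is an arbitrary choice for this tiny example).
X_imputed = KNNImputer(n_neighbors=2).fit_transform(X)


def mahalanobis_distances(data):
    """Mahalanobis distance of every row from the sample mean."""
    centered = data - data.mean(axis=0)
    inv_cov = np.linalg.pinv(np.cov(data, rowvar=False))
    squared = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    return np.sqrt(squared)


# Step 2: flag rows that lie far from the bulk of the data.
# The cutoff is an assumption for illustration; on larger samples a
# chi-square quantile is a common alternative.
distances = mahalanobis_distances(X_imputed)
cutoff = 2.0
X_clean = X_imputed[distances < cutoff]

# Step 3: correlation matrix between total cases and total deaths.
corr = np.corrcoef(X_clean, rowvar=False)

# Step 4: K-means on standardized features (two clusters assumed here).
X_scaled = StandardScaler().fit_transform(X_clean)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

print("Mahalanobis distances:", np.round(distances, 2))
print("Correlation (cases vs deaths):", round(float(corr[0, 1]), 3))
print("Cluster labels:", labels)
```

Standardizing before K-means keeps the larger-valued column (total cases) from dominating the Euclidean distances. Whether that matches the preprocessing used in the study is not stated in the abstract.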