Research on the Effectiveness of Different Outlier Detection Methods in Common Data Distribution Types

Published: 2024-04-27

Outlier detection is widely applied in domains such as network performance optimization and machine learning data preprocessing. In machine learning, its objective is to improve data quality and thereby the performance of subsequent statistical analysis or model training. Many effective and reliable outlier detection methods exist, yet their effectiveness varies significantly with the distribution of the underlying data, so selecting an appropriate method is critical. This study conducts outlier detection on sample data drawn from five continuous probability distributions (normal, chi-square, exponential, gamma, and Student's t) and four discrete probability distributions (binomial, Poisson, geometric, and hypergeometric). Five outlier detection methods are applied (Z-Score, IQR, DBSCAN, Isolation Forest, and Random Forest), and their detection effectiveness is assessed. Through comparison and analysis, the study summarizes the characteristics each method exhibits when processing samples from the different distribution types. These findings are intended to support more informed method selection across diverse outlier detection scenarios.
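As an illustration of the kind of comparison described above, the following minimal sketch applies two of the named methods, Z-Score and IQR, to samples from three of the listed distributions. It is not the study's implementation: the function names, the Z-score threshold of 3, the IQR multiplier of 1.5, the sample size, and the distribution parameters are all assumptions chosen for illustration.

```python
import numpy as np


def zscore_outliers(x, threshold=3.0):
    """Flag points whose absolute z-score exceeds `threshold` (assumed default: 3)."""
    x = np.asarray(x, dtype=float)
    std = x.std()
    if std == 0:
        # Constant data has no spread, so nothing can be flagged.
        return np.zeros(x.shape, dtype=bool)
    return np.abs((x - x.mean()) / std) > threshold


def iqr_outliers(x, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (assumed default: k = 1.5)."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)


# Hypothetical samples; sizes and parameters are illustrative only.
rng = np.random.default_rng(0)
samples = {
    "normal": rng.normal(loc=0.0, scale=1.0, size=1000),
    "exponential": rng.exponential(scale=1.0, size=1000),
    "poisson": rng.poisson(lam=3.0, size=1000),
}

for name, x in samples.items():
    n_z = int(zscore_outliers(x).sum())
    n_iqr = int(iqr_outliers(x).sum())
    print(f"{name:12s} z-score flags: {n_z:4d}   IQR flags: {n_iqr:4d}")
```

Both rules are built on summary statistics of the sample, so on skewed distributions such as the exponential they may flag ordinary tail values as outliers. Differences of this kind between methods and distribution types are what the study sets out to compare.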