Week 3 - Data Preprocessing
Dirty data
Incomplete: missing values
Noisy: errors, outliers (e.g. age = -42)
Inconsistent: format (e.g. 2000/01/01 vs 01/01/2000)
Intentional: fake data
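A minimal sketch of how the first three kinds of dirty data might be flagged in code. The field names (age, signup), the plausible age range of 0 to 120, and the choice of YYYY/MM/DD as the expected date format are all assumptions made for illustration. Intentionally fake data usually cannot be caught by simple checks like these.

```python
from datetime import datetime

def check_record(record):
    """Return a list of data-quality problems found in one record (a dict).

    Field names and thresholds are hypothetical, chosen for illustration.
    """
    problems = []

    age = record.get("age")
    if age is None:
        problems.append("incomplete: age is missing")
    elif not 0 <= age <= 120:  # assumed plausible range
        problems.append("noisy: age out of plausible range")

    date = record.get("signup")
    if date is not None:
        try:
            # Assume YYYY/MM/DD is the expected format.
            datetime.strptime(date, "%Y/%m/%d")
        except ValueError:
            problems.append("inconsistent: date is not YYYY/MM/DD")

    return problems

clean = check_record({"age": 30, "signup": "2000/01/01"})
dirty = check_record({"age": -42, "signup": "01/01/2000"})
missing = check_record({"signup": "2000/01/01"})
```

Here clean is an empty list, dirty reports both a noisy age and an inconsistent date, and missing reports the absent age.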
Data cleaning
Equal-width data binning
Divides the range into N intervals of equal size
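Sensitive to outliers: one extreme value stretches the range, so most samples end up in a few bins
Handles skewed data poorly: some bins are crowded while others are nearly empty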
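Equal-width binning can be sketched as follows. The function name equal_width_bins and the sample ages are hypothetical. The example includes one outlier (95) to show how it distorts the bins.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything goes into bin 0.
        return [0] * len(values)
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical sample: eight similar ages plus one outlier.
ages = [21, 22, 23, 24, 25, 26, 27, 28, 95]
bins = equal_width_bins(ages, 3)
counts = [bins.count(b) for b in range(3)]
print(counts)  # [8, 0, 1]
```

With the outlier present, eight of the nine samples share the first bin and the middle bin is empty.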
Equal-depth data binning
Divides the range into N intervals, each containing approximately the same number of samples
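Equal-depth binning can be sketched in the same style. Again, the function name equal_depth_bins and the sample ages are hypothetical. This simple version assigns bins by sorted rank, so tied values may be split across neighbouring bins.

```python
def equal_depth_bins(values, n_bins):
    """Assign bin indices so each bin holds roughly len(values) / n_bins samples."""
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        # Map the rank (0 .. n-1) proportionally onto bin indices (0 .. n_bins-1).
        bins[i] = rank * n_bins // len(values)
    return bins

# Same hypothetical sample, including the outlier 95.
ages = [21, 22, 23, 24, 25, 26, 27, 28, 95]
bins = equal_depth_bins(ages, 3)
counts = [bins.count(b) for b in range(3)]
print(counts)  # [3, 3, 3]
```

Each bin now holds three samples, and the outlier simply joins the last bin instead of stretching the intervals.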
Data transformation
Smoothing
Aggregation
Generalization