
Week 3 - Data Preprocessing

Dirty data

Incomplete: missing values

Noisy: errors, outliers (e.g. age = -42)

Inconsistent: format (e.g. 2000/01/01 vs 01/01/2000)

Intentional: fake data
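
As a rough illustration of the first three categories above, the sketch below flags incomplete, noisy, and inconsistent fields in a few made-up records. The records, the field names, the 0 to 120 age range, and the choice of YYYY/MM/DD as the canonical date format are all assumptions for this example; intentionally fake data generally cannot be caught by checks this simple.

```python
from datetime import datetime

# Hypothetical records, invented for illustration only.
records = [
    {"name": "Ann", "age": 34, "joined": "2000/01/01"},
    {"name": "Bob", "age": None, "joined": "01/01/2000"},
    {"name": "Cy", "age": -42, "joined": "2001/05/17"},
]

CANONICAL_FORMAT = "%Y/%m/%d"  # assumed canonical date format


def is_canonical_date(text):
    """Return True if text parses as YYYY/MM/DD."""
    try:
        datetime.strptime(text, CANONICAL_FORMAT)
        return True
    except ValueError:
        return False


def find_issues(record):
    """Return a list of dirty-data labels that apply to one record."""
    issues = []
    age = record["age"]
    if age is None:
        issues.append("incomplete")      # missing value
    elif not 0 <= age <= 120:            # assumed plausible range
        issues.append("noisy")           # error or outlier
    if not is_canonical_date(record["joined"]):
        issues.append("inconsistent")    # unexpected date format
    return issues


issues = [find_issues(r) for r in records]
```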

Data cleaning

Equal-width data binning

Divides the range into N intervals of equal size

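Sensitive to outliers: one extreme value stretches the range, so most samples crowd into a few bins

Handles skewed data poorly: some bins end up nearly empty while others are overloaded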
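
A minimal sketch of equal-width binning, assuming a plain list of numbers; the function name `equal_width_bins` and the sample ages are made up for illustration. It also shows the outlier problem: one extreme value leaves the middle bin empty.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything goes in the first bin.
        return [0] * len(values)
    # The maximum value would otherwise land in bin n_bins, so clamp it
    # into the last valid bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [21, 22, 23, 25, 26, 28, 30, 95]   # 95 is an outlier
bin_ids = equal_width_bins(ages, 3)
counts = [bin_ids.count(b) for b in range(3)]
```

With these sample values, seven of the eight ages fall into the first bin and the outlier sits alone in the last one.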

Equal-depth data binning

Divides the range into N intervals, each containing approximately the same number of samples
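
A minimal sketch of equal-depth (equal-frequency) binning, assuming a plain list of numbers; `equal_depth_bins` is a hypothetical helper name and the ages are illustrative. Cut points are computed with integer division, which is one of several reasonable ways to spread a remainder across bins.

```python
def equal_depth_bins(values, n_bins):
    """Split the sorted values into n_bins groups of nearly equal size."""
    ordered = sorted(values)
    n = len(ordered)
    # Cut point i marks where bin i starts in the sorted list.
    cuts = [i * n // n_bins for i in range(n_bins + 1)]
    return [ordered[cuts[i]:cuts[i + 1]] for i in range(n_bins)]


ages = [21, 22, 23, 25, 26, 28, 30, 95]
bins = equal_depth_bins(ages, 3)
sizes = [len(b) for b in bins]
```

On the same sample that skews equal-width binning, every bin here holds two or three values, and the outlier simply joins the last bin.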

Data transformation

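  • Smoothing: remove noise from the data

  • Aggregation: summarize the data (e.g. daily totals rolled up into monthly totals)

  • Generalization: replace low-level values with higher-level concepts (e.g. street replaced by city)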

Data smoothing

Python | Binning method for data smoothing - GeeksforGeeks
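
One common form of the binning method for smoothing is smoothing by bin means: sort the values, split them into equal-depth bins, and replace each value with the mean of its bin. The sketch below assumes that variant; the function name and the sample prices are illustrative and not taken from the linked article.

```python
from statistics import mean


def smooth_by_bin_means(values, n_bins):
    """Replace each sorted value with the mean of its equal-depth bin."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = [i * n // n_bins for i in range(n_bins + 1)]
    smoothed = []
    for i in range(n_bins):
        group = ordered[cuts[i]:cuts[i + 1]]
        if group:  # skip empty bins when n_bins exceeds the sample count
            smoothed.extend([mean(group)] * len(group))
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, 3)
```

The bins here are [4, 8, 15], [21, 21, 24], and [25, 28, 34], so the smoothed values are 9, 22, and 29, each repeated three times.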