This tutorial shows how to clean and impute null values in a dataset using Python. We'll use the pandas library to work with DataFrames.
import pandas as pd
Let's start by creating a sample DataFrame with some null values:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [24, None, 22, 23, None],
    'City': ['New York', 'Los Angeles', None, 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
print(df)
The output of the DataFrame will be:
      Name   Age         City
0    Alice  24.0     New York
1      Bob   NaN  Los Angeles
2  Charlie  22.0         None
3    David  23.0      Chicago
4      Eve   NaN      Houston
We can identify null values in the DataFrame using the isnull() and sum() methods:
print(df.isnull().sum())
The output will show the number of null values in each column:
Name    0
Age     2
City    1
dtype: int64
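Beyond per-column counts, it can help to see which rows are affected and what share of each column is missing. The following is a minimal sketch on the same sample data; the variable names has_null, rows_with_nulls, and null_fraction are illustrative and not part of the original tutorial.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [24, None, 22, 23, None],
    'City': ['New York', 'Los Angeles', None, 'Chicago', 'Houston'],
})

# Boolean mask: True for every row that has at least one null value
has_null = df.isnull().any(axis=1)
rows_with_nulls = df[has_null]

# Fraction of null values per column (the mean of a boolean column)
null_fraction = df.isnull().mean()

print(rows_with_nulls)
print(null_fraction)
```

Here any(axis=1) collapses each row to a single True/False, which is why it can be used directly to filter the DataFrame.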
If we want to drop rows that contain any null values, we can use the dropna() method:
df_dropped = df.dropna()
print(df_dropped)
The output will be:
    Name   Age      City
0  Alice  24.0  New York
3  David  23.0   Chicago
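Dropping every row with any null can discard a lot of data. As a sketch of two gentler options, dropna() also accepts a subset parameter (only look at certain columns) and a thresh parameter (minimum number of non-null values a row needs to be kept). The variable names below are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [24, None, 22, 23, None],
    'City': ['New York', 'Los Angeles', None, 'Chicago', 'Houston'],
})

# Drop a row only when 'Age' is null; a null 'City' is tolerated
df_age_known = df.dropna(subset=['Age'])

# Keep rows that have at least 2 non-null values out of the 3 columns
df_min_two = df.dropna(thresh=2)

print(df_age_known)
print(df_min_two)
```

In this sample every row has at least two non-null values, so thresh=2 keeps all five rows, while subset=['Age'] removes only Bob and Eve.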
To fill null values with a specific value, use the fillna() method. For example, we can fill null values with 0:
df_filled = df.fillna(0)
print(df_filled)
The output will be:
      Name   Age         City
0    Alice  24.0     New York
1      Bob   0.0  Los Angeles
2  Charlie  22.0            0
3    David  23.0      Chicago
4      Eve   0.0      Houston
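A single fill value is applied to every column, which is why 'City' ends up containing the number 0. One way to avoid that, sketched below, is to pass fillna() a dictionary that maps column names to fill values. The placeholder 'Unknown' is an assumption chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [24, None, 22, 23, None],
    'City': ['New York', 'Los Angeles', None, 'Chicago', 'Houston'],
})

# Per-column fill values: a number for 'Age', a text placeholder for 'City'
# ('Unknown' is a hypothetical placeholder, not a pandas convention)
df_filled = df.fillna({'Age': 0, 'City': 'Unknown'})

print(df_filled)
```

Columns that are not listed in the dictionary are left unchanged.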
We can also fill null values with the mean, median, or mode of the column. Here is how to fill null values with the mean of the 'Age' column:
mean_age = df['Age'].mean()
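# Assign the result back: fillna(inplace=True) on a selected column may not update df in newer pandas versions
df['Age'] = df['Age'].fillna(mean_age)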
print(df)
The output will be:
      Name   Age         City
0    Alice  24.0     New York
1      Bob  23.0  Los Angeles
2  Charlie  22.0         None
3    David  23.0      Chicago
4      Eve  23.0      Houston
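In the sample data the mean and the median of 'Age' are both 23.0, so the choice makes no difference. To illustrate when it does, here is a small sketch using a hypothetical series of ages with one outlier; the values are invented for this example and are not part of the tutorial's dataset.

```python
import pandas as pd

# Hypothetical ages with one outlier (90) and one missing value
ages = pd.Series([22, 23, 24, 90, None])

mean_age = ages.mean()      # pulled upward by the outlier
median_age = ages.median()  # less sensitive to the outlier

filled_with_mean = ages.fillna(mean_age)
filled_with_median = ages.fillna(median_age)

print(mean_age, median_age)
print(filled_with_median)
```

Both mean() and median() skip null values by default, so the statistics are computed from the four known ages only.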
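Interpolation estimates a missing value from its neighbors. The mean-fill step above replaces the nulls in 'Age', so we first rebuild the DataFrame from the original data and then apply linear interpolation:
df = pd.DataFrame(data)
df['Age'] = df['Age'].interpolate()
print(df)
The output will be:
      Name   Age         City
0    Alice  24.0     New York
1      Bob  23.0  Los Angeles
2  Charlie  22.0         None
3    David  23.0      Chicago
4      Eve  23.0      Houston
Bob's age is the midpoint of its neighbors, 24 and 22. Eve has no following value to interpolate toward, so pandas carries the last known value (23.0) forward.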
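By default, linear interpolation also fills a trailing null by repeating the last valid value, which is extrapolation rather than interpolation. If that is not wanted, one option is the limit_area parameter of interpolate(); the sketch below uses limit_area='inside' so that only nulls surrounded by valid values are filled.

```python
import pandas as pd

ages = pd.Series([24, None, 22, 23, None])

# Fill only nulls that have a valid value on both sides;
# leading and trailing nulls are left as NaN
inside_only = ages.interpolate(limit_area='inside')

print(inside_only)
```

The gap between 24 and 22 is filled with 23.0, while the final entry stays null and can be handled by another method.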
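For categorical data, the mode (most frequent value) is often used to fill null values:
mode_city = df['City'].mode()[0]
# Assign the result back instead of calling fillna(inplace=True) on a selected column
df['City'] = df['City'].fillna(mode_city)
print(df)
The output will be:
      Name   Age         City
0    Alice  24.0     New York
1      Bob  23.0  Los Angeles
2  Charlie  22.0      Chicago
3    David  23.0      Chicago
4      Eve  23.0      Houston
In this sample every city appears exactly once, so mode() returns all four cities in sorted order and [0] selects 'Chicago'.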
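When several values are tied for most frequent, mode() returns all of them in sorted order, and taking [0] picks one more or less arbitrarily. The sketch below checks for a tie and falls back to a placeholder; both the fallback policy and the 'Unknown' label are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'City': ['New York', 'Los Angeles', None, 'Chicago', 'Houston'],
})

# mode() ignores nulls by default and returns every tied value, sorted
modes = df['City'].mode()

if len(modes) == 1:
    fill_value = modes[0]
else:
    # Tie: no single most frequent city, so use a hypothetical placeholder
    fill_value = 'Unknown'

df['City'] = df['City'].fillna(fill_value)
print(df)
```

With the sample data all four cities are tied, so Charlie's city becomes 'Unknown' instead of an arbitrarily chosen city.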