Cleaning and Imputing Null Values in Python

This tutorial will guide you through the process of cleaning and imputing null values in a dataset using Python. We'll use the pandas library to handle dataframes.

1. Importing Necessary Libraries

        
import pandas as pd
        
    

2. Creating a Sample DataFrame

Let's start by creating a sample DataFrame with some null values:

        
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [24, None, 22, 23, None],
    'City': ['New York', 'Los Angeles', None, 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
print(df)
        
    

The output of the DataFrame will be:

        
      Name   Age         City
0    Alice  24.0     New York
1      Bob   NaN  Los Angeles
2  Charlie  22.0         None
3    David  23.0      Chicago
4      Eve   NaN      Houston
        
    

3. Identifying Null Values

We can identify null values in the DataFrame using the isnull() and sum() methods:

        
print(df.isnull().sum())
        
    

The output will show the number of null values in each column:

        
Name     0
Age      2
City     1
dtype: int64
        
    
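Beyond raw counts, it is often useful to see the share of missing values per column and which rows are affected. The following is a minimal sketch that rebuilds the sample DataFrame so it runs on its own; the variable names missing_share and rows_with_nulls are illustrative choices, not part of the original tutorial.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [24, None, 22, 23, None],
    'City': ['New York', 'Los Angeles', None, 'Chicago', 'Houston'],
})

# Fraction of missing values per column (the mean of a boolean mask)
missing_share = df.isnull().mean()
print(missing_share)

# Rows that contain at least one null value
rows_with_nulls = df[df.isnull().any(axis=1)]
print(rows_with_nulls)
```

A high share of missing values in a column is usually a signal to think carefully before imputing it.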

4. Dropping Rows with Null Values

If we want to drop rows that contain any null values, we can use the dropna() method:

        
df_dropped = df.dropna()
print(df_dropped)
        
    

The output will be:

        
    Name   Age      City
0  Alice  24.0  New York
3  David  23.0   Chicago
        
    
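Dropping every row that has any null can discard more data than necessary. As a hedged sketch, the subset and how parameters of dropna() narrow the rule; the names df_age_known and df_not_empty are hypothetical, and the DataFrame is rebuilt here so the example is self-contained.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [24, None, 22, 23, None],
    'City': ['New York', 'Los Angeles', None, 'Chicago', 'Houston'],
})

# Drop a row only when 'Age' is missing; a missing 'City' is tolerated
df_age_known = df.dropna(subset=['Age'])
print(df_age_known)

# Drop a row only when every value in it is missing (none qualify here)
df_not_empty = df.dropna(how='all')
print(df_not_empty)
```

Both calls return new DataFrames and leave df unchanged.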

5. Filling Null Values with a Specific Value

To fill null values with a specific value, use the fillna() method. For example, we can fill null values with 0. Because a single value is applied to every column, the missing City is replaced with 0 as well:

        
df_filled = df.fillna(0)
print(df_filled)
        
    

The output will be:

        
      Name   Age         City
0    Alice  24.0     New York
1      Bob   0.0  Los Angeles
2  Charlie  22.0            0
3    David  23.0      Chicago
4      Eve   0.0      Houston
        
    
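A single fill value rarely suits both numeric and text columns. fillna() also accepts a dictionary that maps column names to fill values, sketched below. The placeholder 'Unknown' is an assumption chosen for illustration, and the DataFrame is rebuilt so the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [24, None, 22, 23, None],
    'City': ['New York', 'Los Angeles', None, 'Chicago', 'Houston'],
})

# Use a different fill value for each column
df_filled = df.fillna({'Age': 0, 'City': 'Unknown'})
print(df_filled)
```

Columns that are not listed in the dictionary are left untouched.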

6. Filling Null Values with the Mean/Median/Mode

We can also fill null values with the mean, median, or mode of the column. Here is how to fill null values with the mean of the 'Age' column:

        
mean_age = df['Age'].mean()
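# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df['Age'] = df['Age'].fillna(mean_age)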
print(df)
        
    

The output will be:

        
      Name   Age         City
0    Alice  24.0     New York
1      Bob  23.0  Los Angeles
2  Charlie  22.0         None
3    David  23.0      Chicago
4      Eve  23.0      Houston
        
    
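The median follows the same pattern as the mean and is less sensitive to extreme values. The sketch below uses a standalone Series with one added outlier (90), which is a hypothetical value that does not appear in the tutorial's data, to show how the two statistics diverge.

```python
import pandas as pd

# Sample ages plus one hypothetical outlier (90) for illustration
ages = pd.Series([24, None, 22, 23, None, 90], name='Age')

mean_age = ages.mean()      # pulled upward by the outlier
median_age = ages.median()  # stays close to the typical age
print(mean_age, median_age)

# Fill the gaps with the median instead of the mean
ages_filled = ages.fillna(median_age)
print(ages_filled)
```

Both mean() and median() skip null values by default, so they can be computed before any filling takes place.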

7. Interpolating Null Values

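Interpolation can be used to estimate missing values. The previous step already filled the 'Age' column, so the example below starts again from the original ages before applying linear interpolation:

        
# Rebuild the original ages, because the previous step already filled the gaps
ages = pd.Series(data['Age'], dtype='float64')
df['Age'] = ages.interpolate()
print(df)
        
    

Bob's missing age becomes 23.0, the midpoint between 24 and 22. Eve's value has no later neighbor, so pandas carries the last known age (23.0) forward. The output will be:

        
      Name   Age         City
0    Alice  24.0     New York
1      Bob  23.0  Los Angeles
2  Charlie  22.0         None
3    David  23.0      Chicago
4      Eve  23.0      Houston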
        
    
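By default, linear interpolation also fills a trailing gap by repeating the last known value, which is extrapolation rather than interpolation. As a hedged sketch, the limit_area parameter of interpolate() restricts filling to gaps that have known values on both sides. The Series below mirrors the tutorial's ages and is standalone.

```python
import pandas as pd

ages = pd.Series([24, None, 22, 23, None], dtype='float64')

# Default behavior: the trailing gap is filled with the last known value
default_fill = ages.interpolate()
print(default_fill.tolist())

# Fill only gaps surrounded by known values; the trailing gap stays null
inside_only = ages.interpolate(limit_area='inside')
print(inside_only.tolist())
```

Leaving the trailing value null makes it explicit that no later observation was available to interpolate toward.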

8. Filling Null Values in Categorical Data

For categorical data, the mode (most frequent value) is often used to fill null values:

        
mode_city = df['City'].mode()[0]
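# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df['City'] = df['City'].fillna(mode_city)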
print(df)
        
    

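Every city in this sample appears exactly once, so mode() returns all four cities in sorted order and [0] picks the alphabetically first one, 'Chicago'. The output will be:

        
      Name   Age         City
0    Alice  24.0     New York
1      Bob  23.0  Los Angeles
2  Charlie  22.0      Chicago
3    David  23.0      Chicago
4      Eve  23.0      Houston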
        
    
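When several categories are tied, picking the first mode is arbitrary. One possible approach, sketched below, is to use the mode only when it is unique and otherwise fall back to an explicit placeholder. The 'Unknown' label and the variable names are assumptions made for this illustration.

```python
import pandas as pd

cities = pd.Series(['New York', 'Los Angeles', None, 'Chicago', 'Houston'])

# mode() ignores nulls and returns every tied value in sorted order
modes = cities.mode()
print(modes.tolist())

# Use the mode only if it is unique; otherwise mark the value as unknown
fill_value = modes[0] if len(modes) == 1 else 'Unknown'
cities_filled = cities.fillna(fill_value)
print(cities_filled.tolist())
```

A placeholder category keeps the fact that the value was missing visible in later analysis, instead of silently inflating one city's count.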

Questions & Answers

Q1: Why is it important to handle null values in a dataset?
A1: Handling null values is crucial because they can lead to errors in data analysis and modeling. Null values can distort statistical analyses and machine learning models, resulting in inaccurate predictions and insights.
Q2: What are some common methods to handle null values?
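A2: Common methods to handle null values include dropping the affected rows with dropna(), filling them with a fixed value via fillna(), filling them with the column mean, median, or mode, and interpolating between neighboring values.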
Q3: When should you drop rows with null values instead of filling them?
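A3: Dropping rows with null values is generally appropriate when only a small share of rows is affected and the remaining data is still sufficient for the analysis.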
Q4: How does interpolation work for imputing null values?
A4: Interpolation estimates missing values based on the values before and after the null value. For example, linear interpolation calculates a missing value by assuming it lies on a straight line between known values. This method is particularly useful for time series or ordered data.
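For time series with unevenly spaced timestamps, the straight line can be drawn over positions or over actual elapsed time. The sketch below contrasts method='linear' with method='time' on a small Series. The dates and readings are made-up values for illustration only.

```python
import pandas as pd

# Hypothetical readings: the gap sits 1 day after the first value
# and 3 days before the last one
idx = pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-05'])
readings = pd.Series([10.0, None, 40.0], index=idx)

# 'linear' treats the rows as equally spaced: halfway between 10 and 40
by_position = readings.interpolate(method='linear')
print(by_position)

# 'time' weights by elapsed time: 10 + (40 - 10) * (1 day / 4 days)
by_time = readings.interpolate(method='time')
print(by_time)
```

With evenly spaced timestamps the two methods agree; they differ only when the gaps between observations are irregular.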