This tutorial covers essential statistical concepts used in data science. We will discuss mean, median, mode, variance, and standard deviation. We will use a dataframe to demonstrate how to calculate these statistics and interpret them using pandas in Python.
First, let's create a sample dataframe:
    import pandas as pd

    # Creating a dataframe
    data = {'Values': [12, 15, 13, 10, 8, 12, 14, 9, 11, 12]}
    df = pd.DataFrame(data)
    print(df)
Index | Values |
---|---|
0 | 12 |
1 | 15 |
2 | 13 |
3 | 10 |
4 | 8 |
5 | 12 |
6 | 14 |
7 | 9 |
8 | 11 |
9 | 12 |
To sort the dataframe in ascending order by the values, use the following code:
    # Sorting keeps each row's original index label, as shown in the table below
    sorted_df = df.sort_values(by='Values')
    print(sorted_df)
Index | Values |
---|---|
4 | 8 |
7 | 9 |
3 | 10 |
8 | 11 |
0 | 12 |
5 | 12 |
9 | 12 |
2 | 13 |
6 | 14 |
1 | 15 |
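The Index column in the sorted table still shows each row's original label. As a small sketch of the two options, the snippet below sorts once while keeping those labels and once while renumbering them with `reset_index(drop=True)`. The `kind='mergesort'` argument is an addition for this illustration, not part of the tutorial's code; it requests a stable sort so that the three tied 12s keep their original relative order.

```python
import pandas as pd

df = pd.DataFrame({'Values': [12, 15, 13, 10, 8, 12, 14, 9, 11, 12]})

# Stable sort: rows with equal values keep their original relative order
sorted_keep = df.sort_values(by='Values', kind='mergesort')

# Same order of values, but the index is renumbered from 0
sorted_reset = sorted_keep.reset_index(drop=True)

print(sorted_keep.index.tolist())
print(sorted_reset.index.tolist())
```

Keeping the original labels makes it easy to trace a sorted row back to its position in `df`; resetting them is convenient when the position in the sorted order is what matters.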
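The mean is the average of all the numbers in a dataset. It is calculated using the formula:
$$\mu = \frac{1}{n} \sum_{i=1}^{n} X_i$$
Where $X_i$ is the $i$-th value in the dataset and $n$ is the number of values.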
    mean_value = df['Values'].mean()
    print(f'Mean: {mean_value}')
In our example, the mean is:
Mean = 11.6
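To make the arithmetic behind that result explicit, here is a minimal sketch in plain Python that adds up the values and divides by the count. The variable names `total`, `n`, and `manual_mean` are chosen for this illustration only.

```python
values = [12, 15, 13, 10, 8, 12, 14, 9, 11, 12]

total = sum(values)        # sum of all values
n = len(values)            # number of values
manual_mean = total / n    # sum divided by count

print(f'Sum: {total}, n: {n}, Mean: {manual_mean}')
```

The sum is 116 and there are 10 values, so the result agrees with the 11.6 that pandas reports.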
The median is the middle value when a dataset is ordered from least to greatest. If there is an even number of observations, the median is the average of the two middle numbers.
Here is the sorted dataframe for clarity:
    sorted_df = df.sort_values(by='Values')
    print(sorted_df)
Index | Values |
---|---|
4 | 8 |
7 | 9 |
3 | 10 |
8 | 11 |
0 | 12 |
5 | 12 |
9 | 12 |
2 | 13 |
6 | 14 |
1 | 15 |
    median_value = df['Values'].median()
    print(f'Median: {median_value}')
In our example, the median is:
Median = 12.0
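The even-count rule described above can be sketched by hand as follows. The helper `median_of` is a hypothetical name for this illustration and is not a pandas function.

```python
def median_of(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value
        return float(ordered[mid])
    # Even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


values = [12, 15, 13, 10, 8, 12, 14, 9, 11, 12]
print(f'Median: {median_of(values)}')
```

With 10 values, the two middle entries of the sorted data are both 12, so their average is 12.0.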
The mode is the value that appears most frequently in a dataset. In the sorted dataframe above, 12 appears three times, more often than any other value.
    mode_value = df['Values'].mode()[0]
    print(f'Mode: {mode_value}')
In our example, the mode is:
Mode = 12
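The `[0]` in the code above is needed because `mode()` returns a Series: a dataset can have several values tied for most frequent. The sketch below counts occurrences with `value_counts()` and then shows a tie. The `tied` data is made up for this illustration and does not come from the tutorial's dataset.

```python
import pandas as pd

values = pd.Series([12, 15, 13, 10, 8, 12, 14, 9, 11, 12], name='Values')

# Frequency of each distinct value, most frequent first
counts = values.value_counts()
print(counts.head(3))

# Hypothetical data in which 1 and 2 both appear twice
tied = pd.Series([1, 1, 2, 2, 3])
print(tied.mode().tolist())
```

Taking `mode()[0]` on tied data would silently keep only the smallest of the tied values, so it is worth checking the length of the result when ties are possible.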
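Variance measures the spread of the data points. The sample variance is calculated using the formula:
$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \mu)^2$$
Where $X_i$ is the $i$-th value, $\mu$ is the mean, and $n$ is the number of values.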
Let's calculate the variance step by step:
1. Subtract the mean from each value (Xi - μ).
2. Square the result of each subtraction to get (Xi - μ)².
3. Sum all the squared differences.
4. Divide the sum by the total number of values less 1 which is (n-1).
    # Deviation of each value from the mean
    df['(Xi - μ)'] = df['Values'] - mean_value
    # Squared deviations
    df['(Xi - μ)^2'] = df['(Xi - μ)'] ** 2
    sum_of_squares = df['(Xi - μ)^2'].sum()
    # Sample variance: divide by n - 1, as in step 4
    variance_value = sum_of_squares / (len(df) - 1)
    print(f'Sum of Squares: {sum_of_squares}')
    print(f'Variance: {variance_value}')
Index | Values | (Xi - μ) | (Xi - μ)² |
---|---|---|---|
0 | 12 | 0.4 | 0.16 |
1 | 15 | 3.4 | 11.56 |
2 | 13 | 1.4 | 1.96 |
3 | 10 | -1.6 | 2.56 |
4 | 8 | -3.6 | 12.96 |
5 | 12 | 0.4 | 0.16 |
6 | 14 | 2.4 | 5.76 |
7 | 9 | -2.6 | 6.76 |
8 | 11 | -0.6 | 0.36 |
9 | 12 | 0.4 | 0.16 |
The variance is:
Sample Variance = 42.4 / 9 ≈ 4.71
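As a cross-check on the manual steps, pandas can compute the variance directly. The sketch below relies on the `ddof` parameter of `Series.var()`: the default `ddof=1` divides by n - 1 (sample variance), while `ddof=0` divides by n (population variance).

```python
import pandas as pd

df = pd.DataFrame({'Values': [12, 15, 13, 10, 8, 12, 14, 9, 11, 12]})

sample_var = df['Values'].var()            # ddof=1 by default: divide by n - 1
population_var = df['Values'].var(ddof=0)  # divide by n

print(f'Sample variance: {sample_var:.2f}')
print(f'Population variance: {population_var:.2f}')
```

For this dataset the two results are 42.4 / 9 and 42.4 / 10, which shows how much the choice of divisor matters when the sample is small.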
Standard deviation is the square root of the variance. It is expressed in the same units as the data and describes how far values typically lie from the mean.
    import math

    std_dev_value = math.sqrt(variance_value)
    print(f'Standard Deviation: {std_dev_value}')
In our example, the standard deviation is:
Standard Deviation = 2.17
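The same figure can be obtained from library calls, with one caveat that often causes confusion: pandas and NumPy use different defaults. The sketch below assumes NumPy is installed alongside pandas; `Series.std()` defaults to `ddof=1`, whereas `numpy.std` defaults to `ddof=0`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Values': [12, 15, 13, 10, 8, 12, 14, 9, 11, 12]})
array = df['Values'].to_numpy()

pandas_std = df['Values'].std()            # sample standard deviation (ddof=1)
numpy_std = np.std(array)                  # population standard deviation (ddof=0)
numpy_sample_std = np.std(array, ddof=1)   # matches the pandas default

print(f'pandas std: {pandas_std:.4f}')
print(f'numpy std (default): {numpy_std:.4f}')
print(f'numpy std (ddof=1): {numpy_sample_std:.4f}')
```

When comparing results across the two libraries, passing `ddof` explicitly avoids a silent mismatch.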
The pandas `describe()` function provides a quick overview of the most common statistics for a dataset:
    description = df['Values'].describe()
    print(description)
In our example, the describe output is:
    count    10.000000
    mean     11.600000
    std       2.170509
    min       8.000000
    25%      10.250000
    50%      12.000000
    75%      12.750000
    max      15.000000
    Name: Values, dtype: float64
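The 25% and 75% rows are not values from the dataset because pandas interpolates linearly between neighbouring sorted values by default. The sketch below reproduces that calculation by hand. `linear_quantile` is a hypothetical helper written for this illustration and is not part of pandas.

```python
import math

import pandas as pd


def linear_quantile(values, q):
    """Quantile by linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)      # fractional position in the sorted data
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


values = [12, 15, 13, 10, 8, 12, 14, 9, 11, 12]

for q in (0.25, 0.5, 0.75):
    by_hand = linear_quantile(values, q)
    by_pandas = pd.Series(values).quantile(q)
    print(f'q={q}: by hand {by_hand}, pandas {by_pandas}')
```

For q = 0.25 the position is 0.25 × 9 = 2.25, which lies a quarter of the way between the third and fourth sorted values (10 and 11), giving 10.25.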
A1: The mean is the average of all values, calculated by dividing the sum of values by the count. The median is the middle value when the data is ordered. The median is less affected by outliers compared to the mean.
A2: Variance measures the dispersion of data points around the mean. It helps us understand the spread and variability in the dataset. A higher variance indicates more spread out data points.
A3: The standard deviation is the square root of the variance and provides a measure of the average distance of data points from the mean. A smaller standard deviation means the data points are closer to the mean, while a larger one indicates more spread.
A4: The `describe()` function provides a summary of statistics for the data, including count, mean, standard deviation, minimum, maximum, and percentiles. It helps in quickly understanding the distribution and spread of the data.
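To illustrate the point in A1 that the median is less affected by outliers, the sketch below appends one extreme value to the tutorial's data and compares both statistics before and after. The outlier value 100 is made up for this demonstration.

```python
import pandas as pd

values = pd.Series([12, 15, 13, 10, 8, 12, 14, 9, 11, 12])

# Hypothetical outlier appended to the original data
with_outlier = pd.concat([values, pd.Series([100])], ignore_index=True)

mean_before, median_before = values.mean(), values.median()
mean_after, median_after = with_outlier.mean(), with_outlier.median()

print(f'Mean:   {mean_before:.2f} -> {mean_after:.2f}')
print(f'Median: {median_before:.2f} -> {median_after:.2f}')
```

A single extreme value pulls the mean from 11.6 to roughly 19.6, while the median stays at 12.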
In this tutorial, we have covered essential statistical concepts and their calculations using Python's pandas library. Understanding these statistics is crucial for analyzing and interpreting data effectively in data science.