Statistics Tutorial for Data Science

1. Introduction

This tutorial covers essential descriptive statistics used in data science: mean, median, mode, variance, and standard deviation. A small pandas dataframe in Python is used to show how to calculate and interpret each one.

2. Dataframe Creation

First, let's create a sample dataframe:

import pandas as pd

# Creating a dataframe
data = {'Values': [12, 15, 13, 10, 8, 12, 14, 9, 11, 12]}
df = pd.DataFrame(data)
print(df)

Index	Values
0	12
1	15
2	13
3	10
4	8
5	12
6	14
7	9
8	11
9	12

3. Sorted Dataframe

To sort the dataframe in ascending order by the values, use the following code:

# Sort by value; the original index labels are kept so each row can be traced back
sorted_df = df.sort_values(by='Values')
print(sorted_df)

Index	Values
4	8
7	9
3	10
8	11
0	12
5	12
9	12
2	13
6	14
1	15

4. Mean

The mean is the average of all the numbers in a dataset. It is calculated using the formula:

Mean (μ) = ∑ X_i / N

Where:

X_i = individual value
N = total number of values

mean_value = df['Values'].mean()
print(f'Mean: {mean_value}')

In our example, the mean is:

Mean = 11.6
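
As a cross-check on the formula, the mean can also be computed by hand as the sum divided by the count. This is a minimal sketch that rebuilds the same ten values as a Series; the names `total`, `n`, and `manual_mean` are illustrative only.

```python
import pandas as pd

# Same ten values as the tutorial's dataframe
values = pd.Series([12, 15, 13, 10, 8, 12, 14, 9, 11, 12], name="Values")

# Apply the formula directly: sum of all values divided by the count
total = values.sum()        # sum of X_i
n = len(values)             # N
manual_mean = total / n

print(f"Sum: {total}, N: {n}, Mean: {manual_mean}")
print(f"pandas mean(): {values.mean()}")
```

Both approaches should agree, which confirms that `mean()` implements the formula above.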

5. Median

The median is the middle value when a dataset is ordered from least to greatest. If there is an even number of observations, the median is the average of the two middle numbers.

Here is the sorted dataframe for clarity:

sorted_df = df.sort_values(by='Values')
print(sorted_df)

Index	Values
4	8
7	9
3	10
8	11
0	12
5	12
9	12
2	13
6	14
1	15

median_value = df['Values'].median()
print(f'Median: {median_value}')

In our example, the median is:

Median = 12.0
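
The even-count rule can be sketched by hand: sort the values, find the two middle positions, and average them. The variable names below are illustrative, and the odd-count branch is included only for completeness.

```python
import pandas as pd

values = pd.Series([12, 15, 13, 10, 8, 12, 14, 9, 11, 12])

ordered = sorted(values.tolist())   # [8, 9, 10, 11, 12, 12, 12, 13, 14, 15]
n = len(ordered)
mid = n // 2

if n % 2 == 0:
    # Even count: average the two middle values
    manual_median = (ordered[mid - 1] + ordered[mid]) / 2
else:
    # Odd count: take the single middle value
    manual_median = float(ordered[mid])

print(f"Middle values: {ordered[mid - 1]} and {ordered[mid]}")
print(f"Manual median: {manual_median}")
```

With ten values, the two middle values are the 5th and 6th in sorted order, both 12, so the median is 12.0.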

6. Mode

The mode is the value that appears most frequently in a dataset. Because several values can tie for the highest frequency, the pandas `mode()` method returns a Series; `[0]` selects the first (smallest) mode.

mode_value = df['Values'].mode()[0]
print(f'Mode: {mode_value}')

In our example, the mode is:

Mode = 12
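
To see why 12 is the mode, it can help to look at the frequency of each value. The sketch below uses `value_counts()` for that, and adds a small made-up Series (`tied`, not part of the tutorial data) to show what `mode()` returns when two values tie.

```python
import pandas as pd

values = pd.Series([12, 15, 13, 10, 8, 12, 14, 9, 11, 12])

# Frequency of each distinct value
counts = values.value_counts()
top_count = counts.max()

# Every value that reaches the highest frequency is a mode
modes = sorted(counts[counts == top_count].index.tolist())
print(f"Highest frequency: {top_count}, mode(s): {modes}")

# Hypothetical data with a tie: mode() returns both values
tied = pd.Series([1, 1, 2, 2, 3])
print(tied.mode().tolist())
```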

7. Variance

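Variance measures how far the data points are spread around the mean. Because the 10 values are treated as a sample, we use the sample variance, which divides by n-1 instead of n (this is also the pandas default). It is calculated using the formula: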

Variance (σ²) = ∑ (X_i - μ)² / (n-1)

Where:

μ = mean of the data
X_i = individual value
n = total number of values

Let's calculate the variance step by step:

Step-by-Step Calculation for Sample Variance

1. Subtract the mean from each value (X_i - μ).

2. Square the result of each subtraction to get (X_i - μ)².

3. Sum all the squared differences.

4. Divide the sum by the number of values minus 1, that is, (n-1).

df['(Xi - μ)'] = df['Values'] - mean_value
df['(Xi - μ)^2'] = df['(Xi - μ)'] ** 2
sum_of_squares = df['(Xi - μ)^2'].sum()
variance_value = sum_of_squares / (len(df) - 1)  # sample variance: divide by n-1
print(f'Sum of Squares: {sum_of_squares}')
print(f'Variance: {variance_value}')

Index	Values	(Xi - μ)	(Xi - μ)²
0	12	0.4	0.16
1	15	3.4	11.56
2	13	1.4	1.96
3	10	-1.6	2.56
4	8	-3.6	12.96
5	12	0.4	0.16
6	14	2.4	5.76
7	9	-2.6	6.76
8	11	-0.6	0.36
9	12	0.4	0.16

The variance is:

Sample Variance = 42.4/9 = 4.71
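
The step-by-step result can be checked against the built-in pandas `var()` method. This is a minimal sketch: `var()` uses `ddof=1` (divide by n-1) by default, and passing `ddof=0` gives the population variance (divide by n) for comparison.

```python
import pandas as pd

values = pd.Series([12, 15, 13, 10, 8, 12, 14, 9, 11, 12])

sample_var = values.var()             # ddof=1 by default: divides by n-1
population_var = values.var(ddof=0)   # divides by n

print(f"Sample variance (n-1):   {sample_var:.2f}")
print(f"Population variance (n): {population_var:.2f}")
```

The two results differ only in the denominator: 42.4/9 for the sample variance and 42.4/10 for the population variance.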

8. Standard Deviation

Standard deviation is the square root of the variance. It measures the typical distance of the data points from the mean and is expressed in the same units as the data.

Standard Deviation (σ) = √σ²

import math
std_dev_value = math.sqrt(variance_value)
print(f'Standard Deviation: {std_dev_value}')

In our example, the standard deviation is:

Standard Deviation = 2.17
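
The same value can be obtained directly with the pandas `std()` method, which, like `var()`, divides by n-1 by default. The sketch below also shows the population version (`ddof=0`) for comparison.

```python
import math

import pandas as pd

values = pd.Series([12, 15, 13, 10, 8, 12, 14, 9, 11, 12])

sample_std = values.std()             # square root of the sample variance
population_std = values.std(ddof=0)   # square root of the population variance

print(f"Sample standard deviation:     {sample_std:.2f}")
print(f"Population standard deviation: {population_std:.2f}")

# std() should match the square root of var()
print(math.isclose(sample_std, math.sqrt(values.var())))
```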

9. Using describe()

The pandas `describe()` method, available on both Series and DataFrame objects, provides a quick overview of the most common statistics for a dataset:

description = df['Values'].describe()
print(description)

In our example, the describe output is:

count	10.000000
mean	11.600000
std	2.170509
min	8.000000
25%	10.250000
50%	12.000000
75%	12.750000
max	15.000000
Name: Values, dtype: float64

10. Explanation of .describe() Output

count: The number of non-null observations in the dataset. In our case, there are 10 values.
mean: The average of all the values. It is calculated as the sum of all values divided by the count. Here, the mean is 11.6.
std: The sample standard deviation (pandas divides by n-1 by default), which measures the dispersion of the values from the mean. Here, it is 2.17.
min: The smallest value in the dataset. In this case, it is 8.
25%: The 25th percentile, or the value below which 25% of the observations fall. Here, it is 10.25.
50%: The median or the 50th percentile, which is the middle value of the dataset when ordered. Here, it is 12.0.
75%: The 75th percentile, or the value below which 75% of the observations fall. In this case, it is 12.75.
max: The largest value in the dataset. Here, it is 15.
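
The 25% and 75% values are not actual data points because pandas interpolates linearly between the two nearest sorted values by default. The sketch below reproduces that calculation by hand; `linear_percentile` is a hypothetical helper written for this illustration, not a pandas function.

```python
import pandas as pd

values = pd.Series([12, 15, 13, 10, 8, 12, 14, 9, 11, 12])


def linear_percentile(sorted_values, q):
    """Percentile by linear interpolation between the two nearest sorted values."""
    position = q * (len(sorted_values) - 1)          # fractional index into the sorted list
    lower = int(position)                            # index just below the position
    upper = min(lower + 1, len(sorted_values) - 1)   # index just above, capped at the end
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


ordered = sorted(values.tolist())
q25 = linear_percentile(ordered, 0.25)
q75 = linear_percentile(ordered, 0.75)

print(f"25th percentile: {q25} (pandas: {values.quantile(0.25)})")
print(f"75th percentile: {q75} (pandas: {values.quantile(0.75)})")
```

For the 25th percentile, the fractional position is 0.25 × 9 = 2.25, which lies a quarter of the way between the sorted values 10 and 11, giving 10.25.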

11. Frequently Asked Questions (FAQs)

Q1: What is the difference between mean and median?

A1: The mean is the average of all values, calculated by dividing the sum of values by the count. The median is the middle value when the data is ordered. The median is less affected by outliers compared to the mean.
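
A short sketch of the outlier effect, using the tutorial's values plus one made-up extreme value (100) that is not part of the original data:

```python
import pandas as pd

values = pd.Series([12, 15, 13, 10, 8, 12, 14, 9, 11, 12])

# Append a single hypothetical outlier
with_outlier = pd.concat([values, pd.Series([100])], ignore_index=True)

print(f"Original:     mean={values.mean():.2f}, median={values.median()}")
print(f"With outlier: mean={with_outlier.mean():.2f}, median={with_outlier.median()}")
```

The mean is pulled upward by the single large value, while the median stays at 12.0.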

Q2: Why is variance important?

A2: Variance measures the dispersion of data points around the mean. It helps us understand the spread and variability in the dataset. A higher variance indicates more spread out data points.
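
To illustrate, the sketch below compares two small made-up datasets (`tight` and `wide`, chosen only for this example) that share the same mean but differ in spread:

```python
import pandas as pd

# Hypothetical datasets with the same mean (10) but different spread
tight = pd.Series([9, 10, 10, 10, 11])
wide = pd.Series([2, 6, 10, 14, 18])

print(f"tight: mean={tight.mean()}, variance={tight.var()}")
print(f"wide:  mean={wide.mean()}, variance={wide.var()}")
```

The mean alone cannot distinguish the two datasets; the variance can.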

Q3: How do I interpret the standard deviation?

A3: The standard deviation is the square root of the variance and provides a measure of the average distance of data points from the mean. A smaller standard deviation means the data points are closer to the mean, while a larger one indicates more spread.
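
One informal way to read the standard deviation is to count how many values lie within one standard deviation of the mean. This is a sketch of that check for the tutorial's data, not a formal statistical test:

```python
import pandas as pd

values = pd.Series([12, 15, 13, 10, 8, 12, 14, 9, 11, 12])

mean = values.mean()
std = values.std()   # sample standard deviation (n-1)

lower, upper = mean - std, mean + std
within_one_std = values.between(lower, upper)   # boolean mask, bounds inclusive

print(f"Range: {lower:.2f} to {upper:.2f}")
print(f"Values inside the range: {within_one_std.sum()} of {len(values)}")
```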

Q4: What does the 'describe()' function do in pandas?

A4: The `describe()` function provides a summary of statistics for the data, including count, mean, standard deviation, minimum, maximum, and percentiles. It helps in quickly understanding the distribution and spread of the data.
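
The percentiles reported by `describe()` can be changed with its `percentiles` parameter. A minimal sketch that requests the 10th, 50th, and 90th percentiles instead of the defaults:

```python
import pandas as pd

values = pd.Series([12, 15, 13, 10, 8, 12, 14, 9, 11, 12], name="Values")

# Request the 10th, 50th, and 90th percentiles explicitly
summary = values.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```

The result is itself a Series, so individual statistics can be read by label, for example `summary["mean"]` or `summary["90%"]`.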

12. Conclusion

In this tutorial, we have covered essential statistical concepts and their calculations using Python's pandas library. Understanding these statistics is crucial for analyzing and interpreting data effectively in data science.