Hypothesis Testing Tutorial

This tutorial covers various hypothesis testing methods using the Iris dataset. We will go through the following tests:

Z-Test
T-Test
Chi-Square Test
ANOVA (Analysis of Variance)

Dataset

import pandas as pd
# Load the Iris dataset
data = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
# Show the head of the dataset
print(data.head())

Z-Test

One-sample Z-test

The one-sample Z-test tests if the sample mean (x̄) is significantly different from a known population mean (μ).

Z = (x̄ - μ) / (σ / √n)

Where:

x̄ = sample mean
μ = population mean
σ = population standard deviation
n = sample size

from scipy.stats import norm
import numpy as np
import pandas as pd

# Load the Iris dataset
data = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

# Sample data: sepal length of Iris-setosa
data_setosa = data[data['species'] == 'setosa']['sepal_length']

# Known population parameters
mu = 5.0
sigma = 0.5

# Calculate the sample mean and sample size
sample_mean = np.mean(data_setosa)
n = len(data_setosa)

# Calculate the Z-statistic
z_stat = (sample_mean - mu) / (sigma / np.sqrt(n))

# Calculate the p-value
p_value = 2 * (1 - norm.cdf(abs(z_stat)))

print(f"Z-statistic: {z_stat}, P-value: {p_value}")

Output:

Z-statistic: 0.08485281374238891, P-value: 0.932378405606689

In our example, the sample mean is the average sepal length of the Iris-setosa species in the dataset. The population mean (μ) is given as 5.0, and the population standard deviation (σ) is taken to be 0.5.

The output of the Z-test provides two key values: the Z-statistic and the p-value.

Z-statistic: 0.08485281374238891

This value represents the number of standard errors (σ/√n) by which the sample mean differs from the population mean.

A Z-statistic close to 0 indicates that the sample mean is very close to the population mean. In this case, the Z-statistic is approximately 0.085, which means the sample mean lies only about 0.085 standard errors above the population mean.

P-value: 0.932378405606689

The p-value indicates the probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true.

A high p-value (typically greater than 0.05) suggests that there is no significant evidence to reject the null hypothesis. In this case, the p-value is approximately 0.932, which is very high. This indicates that the difference between the sample mean and the population mean is not statistically significant.
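
The same two-sided p-value can also be obtained from the survival function `norm.sf`, which is mathematically equal to `1 - norm.cdf` but tends to lose less precision when |Z| is large. The following is a minimal sketch that reuses the Z-statistic printed above; the 0.05 threshold is the conventional significance level and is an assumption, not something the test itself fixes.

```python
from scipy.stats import norm

# Z-statistic copied from the output above
z_stat = 0.08485281374238891
alpha = 0.05  # conventional significance level (assumed)

# Two-sided p-value: probability of |Z| being at least this large under H0
p_value = 2 * norm.sf(abs(z_stat))

# Decision rule: reject H0 only when the p-value falls below alpha
reject_null = p_value < alpha
print(f"P-value: {p_value:.4f}, reject H0: {reject_null}")
```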

Conclusion:

Given the high p-value (0.932), we fail to reject the null hypothesis. This means there is not enough evidence to conclude that the sample mean of the sepal lengths of Iris-setosa is significantly different from the known population mean of 5.0.

Q: What is the null hypothesis for the one-sample Z-test?

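A: The null hypothesis (H₀) states that the mean of the population the sample was drawn from is equal to the specified population mean μ, so any difference between x̄ and μ is due to sampling variation alone.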

T-Test

One-sample T-test

The one-sample T-test tests if the sample mean (x̄) is significantly different from a known population mean (μ) when the population standard deviation is unknown.

T = (x̄ - μ) / (s / √n)

Where:

x̄ = sample mean
μ = population mean
s = sample standard deviation
n = sample size
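
To make the formula concrete, the sketch below evaluates it by hand and cross-checks the result against `scipy.stats.ttest_1samp`. The sample values are made up for illustration and are not taken from the Iris data; the variable names are likewise only illustrative.

```python
import numpy as np
from scipy import stats

# Made-up sample values, for illustration only
sample = np.array([5.2, 4.8, 5.1, 4.7, 5.3, 4.9, 5.0, 5.4])
mu = 5.0  # hypothesized population mean

n = len(sample)
x_bar = sample.mean()
s = sample.std(ddof=1)  # sample standard deviation (n - 1 in the denominator)

# T = (x_bar - mu) / (s / sqrt(n))
t_manual = (x_bar - mu) / (s / np.sqrt(n))

# Cross-check against SciPy's implementation
t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=mu)
print(f"manual T: {t_manual:.6f}, SciPy T: {t_scipy:.6f}")
```

Note the `ddof=1` argument: NumPy's default (`ddof=0`) gives the population standard deviation, which would not match the T-test formula.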

# One-sample T-test (reuses data_setosa and mu from the Z-test example)
from scipy import stats

t_stat, p_value = stats.ttest_1samp(data_setosa, popmean=mu)
print(f"T-statistic: {t_stat}, P-value: {p_value}")

Output:

T-statistic: 0.12036212238318053, P-value: 0.9046884777690936

T-statistic: 0.12036212238318053

Meaning:

The T-statistic represents how many standard errors the sample mean is away from the population mean under the null hypothesis.

Value Interpretation:

A T-statistic of 0.12 is very close to 0, indicating that the sample mean is very close to the population mean. This means that the observed difference between the sample mean and the population mean is very small compared to the variability in the sample data.

P-value: 0.9046884777690936

Meaning:

The p-value represents the probability of observing a T-statistic as extreme as (or more extreme than) the one obtained if the null hypothesis is true.

Value Interpretation:

A p-value of approximately 0.905 is very high. This suggests that the observed T-statistic is consistent with the null hypothesis, and there is a high probability of observing such a result if the null hypothesis were true.
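
The p-value definition above can be checked directly against the t distribution. This sketch assumes a sample size of 50, which matches the 50 setosa rows in the Iris dataset, and reuses the T-statistic from the output; the two-sided p-value is twice the upper-tail probability at |T| with n - 1 degrees of freedom.

```python
from scipy import stats

t_stat = 0.12036212238318053  # T-statistic copied from the output above
n = 50                        # setosa sample size (assumed: 50 rows per species)
df = n - 1                    # degrees of freedom for a one-sample T-test

# Two-sided p-value from the t distribution's survival function
p_value = 2 * stats.t.sf(abs(t_stat), df)
print(f"P-value: {p_value:.4f}")
```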

Null Hypothesis (H₀):

The mean sepal length of the setosa population is equal to the specified population mean (μ = 5.0).

Alternative Hypothesis (H₁):

The mean sepal length of the setosa population is not equal to the specified population mean.

Conclusion:

Given the high p-value (0.905), we fail to reject the null hypothesis. This means that there is not enough evidence to conclude that the sample mean is significantly different from the population mean. In other words, the observed difference is small relative to the sample's variability.

Q: What is the null hypothesis for the one-sample T-test?

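A: The null hypothesis (H₀) states that the mean of the population the sample was drawn from is equal to the specified population mean μ, so any difference between x̄ and μ is due to sampling variation alone.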

Chi-Square Test

Chi-square Test for Independence

The Chi-square test for independence tests if two categorical variables are independent.

χ² = Σ (O_i - E_i)² / E_i

Where:

O_i = observed frequency
E_i = expected frequency

import pandas as pd
from scipy.stats import chi2_contingency

# Load the Iris dataset
data = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

data.head()


sepal_length	sepal_width	petal_length	petal_width	species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

data['Petal Width Category'] = pd.cut(data['petal_width'], bins=[0, 0.5, 1, 1.5, 2.5], labels=['Very Narrow', 'Narrow', 'Medium', 'Wide'])
contingency_table = pd.crosstab(data['Petal Width Category'], data['species'])
print(contingency_table.to_string())

Contingency Table

Petal Width Category	setosa	versicolor	virginica
Very Narrow (0 - 0.5)	49	0	0
Narrow (0.5 - 1)	1	7	0
Medium (1 - 1.5)	0	38	3
Wide (1.5 - 2.5)	0	5	47

# Perform Chi-square test for independence
chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)

print(f"\nChi-square Statistic: {chi2_stat}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}")
print(f"Expected Frequencies Table:\n{pd.DataFrame(expected, index=contingency_table.index, columns=contingency_table.columns)}")

Output:

Chi-square Statistic: 250.95168855534712

P-value: 2.5677074777536407e-51

Degrees of Freedom: 6

Expected Frequencies Table:

Petal Width Category	setosa	versicolor	virginica
Very Narrow	16.333333	16.333333	16.333333
Narrow	2.666667	2.666667	2.666667
Medium	13.666667	13.666667	13.666667
Wide	17.333333	17.333333	17.333333

Chi-square Statistic (250.95):

A large value indicates a strong deviation between observed and expected frequencies.

P-value (2.57e-51):

The very small p-value suggests that the null hypothesis (no association) can be rejected, indicating a significant association between petal width category and species.

Degrees of Freedom (6):

Calculated as (number of rows - 1) * (number of columns - 1).

Expected Frequencies:

These values show what the counts would be if there were no association between the categories. Each expected count is the row total multiplied by the column total, divided by the grand total. The large differences between observed and expected frequencies are what produce the large Chi-square statistic and the strong association between petal width category and species.
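
The expected frequencies, the statistic, and the degrees of freedom can all be reproduced by hand from the observed counts. The sketch below hard-codes the contingency table shown earlier (so it runs without downloading the dataset) and cross-checks the manual result against `chi2_contingency`; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts copied from the contingency table above
# rows: Very Narrow, Narrow, Medium, Wide; columns: setosa, versicolor, virginica
observed = np.array([
    [49, 0, 0],
    [1, 7, 0],
    [0, 38, 3],
    [0, 5, 47],
])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
total = observed.sum()

# Expected count under independence: row total * column total / grand total
expected = row_totals * col_totals / total

# Chi-square statistic: sum over all cells of (O - E)^2 / E
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Cross-check against SciPy
chi2_scipy, p_value, dof_scipy, expected_scipy = chi2_contingency(observed)
print(f"manual chi2: {chi2_manual:.4f}, SciPy chi2: {chi2_scipy:.4f}, dof: {dof}")
```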

Conclusion:

The analysis reveals a strong and statistically significant association between Petal Width Category and species, indicating that the petal width of iris flowers varies significantly between species.

Q: What is the null hypothesis for the Chi-square test for independence?

A: The null hypothesis (H₀) states that the two categorical variables are independent.

ANOVA

One-way ANOVA

The one-way ANOVA tests if there are significant differences between the means of three or more independent groups. In the context of the Iris dataset, a one-way ANOVA (Analysis of Variance) can be used to determine if there are significant differences in the means of a particular measurement (such as sepal length, sepal width, petal length, or petal width) across the three different species of iris (setosa, versicolor, and virginica).

Purpose: To test the null hypothesis that the means of several groups are equal.

Groups: In this context, the groups are the different species of iris (setosa, versicolor, and virginica).

Variable: The measurement of interest, such as sepal length.

Steps to Perform One-way ANOVA on the Iris Dataset

Formulate Hypotheses:

Null Hypothesis (H₀): The means of sepal length for setosa, versicolor, and virginica are equal.

Alternative Hypothesis (H₁): At least one species has a different mean sepal length.

Calculate the ANOVA:

Between-group variability: Variability of the group means around the overall (grand) mean.

Within-group variability: Variability within each group.

ANOVA F-statistic:

Formula: F = MS_between / MS_within

Where:

MS_between = mean square (variability) between groups
MS_within = mean square (variability) within groups

If the F-statistic is large enough that its p-value falls below the chosen significance level, the null hypothesis can be rejected.
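
As a rough illustration of how the two mean squares are built, the sketch below computes them by hand for three small groups and compares the resulting F-statistic with `scipy.stats.f_oneway`. The measurements are made up for illustration and are not Iris values.

```python
import numpy as np
from scipy import stats

# Made-up measurements for three groups (illustration only)
groups = [
    np.array([5.0, 5.2, 4.9, 5.1]),
    np.array([5.9, 6.1, 6.0, 5.8]),
    np.array([6.5, 6.7, 6.4, 6.6]),
]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within

# Cross-check against SciPy
f_scipy, p_value = stats.f_oneway(*groups)
print(f"manual F: {f_manual:.4f}, SciPy F: {f_scipy:.4f}")
```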

# CODE 1
import pandas as pd
import scipy.stats as stats

# Load the Iris dataset
data = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(
    data[data['species'] == 'setosa']['sepal_length'],
    data[data['species'] == 'versicolor']['sepal_length'],
    data[data['species'] == 'virginica']['sepal_length']
)

print(f"F-statistic: {f_statistic}")
print(f"P-value: {p_value}")    

# CODE EXPLANATION 

# Importing Libraries:
# pandas: A powerful data manipulation library in Python.
# scipy.stats: A module that contains a large number of probability distributions as well as a growing library of statistical functions.

# Loading the Dataset:
# pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv'): This function loads the Iris dataset from a URL into a pandas DataFrame called data.

# Performing One-way ANOVA:
# stats.f_oneway(...): This function performs a one-way ANOVA test. It takes the sepal lengths of the three species (setosa, versicolor, and virginica) as input and compares their means.
# data[data['species'] == 'setosa']['sepal_length']: This extracts the sepal lengths of the setosa species.
# data[data['species'] == 'versicolor']['sepal_length']: This extracts the sepal lengths of the versicolor species.
#  data[data['species'] == 'virginica']['sepal_length']: This extracts the sepal lengths of the virginica species.

# Printing the Results:
# print(f"F-statistic: {f_statistic}"): This prints the F-statistic value.
# print(f"P-value: {p_value}"): This prints the p-value.

Output:

F-statistic: 119.26450218450468

P-value: 1.6696691907693826e-31

F-statistic: 119.26: This value indicates the ratio of the variability between the group means to the variability within the groups. A higher value indicates a greater difference between group means relative to the within-group variability.

P-value: 1.669669e-31: Given the very small p-value (much smaller than the typical significance level of 0.05), we reject the null hypothesis. This strong evidence suggests that there are significant differences in the mean sepal lengths among the three species of iris.

# CODE 2
from statsmodels.stats.anova import anova_lm
import statsmodels.api as sm
from statsmodels.formula.api import ols

# One-way ANOVA via an OLS model (reuses the data DataFrame loaded in CODE 1)
model = ols('sepal_length ~ species', data=data).fit()
anova_result = anova_lm(model)
print(anova_result)

#CODE EXPLANATION

# Importing Libraries:
# anova_lm: This function is used to perform ANOVA on a linear model.
# statsmodels.api: This is the main namespace for statsmodels, a library used for statistical modeling.
# ols: This function is used to define the ordinary least squares (OLS) regression model.

# Defining the Model:
# 'sepal_length ~ species': This formula specifies the dependent variable (sepal_length) and the independent variable (species). The tilde (~) separates the dependent variable from the independent variable.
# data=data: This specifies that the data for the model is contained in the data DataFrame.
# ols(...): This function creates an OLS regression model based on the formula and data provided.
# .fit(): This method fits the OLS model to the data, estimating the model parameters.

#Performing ANOVA:
#anova_lm(model): This function performs ANOVA on the fitted OLS model to partition the variability in the data into components attributable to the independent variable (species) and residual error.

#Printing the Results:
#print(anova_result): This command outputs the results of the ANOVA, which includes the degrees of freedom (df), sum of squares (sum_sq), mean squares (mean_sq), F-statistic, and p-value.

Output:

	df	sum_sq	mean_sq	F	PR(>F)
species	2.0	63.212133	31.606067	119.264502	1.669669e-31
Residual	147.0	38.956200	0.265008	NaN	NaN

Degrees of Freedom (df):

For the species (between-group variability), the degrees of freedom are calculated as the number of groups minus one (3 species - 1 = 2).

For the Residual (within-group variability), the degrees of freedom are the total number of observations minus the number of groups (150 - 3 = 147).

Sum of Squares (sum_sq):

The sum_sq for species (63.212133) represents the total variation in sepal length that is explained by the differences between species.

The sum_sq for Residual (38.956200) represents the variation in sepal length within species, that is, the variation not explained by species.

Mean Square (mean_sq):

The mean_sq for species (31.606067) is calculated as the sum of squares divided by the degrees of freedom for species (63.212133 / 2).

The mean_sq for residual (0.265008) is calculated as the sum of squares divided by the degrees of freedom for residuals (38.956200 / 147).

F-statistic:

The F-statistic (119.264502) is the ratio of the mean square for species to the mean square for residuals (31.606067 / 0.265008). Such a large value indicates that the variation between the species means is far greater than the variation within species.

P-value:

The p-value (1.669669e-31) is extremely small, indicating that the probability of obtaining an F-statistic as extreme as the one observed, assuming the null hypothesis is true, is very low. This strong evidence leads to the rejection of the null hypothesis.
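
The arithmetic behind the table can be verified in a few lines. This sketch starts from the sums of squares and degrees of freedom printed above and recomputes the mean squares, the F-statistic, and the p-value (the upper-tail probability of the F distribution); small differences in the last digits come from the rounded inputs.

```python
from scipy import stats

# Values copied from the ANOVA table above
ss_species, df_species = 63.212133, 2
ss_resid, df_resid = 38.956200, 147

# Mean square = sum of squares / degrees of freedom
ms_species = ss_species / df_species
ms_resid = ss_resid / df_resid

# F-statistic = between-group mean square / within-group mean square
f_stat = ms_species / ms_resid

# P-value: probability of an F at least this large under H0
p_value = stats.f.sf(f_stat, df_species, df_resid)
print(f"mean_sq: {ms_species:.6f}, {ms_resid:.6f}; F: {f_stat:.4f}; p: {p_value:.3e}")
```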

Q: What is the null hypothesis for the one-way ANOVA?

A: The null hypothesis (H₀) states that the means of the different groups are equal.