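This tutorial covers several hypothesis testing methods using the Iris dataset. We will go through the following tests: the one-sample Z-test, the one-sample T-test, the Chi-square test for independence, and the one-way ANOVA.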
    import pandas as pd

    # Load the Iris dataset
    data = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

    # Show the head of the dataset
    print(data.head())
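The one-sample Z-test tests if the sample mean (x̄) is significantly different from a known population mean (μ) when the population standard deviation (σ) is known.

$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

Where:
x̄ is the sample mean
μ is the known population mean
σ is the known population standard deviation
n is the sample size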
    from scipy.stats import norm
    import numpy as np
    import pandas as pd

    # Load the Iris dataset
    data = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

    # Sample data: sepal length of Iris-setosa
    data_setosa = data[data['species'] == 'setosa']['sepal_length']

    # Known population parameters
    mu = 5.0
    sigma = 0.5

    # Calculate the sample mean and sample size
    sample_mean = np.mean(data_setosa)
    n = len(data_setosa)

    # Calculate the Z-statistic
    z_stat = (sample_mean - mu) / (sigma / np.sqrt(n))

    # Calculate the two-sided p-value
    p_value = 2 * (1 - norm.cdf(abs(z_stat)))

    print(f"Z-statistic: {z_stat}, P-value: {p_value}")
Output:
Z-statistic: 0.08485281374238891, P-value: 0.932378405606689
In our example, the sample mean is the average sepal length of the Iris-setosa species from the sample dataset. The population mean (μ) is given as 5.0.
The output of the Z-test provides two key values: the Z-statistic and the p-value.
Z-statistic: 0.08485281374238891
This value represents the number of standard errors (σ/√n) the sample mean is away from the population mean.
A Z-statistic close to 0 indicates that the sample mean is very close to the population mean. In this case, the Z-statistic is approximately 0.085, which means the sample mean is very close to the population mean (within about 0.085 standard errors).
P-value: 0.932378405606689
The p-value indicates the probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true.
A high p-value (typically greater than 0.05) suggests that there is no significant evidence to reject the null hypothesis. In this case, the p-value is approximately 0.932, which is very high. This indicates that the difference between the sample mean and the population mean is not statistically significant.
Conclusion:
Given the high p-value (0.932), we fail to reject the null hypothesis. This means there is not enough evidence to conclude that the sample mean of the sepal lengths of Iris-setosa is significantly different from the known population mean of 5.0.
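The same decision can also be reached by comparing the Z-statistic with a critical value. The snippet below is a minimal sketch of that rule; it reuses the Z-statistic printed above and assumes a significance level of 0.05 (`alpha` and `z_crit` are names introduced here for illustration).

```python
from scipy.stats import norm

# Z-statistic reported in the output above
z_stat = 0.08485281374238891
alpha = 0.05  # assumed significance level

# Two-sided critical value: reject H0 when |z| exceeds it
z_crit = norm.ppf(1 - alpha / 2)

# Two-sided p-value via the survival function (equivalent to 2 * (1 - cdf))
p_value = 2 * norm.sf(abs(z_stat))

reject_h0 = abs(z_stat) > z_crit
print(f"Critical value: {z_crit:.3f}, p-value: {p_value:.3f}, reject H0: {reject_h0}")
```

Both views agree: |z| is far below the critical value, and the p-value is far above 0.05.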
Q: What is the null hypothesis for the one-sample Z-test?
A: The null hypothesis (H0) states that the mean of the population the sample was drawn from equals the known population mean μ, so any difference between x̄ and μ is due to sampling variation.
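The one-sample T-test tests if the sample mean (x̄) is significantly different from a known population mean (μ) when the population standard deviation is unknown.

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

Where:
x̄ is the sample mean
μ is the known population mean
s is the sample standard deviation
n is the sample size (the test uses n - 1 degrees of freedom)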
    from scipy import stats

    # One-sample T-test (reuses data_setosa and mu from the Z-test example)
    t_stat, p_value = stats.ttest_1samp(data_setosa, popmean=mu)

    print(f"T-statistic: {t_stat}, P-value: {p_value}")
Output:
T-statistic: 0.12036212238318053, P-value: 0.9046884777690936
T-statistic: 0.12036212238318053
Meaning:
The T-statistic represents how many standard errors the sample mean is away from the population mean under the null hypothesis.
Value Interpretation:
A T-statistic of 0.12 is very close to 0, indicating that the sample mean is very close to the population mean. This means that the observed difference between the sample mean and the population mean is very small compared to the variability in the sample data.
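To make "standard errors" concrete, here is a small sketch that computes the T-statistic by hand and checks it against `stats.ttest_1samp`. The sample values are made up for illustration and are not taken from the Iris data.

```python
import numpy as np
from scipy import stats

# Made-up sample, used only for illustration (not the Iris data)
sample = np.array([4.8, 5.2, 5.0, 4.9, 5.3, 5.1, 4.7, 5.0])
mu = 5.0  # hypothesized population mean

n = len(sample)
sample_mean = sample.mean()
sample_std = sample.std(ddof=1)  # ddof=1 gives the sample standard deviation

# Standard error of the mean and the resulting T-statistic
standard_error = sample_std / np.sqrt(n)
t_manual = (sample_mean - mu) / standard_error

# SciPy's one-sample T-test should produce the same statistic
t_scipy, p_scipy = stats.ttest_1samp(sample, popmean=mu)
print(f"Manual T: {t_manual:.6f}, SciPy T: {t_scipy:.6f}")
```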
P-value: 0.9046884777690936
Meaning:
The p-value represents the probability of observing a T-statistic as extreme as (or more extreme than) the one obtained if the null hypothesis is true.
Value Interpretation:
A p-value of approximately 0.905 is very high. This suggests that the observed T-statistic is consistent with the null hypothesis, and there is a high probability of observing such a result if the null hypothesis were true.
Conclusion:
Null Hypothesis (H0):
The sample mean is equal to the population mean.
Alternative Hypothesis (H1):
The sample mean is not equal to the population mean.
Given the high p-value (0.905), we fail to reject the null hypothesis. This means that there is not enough evidence to conclude that the sample mean is significantly different from the population mean. In other words, the data are consistent with a population mean of 5.0, considering the sample's variability.
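The p-value can be recovered from the T-statistic and the t distribution. The sketch below reuses the T-statistic printed above and assumes n = 50 setosa observations (50 per species in the Iris dataset), which gives 49 degrees of freedom; the 0.05 significance level is also an assumption.

```python
from scipy import stats

t_stat = 0.12036212238318053  # T-statistic reported in the output above
n = 50                        # assumed number of setosa observations
df = n - 1                    # degrees of freedom for a one-sample T-test

# Two-sided p-value from the t distribution
p_value = 2 * stats.t.sf(abs(t_stat), df)

# Two-sided critical value at an assumed 0.05 significance level
t_crit = stats.t.ppf(0.975, df)

print(f"p-value: {p_value:.4f}, critical value: {t_crit:.4f}")
print(f"Reject H0: {abs(t_stat) > t_crit}")
```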
Q: What is the null hypothesis for the one-sample T-test?
A: The null hypothesis (H0) states that the mean of the population the sample was drawn from equals the known population mean μ, so any difference between x̄ and μ is due to sampling variation.
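The Chi-square test for independence tests if two categorical variables are independent.

$$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Where:
O_ij is the observed frequency in row i and column j of the contingency table
E_ij is the expected frequency under independence, (row i total × column j total) / grand total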
    import pandas as pd
    from scipy.stats import chi2_contingency

    # Load the Iris dataset
    data = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
    data.head()
|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
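    # Bin petal width into four categories (pd.cut intervals include the right edge by default)
    data['Petal Width Category'] = pd.cut(
        data['petal_width'],
        bins=[0, 0.5, 1, 1.5, 2.5],
        labels=['Very Narrow', 'Narrow', 'Medium', 'Wide'],
    )

    # Cross-tabulate petal width category against species
    contingency_table = pd.crosstab(data['Petal Width Category'], data['species'])
    print(contingency_table.to_string())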
Petal Width Category | setosa | versicolor | virginica |
---|---|---|---|
Very Narrow (0 - 0.5) | 49 | 0 | 0 |
Narrow (0.5 - 1) | 1 | 7 | 0 |
Medium (1 - 1.5) | 0 | 38 | 3 |
Wide (1.5 - 2.5) | 0 | 5 | 47 |
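The ranges in the row labels depend on how `pd.cut` treats values that fall exactly on a bin edge. By default each interval is open on the left and closed on the right, so 0.5 is counted as 'Very Narrow' and 1.5 as 'Medium'. The sketch below shows this with a few illustrative widths chosen to sit on or just past the edges; the values are not taken from the dataset.

```python
import pandas as pd

# Illustrative widths on or just past the bin edges
widths = pd.Series([0.5, 0.6, 1.0, 1.5, 1.6])

categories = pd.cut(
    widths,
    bins=[0, 0.5, 1, 1.5, 2.5],
    labels=['Very Narrow', 'Narrow', 'Medium', 'Wide'],
)

# Each value lands in the interval whose right edge it does not exceed
print(categories.tolist())
```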
    # Perform Chi-square test for independence
    chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)

    print(f"\nChi-square Statistic: {chi2_stat}")
    print(f"P-value: {p_value}")
    print(f"Degrees of Freedom: {dof}")
    print(f"Expected Frequencies Table:\n{pd.DataFrame(expected, index=contingency_table.index, columns=contingency_table.columns)}")
Output:
Chi-square Statistic: 250.95168855534712
P-value: 2.5677074777536407e-51
Degrees of Freedom: 6
Expected Frequencies Table:
Petal Width Category | setosa | versicolor | virginica |
---|---|---|---|
Very Narrow | 16.333333 | 16.333333 | 16.333333 |
Narrow | 2.666667 | 2.666667 | 2.666667 |
Medium | 13.666667 | 13.666667 | 13.666667 |
Wide | 17.333333 | 17.333333 | 17.333333 |
Chi-square Statistic (250.95):
A large value indicates a strong deviation between observed and expected frequencies.
P-value (2.57e-51):
The very small p-value suggests that the null hypothesis (no association) can be rejected, indicating a significant association between petal width category and species.
Degrees of Freedom (6):
Calculated as (number of rows - 1) * (number of columns - 1). With 4 petal width categories and 3 species, this is (4 - 1) * (3 - 1) = 6.
Expected Frequencies:
These values show what the counts would be if there were no association between the categories. The significant differences between observed and expected frequencies support the strong association between petal width category and species indicated by the Chi-square test.
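The expected frequencies, the statistic, and the degrees of freedom can be reproduced by hand from the observed counts. The following is a rough sketch that copies the contingency table from above into a NumPy array and compares the manual result with `chi2_contingency`; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts copied from the contingency table above
# (rows: Very Narrow, Narrow, Medium, Wide; columns: setosa, versicolor, virginica)
observed = np.array([
    [49, 0, 0],
    [1, 7, 0],
    [0, 38, 3],
    [0, 5, 47],
])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected count under independence: row total * column total / grand total
expected = row_totals * col_totals / grand_total

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Cross-check against SciPy
chi2_scipy, p_value, dof_scipy, expected_scipy = chi2_contingency(observed)
print(f"Manual chi-square: {chi2_manual:.4f}, SciPy: {chi2_scipy:.4f}, dof: {dof}")
```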
Conclusion:
The analysis reveals a strong and statistically significant association between Petal Width Category and species, indicating that the petal width of iris flowers varies significantly between species.
Q: What is the null hypothesis for the Chi-square test for independence?
A: The null hypothesis (H0) states that the two categorical variables are independent.
The one-way ANOVA tests if there are significant differences between the means of three or more independent groups. In the context of the Iris dataset, a one-way ANOVA (Analysis of Variance) can be used to determine if there are significant differences in the means of a particular measurement (such as sepal length, sepal width, petal length, or petal width) across the three different species of iris (setosa, versicolor, and virginica).
Purpose: To test the null hypothesis that the means of several groups are equal.
Groups: In this context, the groups are the different species of iris (setosa, versicolor, and virginica).
Variable: The measurement of interest, such as sepal length.
Formulate Hypotheses:
Null Hypothesis (H0): The means of sepal length for setosa, versicolor, and virginica are equal.
Alternative Hypothesis (H1): At least one species has a different mean sepal length.
Calculate the ANOVA:
Between-group variability: Variability of the group means around the overall (grand) mean.
Within-group variability: Variability within each group.
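ANOVA F-statistic:

$$F = \frac{MS_{between}}{MS_{within}} = \frac{SS_{between} / (k - 1)}{SS_{within} / (N - k)}$$

Where:
SS_between is the between-group sum of squares
SS_within is the within-group sum of squares
k is the number of groups
N is the total number of observations
If the F-statistic is large enough that its p-value falls below the chosen significance level (typically 0.05), the null hypothesis can be rejected.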
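The between-group and within-group pieces can be computed by hand. The sketch below does so for three small made-up groups (not the Iris data) and checks the resulting F-statistic against `stats.f_oneway`; all variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Made-up measurements for three groups (illustration only)
groups = [
    np.array([5.0, 4.8, 5.2, 4.9]),
    np.array([5.9, 6.1, 5.7, 6.0]),
    np.array([6.5, 6.8, 6.3, 6.6]),
]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within

# Cross-check against SciPy
f_scipy, p_value = stats.f_oneway(*groups)
print(f"Manual F: {f_manual:.4f}, SciPy F: {f_scipy:.4f}")
```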
    # CODE 1
    import pandas as pd
    import scipy.stats as stats

    # Load the Iris dataset
    data = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

    # Perform one-way ANOVA
    f_statistic, p_value = stats.f_oneway(
        data[data['species'] == 'setosa']['sepal_length'],
        data[data['species'] == 'versicolor']['sepal_length'],
        data[data['species'] == 'virginica']['sepal_length']
    )

    print(f"F-statistic: {f_statistic}")
    print(f"P-value: {p_value}")

    # CODE EXPLANATION
    # Importing Libraries:
    #   pandas: A powerful data manipulation library in Python.
    #   scipy.stats: A module that contains a large number of probability distributions
    #     as well as a growing library of statistical functions.
    # Loading the Dataset:
    #   pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv'):
    #     This function loads the Iris dataset from a URL into a pandas DataFrame called data.
    # Performing One-way ANOVA:
    #   stats.f_oneway(...): This function performs a one-way ANOVA test. It takes the sepal
    #     lengths of the three species (setosa, versicolor, and virginica) as input and
    #     compares their means.
    #   data[data['species'] == 'setosa']['sepal_length']: This extracts the sepal lengths of the setosa species.
    #   data[data['species'] == 'versicolor']['sepal_length']: This extracts the sepal lengths of the versicolor species.
    #   data[data['species'] == 'virginica']['sepal_length']: This extracts the sepal lengths of the virginica species.
    # Printing the Results:
    #   print(f"F-statistic: {f_statistic}"): This prints the F-statistic value.
    #   print(f"P-value: {p_value}"): This prints the p-value.
Output:
F-statistic: 119.26450218450468
P-value: 1.6696691907693826e-31
F-statistic: 119.26: This value indicates the ratio of the variability between the group means to the variability within the groups. A higher value indicates a greater difference between group means relative to the within-group variability.
P-value: 1.669669e-31: Given the very small p-value (much smaller than the typical significance level of 0.05), we reject the null hypothesis. This strong evidence suggests that there are significant differences in the mean sepal lengths among the three species of iris.
    # CODE 2
    from statsmodels.stats.anova import anova_lm
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    # One-way ANOVA (reuses the data DataFrame loaded in CODE 1)
    model = ols('sepal_length ~ species', data=data).fit()
    anova_result = anova_lm(model)

    print(anova_result)

    # CODE EXPLANATION
    # Importing Libraries:
    #   anova_lm: This function is used to perform ANOVA on a linear model.
    #   statsmodels.api: This is the main namespace for statsmodels, a library used for statistical modeling.
    #   ols: This function is used to define the ordinary least squares (OLS) regression model.
    # Defining the Model:
    #   'sepal_length ~ species': This formula specifies the dependent variable (sepal_length)
    #     and the independent variable (species). The tilde (~) separates the dependent
    #     variable from the independent variable.
    #   data=data: This specifies that the data for the model is contained in the data DataFrame.
    #   ols(...): This function creates an OLS regression model based on the formula and data provided.
    #   .fit(): This method fits the OLS model to the data, estimating the model parameters.
    # Performing ANOVA:
    #   anova_lm(model): This function performs ANOVA on the fitted OLS model to partition the
    #     variability in the data into components attributable to the independent variable
    #     (species) and residual error.
    # Printing the Results:
    #   print(anova_result): This command outputs the results of the ANOVA, which includes the
    #     degrees of freedom (df), sum of squares (sum_sq), mean squares (mean_sq),
    #     F-statistic, and p-value.
Output:
|   | df | sum_sq | mean_sq | F | PR(>F) |
|---|---|---|---|---|---|
| species | 2.0 | 63.212133 | 31.606067 | 119.264502 | 1.669669e-31 |
| Residual | 147.0 | 38.956200 | 0.265008 | NaN | NaN |
Degrees of Freedom (df):
For the species (between-group variability), the degrees of freedom are calculated as the number of groups minus one (3 species - 1 = 2).
For the Residual (within-group variability), the degrees of freedom are the total number of observations minus the number of groups (150 - 3 = 147).
Sum of Squares (sum_sq):
The sum_sq for species (63.212133) represents the variation in sepal length that is explained by the differences between species.
The sum_sq for residual (38.956200) represents the variation within the species that the species differences do not explain.
Mean Square (mean_sq):
The mean_sq for species (31.606067) is calculated as the sum of squares divided by the degrees of freedom for species (63.212133 / 2).
The mean_sq for residual (0.265008) is calculated as the sum of squares divided by the degrees of freedom for residuals (38.956200 / 147).
F-statistic:
The F-statistic (119.264502) is the ratio of the mean square for species to the mean square for residuals (31.606067 / 0.265008). This high value indicates that the variation between species is large relative to the variation within species.
P-value:
The p-value (1.669669e-31) is extremely small, indicating that the probability of obtaining an F-statistic as extreme as the one observed, assuming the null hypothesis is true, is very low. This strong evidence leads to the rejection of the null hypothesis.
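The F-statistic and p-value in the table can be rebuilt from the mean squares and degrees of freedom. Here is a brief sketch, using the rounded values copied from the ANOVA table above, so the result matches the table only up to rounding.

```python
from scipy import stats

# Values copied (rounded) from the ANOVA table above
ms_species, ms_residual = 31.606067, 0.265008
df_species, df_residual = 2, 147

# F is the ratio of the two mean squares
f_stat = ms_species / ms_residual

# p-value: upper-tail probability of the F distribution with (2, 147) degrees of freedom
p_value = stats.f.sf(f_stat, df_species, df_residual)

print(f"F-statistic: {f_stat:.4f}, p-value: {p_value:.3e}")
```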
Q: What is the null hypothesis for the one-way ANOVA?
A: The null hypothesis (H0) states that the means of the different groups are equal.