Pandas is a data manipulation and analysis library for Python. It provides two core data structures: Series, a one-dimensional labeled array, and DataFrame, a two-dimensional labeled table.
pip install pandas
import pandas as pd
You can create a DataFrame from many Python data structures, including dictionaries of lists, lists of dictionaries, and NumPy arrays.
import pandas as pd
data = {
'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 24, 35, 32],
'City': ['New York', 'Paris', 'Berlin', 'London']
}
df = pd.DataFrame(data)
print(df)
Name Age City
0 John 28 New York
1 Anna 24 Paris
2 Peter 35 Berlin
3 Linda 32 London
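The dictionary above maps column names to lists of values. The same kind of table can be built from a list of row records, which is often the shape data arrives in. A minimal sketch (the variable names `records`, `rows`, `df_records`, and `df_rows` are illustrative, not from the original):

```python
import pandas as pd

# Each dictionary is one row; keys become column names.
records = [
    {'Name': 'John', 'Age': 28, 'City': 'New York'},
    {'Name': 'Anna', 'Age': 24, 'City': 'Paris'},
]
df_records = pd.DataFrame(records)

# A list of lists needs the column names supplied explicitly.
rows = [['Peter', 35, 'Berlin'], ['Linda', 32, 'London']]
df_rows = pd.DataFrame(rows, columns=['Name', 'Age', 'City'])

print(df_records)
print(df_rows)
```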
# Reading a CSV file
df_from_csv = pd.read_csv('data.csv')
print(df_from_csv)
# Output will depend on the contents of 'data.csv'
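Since 'data.csv' is not provided here, the following sketch reads from an in-memory string instead, which read_csv treats like a file. The sample contents and the choice of columns are assumptions made for illustration; usecols and dtype are standard read_csv parameters.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for 'data.csv'.
csv_text = """Name,Age,City
John,28,New York
Anna,24,Paris
"""

# usecols limits which columns are parsed; dtype fixes a column's type up front.
df_sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['Name', 'Age'],
    dtype={'Age': 'int64'},
)
print(df_sample)
```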
# Writing to a CSV file
df.to_csv('output.csv', index=False)
# 'output.csv' will now contain the data from the DataFrame
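To see what index=False changes, this sketch produces the CSV text in memory rather than writing to disk (an assumption made so the example leaves no files behind). Calling to_csv with no path returns the CSV as a string.

```python
import pandas as pd

df_small = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})

# With no path argument, to_csv returns the CSV text instead of writing a file.
with_index = df_small.to_csv()
without_index = df_small.to_csv(index=False)

print(with_index)     # the first column holds the row labels
print(without_index)  # only the Name and Age columns
```

Leaving the index in is useful when the row labels carry meaning (dates, IDs); for a default 0..n-1 index it usually just adds an unnamed column to the file.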
# Reading multiple CSV files
df1 = pd.read_csv('data1.csv')
df2 = pd.read_csv('data2.csv')
# Selecting specific columns
df1_selected = df1[['Name', 'Age']]
df2_selected = df2[['City', 'Salary']]
# Combining DataFrames side by side; rows are matched by index label
combined_df = pd.concat([df1_selected, df2_selected], axis=1)
print(combined_df)
# Output will depend on the contents of 'data1.csv' and 'data2.csv'
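Note that pd.concat with axis=1 pairs rows by index label, not by any shared column, so it only gives a meaningful result when both files list the same records in the same order. If the files share a key column, merge may be a safer choice. The sketch below assumes a hypothetical shared 'Name' column and made-up values:

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files; rows are in different orders.
people = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})
jobs = pd.DataFrame({'Name': ['Anna', 'John'], 'Salary': [65000, 70000]})

# concat(axis=1) lines rows up by index, so John is paired with Anna's salary.
side_by_side = pd.concat([people, jobs[['Salary']]], axis=1)

# merge matches rows on the key column instead.
merged = people.merge(jobs, on='Name', how='inner')

print(side_by_side)
print(merged)
```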
print(df.head())
Name Age City
0 John 28 New York
1 Anna 24 Paris
2 Peter 35 Berlin
3 Linda 32 London
# info() prints its report directly and returns None, so no print() is needed
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 4 non-null object
1 Age 4 non-null int64
2 City 4 non-null object
dtypes: int64(1), object(2)
memory usage: 224.0+ bytes
print(df.describe())
Age
count 4.000000
mean 29.750000
std 4.787136
min 24.000000
25% 27.000000
50% 30.000000
75% 32.750000
max 35.000000
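The std row is the sample standard deviation (divisor n - 1), and the percentile rows use linear interpolation between sorted values, which is the default for quantile. A short sketch that recomputes them for the same ages (the names `ages`, `quartiles`, and `sample_std` are illustrative):

```python
import pandas as pd

ages = pd.Series([28, 24, 35, 32], name='Age')

# Linearly interpolated quartiles, as reported by describe().
quartiles = ages.quantile([0.25, 0.5, 0.75])

# Series.std() defaults to ddof=1, the sample standard deviation.
sample_std = ages.std()

print(quartiles)
print(round(sample_std, 6))
```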
print(df[df['Age'] > 30])
Name Age City
2 Peter 35 Berlin
3 Linda 32 London
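Conditions can be combined with & (and), | (or), and ~ (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A sketch on the same data (the names `df_people`, `mask`, `filtered`, and `in_cities` are illustrative):

```python
import pandas as pd

df_people = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
})

# Rows where Age is over 25 AND the city is not Berlin.
mask = (df_people['Age'] > 25) & (df_people['City'] != 'Berlin')
filtered = df_people[mask]

# isin() tests membership in a list of values.
in_cities = df_people[df_people['City'].isin(['Paris', 'London'])]

print(filtered)
print(in_cities)
```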
df['Salary'] = [70000, 65000, 60000, 72000]
print(df)
Name Age City Salary
0 John 28 New York 70000
1 Anna 24 Paris 65000
2 Peter 35 Berlin 60000
3 Linda 32 London 72000
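A new column can also be computed from existing ones. The sketch below uses assign, which returns a new DataFrame and leaves the original untouched; the 'MonthlySalary' column name and the `df_pay` data are made up for illustration.

```python
import pandas as pd

df_pay = pd.DataFrame({
    'Name': ['John', 'Anna'],
    'Salary': [70000, 65000],
})

# Vectorized arithmetic applies to every row at once.
df_monthly = df_pay.assign(MonthlySalary=df_pay['Salary'] / 12)

print(df_monthly)
print(df_pay.columns.tolist())  # the original still has only Name and Salary
```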
print(df.isnull().sum())
Name 0
Age 0
City 0
Salary 0
dtype: int64
# Introduce a missing value for demonstration
df.loc[1, 'Salary'] = None
# fillna returns a new DataFrame; df itself still contains the missing value
print(df.fillna(0))
Name Age City Salary
0 John 28 New York 70000.0
1 Anna 24 Paris 0.0
2 Peter 35 Berlin 60000.0
3 Linda 32 London 72000.0
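fillna returns a new DataFrame rather than changing df, which is why the later drop and sort outputs in this tutorial still show NaN for Anna. To keep the result, assign it back. Filling with 0 can also distort averages, so two common alternatives are sketched here on a small stand-in DataFrame (the names `df_missing`, `filled_mean`, and `dropped` are illustrative):

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter'],
    'Salary': [70000.0, np.nan, 60000.0],
})

# Option 1: replace missing salaries with the column mean (mean() skips NaN).
filled_mean = df_missing.fillna({'Salary': df_missing['Salary'].mean()})

# Option 2: drop rows that have a missing Salary.
dropped = df_missing.dropna(subset=['Salary'])

print(filled_mean)
print(dropped)
print(df_missing['Salary'].isna().sum())  # the original is unchanged
```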
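# Select the numeric columns first; averaging the text column 'Name' raises a TypeError in pandas 2.0 and later
grouped = df.groupby('City')[['Age', 'Salary']].mean()
# Paris shows NaN because the earlier fillna(0) result was not assigned back to df
print(grouped)
Age Salary
City
Berlin 35.0 60000.0
London 32.0 72000.0
New York 28.0 70000.0
Paris 24.0 NaN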
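Because every city appears once in the sample data, each group mean is just that single row's value. Grouping is more informative when keys repeat. The following sketch uses made-up rows and named aggregation via agg (the `df_staff` data and the output column names are hypothetical):

```python
import pandas as pd

# Hypothetical data with repeated cities.
df_staff = pd.DataFrame({
    'City': ['Paris', 'Paris', 'Berlin', 'Berlin', 'Berlin'],
    'Salary': [65000, 55000, 60000, 62000, 58000],
})

# Named aggregation: output column name = (input column, function).
summary = df_staff.groupby('City').agg(
    mean_salary=('Salary', 'mean'),
    headcount=('Salary', 'count'),
)
print(summary)
```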
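Q: How do I select a single column from a DataFrame?
A: You can select a single column using bracket notation or dot notation. Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so bracket notation is the safer default. For example: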
print(df['Name'])
print(df.Name)
0 John
1 Anna
2 Peter
3 Linda
Name: Name, dtype: object
0 John
1 Anna
2 Peter
3 Linda
Name: Name, dtype: object
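Bracket notation returns a Series for a single name and a DataFrame for a list of names. Rows and columns can be selected together with loc. A brief sketch (the names `df_sel`, `single`, `as_frame`, and `subset` are illustrative):

```python
import pandas as pd

df_sel = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter'],
    'Age': [28, 24, 35],
    'City': ['New York', 'Paris', 'Berlin'],
})

single = df_sel['Name']        # a Series
as_frame = df_sel[['Name']]    # a one-column DataFrame

# loc takes a row selector and a column selector together.
subset = df_sel.loc[df_sel['Age'] > 25, ['Name', 'City']]

print(type(single).__name__, type(as_frame).__name__)
print(subset)
```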
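Q: How do I drop a column from a DataFrame?
A: You can drop a column using the drop method, which returns a new DataFrame and leaves the original unchanged. For example: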
df_dropped = df.drop('City', axis=1)
print(df_dropped)
Name Age Salary
0 John 28 70000.0
1 Anna 24 NaN
2 Peter 35 60000.0
3 Linda 32 72000.0
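drop also accepts a list through the columns keyword, which avoids the axis argument. A sketch on a stand-in DataFrame (errors='ignore' is a standard option that skips labels that do not exist; the names `df_wide` and `df_narrow` are illustrative):

```python
import pandas as pd

df_wide = pd.DataFrame({
    'Name': ['John', 'Anna'],
    'Age': [28, 24],
    'City': ['New York', 'Paris'],
})

# Drop several columns at once; 'Salary' is absent here and is silently skipped.
df_narrow = df_wide.drop(columns=['Age', 'City', 'Salary'], errors='ignore')

print(df_narrow.columns.tolist())
print(df_wide.columns.tolist())  # the original keeps all three columns
```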
Q: How do I sort a DataFrame by a column?
A: You can sort a DataFrame by a column using the sort_values method. Rows keep their original index labels after sorting. For example:
sorted_df = df.sort_values(by='Age')
print(sorted_df)
Name Age City Salary
1 Anna 24 Paris NaN
0 John 28 New York 70000.0
3 Linda 32 London 72000.0
2 Peter 35 Berlin 60000.0
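sort_values also takes ascending and na_position, and reset_index can renumber the rows afterwards. A sketch on a small stand-in DataFrame (the names `df_sort`, `by_age_desc`, `by_salary`, and `renumbered` are illustrative):

```python
import numpy as np
import pandas as pd

df_sort = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'Salary': [70000.0, np.nan, 60000.0, 72000.0],
})

# Oldest first.
by_age_desc = df_sort.sort_values(by='Age', ascending=False)

# Highest salary first; rows with a missing salary go last.
by_salary = df_sort.sort_values(by='Salary', ascending=False, na_position='last')

# reset_index(drop=True) renumbers the rows 0..n-1 after sorting.
renumbered = by_age_desc.reset_index(drop=True)

print(by_age_desc)
print(by_salary)
print(renumbered)
```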
In this tutorial, we covered the basics of pandas including creating DataFrames, reading and writing CSV files, basic operations, data manipulation, and handling missing data. We also explored group by and aggregation operations and answered some common questions. Pandas is a versatile library that is essential for data analysis in Python.