Pandas Tutorial

Introduction to Pandas

Pandas is an open-source Python library for data manipulation and analysis. Its two core data structures are the Series, a one-dimensional labeled array, and the DataFrame, a two-dimensional labeled table whose columns can hold different data types.

Installing Pandas

pip install pandas

Importing Pandas

import pandas as pd

Creating DataFrames

You can create DataFrames from various data structures like dictionaries, lists, and more.

From a Dictionary


import pandas as pd

data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London']
}

df = pd.DataFrame(data)
print(df)
    

    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London
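
Lists work as well as dictionaries. The following is a minimal sketch of two list-based layouts; the names `rows`, `df_from_rows`, and `df_from_lists` are made up for illustration and are not used elsewhere in this tutorial.

```python
import pandas as pd

# A list of dictionaries: each dictionary is one row, keys become column names.
rows = [
    {'Name': 'John', 'Age': 28},
    {'Name': 'Anna', 'Age': 24},
]
df_from_rows = pd.DataFrame(rows)

# A list of lists: each inner list is one row, so the column names
# have to be passed explicitly.
df_from_lists = pd.DataFrame(
    [['John', 28], ['Anna', 24]],
    columns=['Name', 'Age'],
)

# Both constructions should describe the same table.
print(df_from_rows.equals(df_from_lists))
```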
        

Reading and Writing CSV Files

Reading CSV Files


# Reading a CSV file
df_from_csv = pd.read_csv('data.csv')
print(df_from_csv)
    

# Output will depend on the contents of 'data.csv'
        


Creating CSV Files


# Writing to a CSV file
df.to_csv('output.csv', index=False)
    

# 'output.csv' will now contain the data from the DataFrame
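
To try reading and writing without touching the disk, the round trip can be sketched in memory. This assumes a small stand-in DataFrame called `df_demo` (a hypothetical name) and uses `io.StringIO` in place of a real file.

```python
import io

import pandas as pd

df_demo = pd.DataFrame({'Name': ['John', 'Anna'], 'Age': [28, 24]})

# With no path argument, to_csv returns the CSV text as a string.
csv_text = df_demo.to_csv(index=False)
print(csv_text)

# read_csv accepts file-like objects, so StringIO can stand in for a file.
df_back = pd.read_csv(io.StringIO(csv_text))
print(df_back)
```

Passing `index=False` keeps the row labels out of the file; without it, reading the file back would produce an extra unnamed column holding the old index.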
        


Working with Multiple CSV Files

Reading Multiple Files and Combining Columns


# Reading multiple CSV files
df1 = pd.read_csv('data1.csv')
df2 = pd.read_csv('data2.csv')

# Selecting specific columns
df1_selected = df1[['Name', 'Age']]
df2_selected = df2[['City', 'Salary']]

# Combining DataFrames side by side; axis=1 aligns rows by index label
combined_df = pd.concat([df1_selected, df2_selected], axis=1)
print(combined_df)
    

# Output will depend on the contents of 'data1.csv' and 'data2.csv'
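
One thing to watch when combining columns: `pd.concat` with `axis=1` matches rows by index label, not by position. The sketch below uses two small hand-built DataFrames (`left` and `right`, hypothetical names) with deliberately different indexes to show the effect.

```python
import pandas as pd

left = pd.DataFrame({'Name': ['John', 'Anna']}, index=[0, 1])
right = pd.DataFrame({'City': ['Paris', 'Berlin']}, index=[1, 2])

# Only index label 1 exists in both inputs; labels 0 and 2 are kept
# and padded with NaN in the column that has no matching row.
combined = pd.concat([left, right], axis=1)
print(combined)
```

If the two files share a key column rather than a common row order, `pd.merge` on that key is usually a better fit than `concat`.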
        


DataFrame Operations

Viewing Data


print(df.head())
    

    Name  Age      City
0   John   28  New York
1   Anna   24     Paris
2  Peter   35    Berlin
3  Linda   32    London
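
`head()` returns the first five rows by default, which is why the four-row example prints in full. On a larger table the row count can be set explicitly, and `tail()` does the same from the end. A short sketch, using a hypothetical ten-row `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({'Value': range(10)})

first_three = df_demo.head(3)  # rows 0, 1, 2
last_two = df_demo.tail(2)     # rows 8, 9

print(first_three)
print(last_two)
```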
        

Getting DataFrame Information


# info() prints its summary directly and returns None, so no print() is needed
df.info()
    


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    4 non-null      object
 1   Age     4 non-null      int64 
 2   City    4 non-null      object
dtypes: int64(1), object(2)
memory usage: 224.0+ bytes
        

Basic Data Analysis

Descriptive Statistics


print(df.describe())
    

             Age
count   4.000000
mean   29.750000
std     4.787136
min    24.000000
25%    27.000000
50%    30.000000
75%    32.750000
max    35.000000
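
Each figure in the `describe()` table can also be computed on its own. The sketch below rebuilds the Age column as a standalone Series named `ages` (a name chosen here for illustration) so it runs by itself.

```python
import pandas as pd

ages = pd.Series([28, 24, 35, 32], name='Age')

print(ages.mean())          # 29.75
print(ages.median())        # 30.0
print(ages.std())           # sample standard deviation (divides by n - 1)
print(ages.quantile(0.25))  # 27.0, using linear interpolation
print(ages.quantile(0.75))  # 32.75
```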
        


Data Manipulation

Filtering Data


print(df[df['Age'] > 30])
    

    Name  Age    City
2  Peter   35  Berlin
3  Linda   32  London
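
Conditions can be combined with `&` (and) and `|` (or). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A minimal sketch, using a self-contained copy of the sample data under the hypothetical name `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
})

# Keep rows where Age is above 25 AND the city is not Berlin.
mask = (df_demo['Age'] > 25) & (df_demo['City'] != 'Berlin')
filtered = df_demo[mask]
print(filtered)
```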
        


Adding New Columns


df['Salary'] = [70000, 65000, 60000, 72000]
print(df)
    

    Name  Age      City  Salary
0   John   28  New York   70000
1   Anna   24     Paris   65000
2  Peter   35    Berlin   60000
3  Linda   32    London   72000
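
New columns are often derived from existing ones rather than typed in by hand. The sketch below shows two ways to do that; `df_demo`, `Monthly`, and `Bonus` are illustrative names, and the 10% bonus rate is an arbitrary assumption.

```python
import pandas as pd

df_demo = pd.DataFrame({'Name': ['John', 'Anna'], 'Salary': [70000, 65000]})

# Assignment adds the column to df_demo in place.
df_demo['Monthly'] = df_demo['Salary'] / 12

# assign() leaves df_demo untouched and returns a new DataFrame instead.
with_bonus = df_demo.assign(Bonus=df_demo['Salary'] * 0.10)

print(with_bonus)
```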
        


Handling Missing Data

Checking for Missing Data


print(df.isnull().sum())
    

Name      0
Age       0
City      0
Salary    0
dtype: int64
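
The counts above are all zero because the sample data is complete. To see what the check looks like when values are missing, here is a sketch with one gap inserted on purpose (`df_demo` is a hypothetical stand-in):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter'],
    'Salary': [70000.0, np.nan, 60000.0],
})

# Number of missing values in each column.
missing_per_column = df_demo.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value.
rows_with_missing = df_demo[df_demo.isnull().any(axis=1)]
print(rows_with_missing)
```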
        


Filling Missing Data


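# Introduce a missing value for demonstration. Casting to float first lets the
# column hold NaN without an implicit dtype change.
df['Salary'] = df['Salary'].astype(float)
df.loc[1, 'Salary'] = None
# fillna returns a new DataFrame; df itself still contains the missing value
print(df.fillna(0))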
    

    Name  Age      City   Salary
0   John   28  New York  70000.0
1   Anna   24     Paris      0.0
2  Peter   35    Berlin  60000.0
3  Linda   32    London  72000.0
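
Filling with 0 is only one option, and it can distort averages. Two common alternatives are filling with the column mean or dropping the incomplete entries. A sketch on a standalone Series named `salary` (an illustrative name):

```python
import numpy as np
import pandas as pd

salary = pd.Series([70000.0, np.nan, 60000.0, 72000.0], name='Salary')

# mean() skips NaN by default, so the fill value is the mean of the
# three known salaries.
filled = salary.fillna(salary.mean())

# dropna() removes the missing entry instead of replacing it.
dropped = salary.dropna()

print(filled)
print(dropped)
```

Both calls return new objects; `salary` itself still contains the NaN unless the result is assigned back.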
        


Group By and Aggregation

Grouping Data


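# Average only the numeric columns; 'Name' is text and cannot be averaged.
# df still has a missing Salary for Paris, because fillna returned a new DataFrame.
grouped = df.groupby('City')[['Age', 'Salary']].mean()
print(grouped)
    

           Age   Salary
City                   
Berlin    35.0  60000.0
London    32.0  72000.0
New York  28.0  70000.0
Paris     24.0      NaN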
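
With one row per city, the group means simply repeat the row values. Grouping becomes more informative when groups hold several rows and more than one statistic is requested. A sketch with made-up data in a hypothetical `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'City': ['Paris', 'Paris', 'Berlin'],
    'Salary': [65000, 75000, 60000],
})

# agg() with a list of function names produces one column per statistic.
summary = df_demo.groupby('City')['Salary'].agg(['mean', 'max', 'count'])
print(summary)
```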
        


Questions and Answers

Q: How do you select a single column in a DataFrame?

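A: Use bracket notation, df['Name'], or dot notation, df.Name. Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so bracket notation is the safer default. For example: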


print(df['Name'])
print(df.Name)
    

0     John
1     Anna
2    Peter
3    Linda
Name: Name, dtype: object

0     John
1     Anna
2    Peter
3    Linda
Name: Name, dtype: object
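
The sketch below illustrates where dot notation falls short, and how to select several columns at once. The column names `First Job` and `count` are invented for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['John'],
    'First Job': ['Clerk'],  # contains a space, so df_demo.First Job is a syntax error
    'count': [3],            # same name as the DataFrame.count method
})

print(df_demo['First Job'])      # bracket notation handles any column name
print(df_demo['count'])          # the column
print(callable(df_demo.count))   # dot notation finds the method, not the column

# A list of names inside the brackets returns a DataFrame, not a Series.
subset = df_demo[['Name', 'count']]
print(subset)
```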
        


Q: How do you drop a column from a DataFrame?

A: You can drop a column using the drop method. For example:


df_dropped = df.drop('City', axis=1)
print(df_dropped)
    

    Name  Age   Salary
0   John   28  70000.0
1   Anna   24      NaN
2  Peter   35  60000.0
3  Linda   32  72000.0
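
An equivalent spelling uses the `columns` keyword, which avoids having to remember the axis number and accepts a list when several columns should go. A sketch on a hypothetical `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['John', 'Anna'],
    'Age': [28, 24],
    'City': ['New York', 'Paris'],
})

# drop() returns a new DataFrame; df_demo keeps all three columns.
trimmed = df_demo.drop(columns=['Age', 'City'])

print(trimmed)
print(list(df_demo.columns))
```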
        


Q: How do you sort a DataFrame by a column?

A: You can sort a DataFrame by a column using the sort_values method. For example:


sorted_df = df.sort_values(by='Age')
print(sorted_df)
    

    Name  Age      City   Salary
1   Anna   24     Paris      NaN
0   John   28  New York  70000.0
3  Linda   32    London  72000.0
2  Peter   35    Berlin  60000.0
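
Sorting keeps the original row labels, which is why the index above reads 1, 0, 3, 2. The sketch below sorts in descending order and renumbers the rows afterwards; `df_demo` and `oldest_first` are illustrative names.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
})

# ascending=False reverses the order; reset_index(drop=True) discards the
# old labels and numbers the rows 0..n-1 again.
oldest_first = (
    df_demo.sort_values(by='Age', ascending=False)
           .reset_index(drop=True)
)
print(oldest_first)
```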
        


Conclusion

In this tutorial, we covered the basics of pandas including creating DataFrames, reading and writing CSV files, basic operations, data manipulation, and handling missing data. We also explored group by and aggregation operations and answered some common questions. Pandas is a versatile library that is essential for data analysis in Python.