COS 102 / Week 08
Data analysis with Pandas
The DataFrame: building one from lists and dictionaries, selecting rows and columns, reading and writing CSV files, and looping over data.
- Subjects: Pandas / DataFrames
- Builds on: NumPy / Data types
Pandas is a library for data manipulation and analysis. Its main structure is the DataFrame: a two-dimensional, size-mutable table with labelled rows and columns. A DataFrame has three parts: the data, the rows, and the columns.
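As a minimal sketch of those three parts, the snippet below builds a tiny DataFrame and inspects each one. The values and the row labels "a" and "b" are illustrative only, chosen for this example rather than taken from any course dataset.

```python
import pandas as pd

# Illustrative data; the row labels "a" and "b" are made up for this sketch.
df = pd.DataFrame(
    {"Name": ["Angela", "Luis"], "Age": [20, 19]},
    index=["a", "b"],
)

print(df.to_numpy())  # the data, as a NumPy array
print(df.index)       # the row labels
print(df.columns)     # the column labels

# Size-mutable: assigning to a new column name adds a column in place.
df["City"] = ["Abuja", "Kano"]
print(df.shape)  # (2, 3)
```

When no index is passed, pandas numbers the rows 0, 1, 2, and so on, which is what the examples in the rest of this page rely on.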
import pandas as pd  # the usual alias
Creating a DataFrame
From a list:
import pandas as pd
info = ["CSC", "102", "is", "the", "best", "course", "ever"]
df = pd.DataFrame(info)
df
From a dictionary of lists, where each key becomes a column:
import pandas as pd
data = {
"Name": ["Angela", "Precious", "Luis", "Ade"],
"Age": [20, 21, 19, 18],
}
df = pd.DataFrame(data)
df
Selecting columns and rows
Select columns by passing a list of their names:
data = {
"Name": ["Clem", "Prince", "Edward", "Adele"],
"Age": [27, 24, 22, 32],
"Address": ["Abuja", "Kano", "Minna", "Lagos"],
"Qualification": ["Msc", "MA", "MCA", "Phd"],
}
df = pd.DataFrame(data)
df[["Name", "Qualification", "Address"]]
Select rows by position with iloc:
df.iloc[1] # the second row
df.iloc[1, 3]  # second row, fourth column
Reading and writing files
Read a CSV into a DataFrame, and write one back out:
import pandas as pd
data = pd.read_csv("employee_records.csv")
data.head(2) # first two rows with the header
record = {
"name": ["Abel", "Kamsi", "Oyode", "Chinelo"],
"degree": ["MBA", "BCA", "M.Tech", "MBA"],
"score": [90, 40, 80, 98],
}
pd.DataFrame(record).to_csv("record.csv")  # also writes the row index; pass index=False to omit it
Looping over data
Iterate rows with iterrows:
for index, row in df.iterrows():
    print(index, row["Name"], row["Age"])
Iterate columns by looping over the list of column names:
for column in list(df):
    print(column, df[column].iloc[0])  # column name and its first value
Projects
- From kaggle.com, download a dataset, then display its first seven rows, its first three columns, and a single row with the header.
- Model a company's product catalogue (segments and brands) as a DataFrame and save it to cadbury_market.csv.
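One possible starting point for the projects is sketched below. The segment and brand names are hypothetical placeholders to be replaced with real catalogue data, and the two-column layout is only one assumed way to model it. The same three display calls would apply to a DataFrame read from a downloaded dataset.

```python
import pandas as pd

# Placeholder catalogue: these segment and brand names are hypothetical.
catalogue = {
    "Segment": ["Segment A", "Segment A", "Segment B"],
    "Brand": ["Brand 1", "Brand 2", "Brand 3"],
}
df = pd.DataFrame(catalogue)

# index=False leaves the 0, 1, 2 row labels out of the file.
df.to_csv("cadbury_market.csv", index=False)

# Reading the file back is a quick check that it was written as intended.
check = pd.read_csv("cadbury_market.csv")
print(check.head(7))      # first seven rows (fewer if the table is shorter)
print(check.iloc[:, :3])  # first three columns (all of them in this small table)
print(check.iloc[[0]])    # a single row, printed together with the header
```

Passing a list to iloc, as in check.iloc[[0]], returns a one-row DataFrame rather than a Series, which is why the header is printed along with the row.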
Commit every practice cell and project to your GitHub repository.