pandas is an implementation of data frames in Python - it takes much of its inspiration from R and NumPy.
pandas aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.
Key features:
DataFrame and Series (column) object classes
Reading and writing tabular data
Data munging (filtering, grouping, summarizing, joining, etc.)
Data reshaping
DataFrame
Just like R a DataFrame is a collection of vectors with a common length
Series objects are compatible with NumPy like functions (i.e. vectorized)
t = pd.Series([4,2,1,3], index=["a","b","c","d"])
t +1
a 5
b 3
c 2
d 4
dtype: int64
t /2+1
a 3.0
b 2.0
c 1.5
d 2.5
dtype: float64
np.log(t)
a 1.386294
b 0.693147
c 0.000000
d 1.098612
dtype: float64
np.exp(-t**2/2)
a 0.000335
b 0.135335
c 0.606531
d 0.011109
dtype: float64
Series indexing
Series can be indexed in the same was as NumPy arrays with the addition of being able to use index label(s) when selecting elements.
t = pd.Series([4,2,1,3], index=["a","b","c","d"])
t[1]
2
t[[1,2]]
b 2
c 1
dtype: int64
t["c"]
1
t[["a","d"]]
a 4
d 3
dtype: int64
t[t ==3]
d 3
dtype: int64
t[t %2==0]
a 4
b 2
dtype: int64
t["d"] =6t
a 4
b 2
c 1
d 6
dtype: int64
Index alignment
When performing operations with multiple series, generally pandas will attempt to align the operation by the index values,
m = pd.Series([1,2,3,4], index = ["a","b","c","d"])n = pd.Series([4,3,2,1], index = ["d","c","b","a"])o = pd.Series([1,1,1,1,1], index = ["b","d","a","c","e"])
m + n
a 2
b 4
c 6
d 8
dtype: int64
n + m
a 2
b 4
c 6
d 8
dtype: int64
n + o
a 2.0
b 3.0
c 4.0
d 5.0
e NaN
dtype: float64
Series and dicts
Series can also be constructed from dictionaries, in which case the keys are used as the index,
d = {"anna": "A+", "bob": "B-", "carol": "C", "dave": "D+"}pd.Series(d)
anna A+
bob B-
carol C
dave D+
dtype: object
Index order will follow key order, unless overriden by index,
pd.Series(d, index = ["dave","carol","bob","anna"])
dave D+
carol C
bob B-
anna A+
dtype: object
Missing values
Pandas encodes missing values using NaN (mostly),
s = pd.Series( {"anna": "A+", "bob": "B-", "carol": "C", "dave": "D+"}, index = ["erin","dave","carol","bob","anna"])
s
erin NaN
dave D+
carol C
bob B-
anna A+
dtype: object
pd.isna(s)
erin True
dave False
carol False
bob False
anna False
dtype: bool
s = pd.Series( {"anna": 97, "bob": 82, "carol": 75, "dave": 68}, index = ["erin","dave","carol","bob","anna"], dtype ='int64')
s
erin NaN
dave 68.0
carol 75.0
bob 82.0
anna 97.0
dtype: float64
pd.isna(s)
erin True
dave False
carol False
bob False
anna False
dtype: bool
Aside - why np.isna()?
s = pd.Series([1,2,3,None])s
0 1.0
1 2.0
2 3.0
3 NaN
dtype: float64
pd.isna(s)
0 False
1 False
2 False
3 True
dtype: bool
s == np.nan
0 False
1 False
2 False
3 False
dtype: bool
np.nan == np.nan
False
np.nan != np.nan
True
np.isnan(np.nan)
True
np.isnan(0)
False
Native NAs
Recent versions of pandas have attempted to adopt a more native missing value, particularly for integer and boolean types,
Earlier we saw how to read a DataFrame via read_csv(), but data frames can also be constructed via DataFrame(), in general this is done using a dictionary of columns:
id weight height date
anna 780 58.244369 180.317131 2022-02-01
bob 675 63.680724 185.107711 2022-02-02
carol 164 48.476114 180.108224 2022-02-03
dave 531 32.834607 162.617524 2022-02-04
erin 463 67.902511 172.070975 2022-02-05
df.loc["anna"]
id 780
weight 58.244369
height 180.317131
date 2022-02-01 00:00:00
Name: anna, dtype: object
df.loc[["anna"]]
id weight height date
anna 780 58.244369 180.317131 2022-02-01
df.loc["bob":"dave"]
id weight height date
bob 675 63.680724 185.107711 2022-02-02
carol 164 48.476114 180.108224 2022-02-03
dave 531 32.834607 162.617524 2022-02-04
df.loc[df.id<300]
id weight height date
carol 164 48.476114 180.108224 2022-02-03
df.loc[:, "date"]
anna 2022-02-01
bob 2022-02-02
carol 2022-02-03
dave 2022-02-04
erin 2022-02-05
Name: date, dtype: datetime64[ns]
df.loc[["bob","erin"], "weight":"height"]
weight height
bob 63.680724 185.107711
erin 67.902511 172.070975
df.loc[0:2, "weight":"height"]
Error: TypeError: cannot do slice indexing on Index with these indexers [0] of type int
Views vs. Copies
In general most pandas operations will generate a new object but some will return views, mostly the later occurs with subsetting.
d = pd.DataFrame(np.arange(6).reshape(3,2), columns = ["x","y"])d
x y
0 0 1
1 2 3
2 4 5
v = d.iloc[0:2,0:2]v
x y
0 0 1
1 2 3
d.iloc[0,1] =-1v
x y
0 0 -1
1 2 3
v.iloc[0,0] = np.piv
x y
0 3.141593 -1
1 2.000000 3
d
x y
0 0 -1
1 2 3
2 4 5
Filtering rows
The query() method can be used for filtering rows, it evaluates a string expression in the context of the data frame.
df.query('date == "2022-02-01"')
id weight height date
anna 780 58.244369 180.317131 2022-02-01
df.query('weight > 50')
id weight height date
anna 780 58.244369 180.317131 2022-02-01
bob 675 63.680724 185.107711 2022-02-02
erin 463 67.902511 172.070975 2022-02-05
id weight height date
anna 780 58.244369 180.317131 2022-02-01
bob 675 63.680724 185.107711 2022-02-02
carol 164 48.476114 180.108224 2022-02-03
dave 531 32.834607 162.617524 2022-02-04
erin 463 67.902511 172.070975 2022-02-05
df[0,0]
Error: KeyError: (0, 0)
df.iat[0,0]
780
df.id[0]
780
df[0:1].id[0]
780
df["anna", "id"]
Error: KeyError: ('anna', 'id')
df.at["anna", "id"]
780
df["id"]["anna"]
780
df["id"][0]
780
DataFrame properties
df.size
20
df.shape
(5, 4)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, anna to erin
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 5 non-null int64
1 weight 5 non-null float64
2 height 5 non-null float64
3 date 5 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(2), int64(1)
memory usage: 372.0+ bytes
df.dtypes
id int64
weight float64
height float64
date datetime64[ns]
dtype: object
df.describe()
id weight height
count 5.000000 5.000000 5.000000
mean 522.600000 54.227665 176.044313
std 235.423661 13.993203 8.845672
min 164.000000 32.834607 162.617524
25% 463.000000 48.476114 172.070975
50% 531.000000 58.244369 180.108224
75% 675.000000 63.680724 180.317131
max 780.000000 67.902511 185.107711
Selecting Columns
Beyond the use of loc() and iloc() there is also the filter() method which can be used to select columns (or indices) by name with pattern matching
df.filter(items=["id","weight"])
id weight
anna 780 58.244369
bob 675 63.680724
carol 164 48.476114
dave 531 32.834607
erin 463 67.902511
df.filter(like ="i")
id weight height
anna 780 58.244369 180.317131
bob 675 63.680724 185.107711
carol 164 48.476114 180.108224
dave 531 32.834607 162.617524
erin 463 67.902511 172.070975
df.filter(regex="ght$")
weight height
anna 58.244369 180.317131
bob 63.680724 185.107711
carol 48.476114 180.108224
dave 32.834607 162.617524
erin 67.902511 172.070975
df.filter(like="o", axis=0)
id weight height date
bob 675 63.680724 185.107711 2022-02-02
carol 164 48.476114 180.108224 2022-02-03
Adding columns
Indexing with assignment allows for inplace modification of a DataFrame, while assign() creates a new object (but is chainable)
id weight height date student age
anna 780 58.244369 180.317131 2022-02-01 True 19.0
bob 675 63.680724 185.107711 2022-02-02 True 22.0
carol 164 48.476114 180.108224 2022-02-03 True 25.0
dave 531 32.834607 162.617524 2022-02-04 False NaN
erin 463 67.902511 172.070975 2022-02-05 None NaN
id weight height date student age rand
anna 780 58.244369 180.317131 2022-02-01 yes 19.0 0.046582
bob 675 63.680724 185.107711 2022-02-02 yes 22.0 0.644103
carol 164 48.476114 180.108224 2022-02-03 yes 25.0 0.077900
dave 531 32.834607 162.617524 2022-02-04 no NaN 0.309793
erin 463 67.902511 172.070975 2022-02-05 no NaN 0.416640
df
id weight height date student age
anna 780 58.244369 180.317131 2022-02-01 True 19.0
bob 675 63.680724 185.107711 2022-02-02 True 22.0
carol 164 48.476114 180.108224 2022-02-03 True 25.0
dave 531 32.834607 162.617524 2022-02-04 False NaN
erin 463 67.902511 172.070975 2022-02-05 None NaN
Removing columns (and rows)
Columns can be dropped via the drop() method,
df.drop(['student'])
Error: KeyError: "['student'] not found in axis"
df.drop(['student'], axis=1)
id weight height date age
anna 780 58.244369 180.317131 2022-02-01 19.0
bob 675 63.680724 185.107711 2022-02-02 22.0
carol 164 48.476114 180.108224 2022-02-03 25.0
dave 531 32.834607 162.617524 2022-02-04 NaN
erin 463 67.902511 172.070975 2022-02-05 NaN
df.drop(['anna','dave'])
id weight height date student age
bob 675 63.680724 185.107711 2022-02-02 True 22.0
carol 164 48.476114 180.108224 2022-02-03 True 25.0
erin 463 67.902511 172.070975 2022-02-05 None NaN
df.drop(columns = df.columns =="age")
Error: KeyError: '[False, False, False, False, False, True] not found in axis'
df.drop(columns = df.columns[df.columns =="age"])
id weight height date student
anna 780 58.244369 180.317131 2022-02-01 True
bob 675 63.680724 185.107711 2022-02-02 True
carol 164 48.476114 180.108224 2022-02-03 True
dave 531 32.834607 162.617524 2022-02-04 False
erin 463 67.902511 172.070975 2022-02-05 None
id date student age
anna 780 2022-02-01 True 19.0
bob 675 2022-02-02 True 22.0
carol 164 2022-02-03 True 25.0
dave 531 2022-02-04 False NaN
erin 463 2022-02-05 None NaN
Dropping missing values
Columns can be dropped via the drop() method,
df
id weight height date student age
anna 780 58.244369 180.317131 2022-02-01 True 19.0
bob 675 63.680724 185.107711 2022-02-02 True 22.0
carol 164 48.476114 180.108224 2022-02-03 True 25.0
dave 531 32.834607 162.617524 2022-02-04 False NaN
erin 463 67.902511 172.070975 2022-02-05 None NaN
df.dropna()
id weight height date student age
anna 780 58.244369 180.317131 2022-02-01 True 19.0
bob 675 63.680724 185.107711 2022-02-02 True 22.0
carol 164 48.476114 180.108224 2022-02-03 True 25.0
df.dropna(how="all")
id weight height date student age
anna 780 58.244369 180.317131 2022-02-01 True 19.0
bob 675 63.680724 185.107711 2022-02-02 True 22.0
carol 164 48.476114 180.108224 2022-02-03 True 25.0
dave 531 32.834607 162.617524 2022-02-04 False NaN
erin 463 67.902511 172.070975 2022-02-05 None NaN
df.dropna(axis=1)
id weight height date
anna 780 58.244369 180.317131 2022-02-01
bob 675 63.680724 185.107711 2022-02-02
carol 164 48.476114 180.108224 2022-02-03
dave 531 32.834607 162.617524 2022-02-04
erin 463 67.902511 172.070975 2022-02-05
df.dropna(axis=1, thresh=4)
id weight height date student
anna 780 58.244369 180.317131 2022-02-01 True
bob 675 63.680724 185.107711 2022-02-02 True
carol 164 48.476114 180.108224 2022-02-03 True
dave 531 32.834607 162.617524 2022-02-04 False
erin 463 67.902511 172.070975 2022-02-05 None
Sorting
DataFrames can be sorted on one or more columns via sort_values(),
df
id weight height date student age
anna 780 58.244369 180.317131 2022-02-01 True 19.0
bob 675 63.680724 185.107711 2022-02-02 True 22.0
carol 164 48.476114 180.108224 2022-02-03 True 25.0
dave 531 32.834607 162.617524 2022-02-04 False NaN
erin 463 67.902511 172.070975 2022-02-05 None NaN
id weight height date student age
dave 531 32.834607 162.617524 2022-02-04 False NaN
anna 780 58.244369 180.317131 2022-02-01 True 19.0
bob 675 63.680724 185.107711 2022-02-02 True 22.0
carol 164 48.476114 180.108224 2022-02-03 True 25.0
erin 463 67.902511 172.070975 2022-02-05 None NaN
Row binds
DataFrames can have their rows joined via the the concat() function (append() is also available but deprecated),
x y b z
0 0.0 1.0 NaN NaN
1 2.0 3.0 NaN NaN
2 4.0 5.0 NaN NaN
0 1.0 NaN 1.0 1.0
1 1.0 NaN 1.0 1.0
2 1.0 NaN 1.0 1.0
0 12.0 11.0 NaN NaN
1 10.0 9.0 NaN NaN
2 8.0 7.0 NaN NaN
Column binds
Similarly, columns can be joined with concat() where axis=1,
All three can be used to accomplish the same thing, in terms of “column bind” type operations.
concat() stacks DataFrames on either axis, with basic alignment based on (row) indexes. join argument only supports “inner” and “outer”.
merge() aligns based on one or more shared columns. how supports “inner”, “outer”, “left”, “right”, and “cross”.
join() uses merge() behind the scenes, but prefers to join based on (row) indexes. Also has different default how compared to merge(), “left” vs “inner”.
groupby and agg
Groups can be created within a DataFrame via groupby() - these groups are then used by the standard summary methods (e.g. sum(), mean(), std(), etc.).
df.groupby("student")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1701554b0>
df.groupby("student").groups
{False: ['dave'], True: ['anna', 'bob', 'carol']}
df.groupby("student").mean(numeric_only=True)
id weight height age
student
False 531.000000 32.834607 162.617524 NaN
True 539.666667 56.800402 181.844355 22.0
df.groupby("student", dropna=False).groups
Error: ValueError: Categorical categories cannot be null
id weight height age
student
False 531.000000 32.834607 162.617524 NaN
True 539.666667 56.800402 181.844355 22.0
NaN 463.000000 67.902511 172.070975 NaN
Selecting groups
df
id weight height date student age
anna 780 58.244369 180.317131 2022-02-01 True 19.0
bob 675 63.680724 185.107711 2022-02-02 True 22.0
carol 164 48.476114 180.108224 2022-02-03 True 25.0
dave 531 32.834607 162.617524 2022-02-04 False NaN
erin 463 67.902511 172.070975 2022-02-05 None NaN
df.groupby("student").get_group(True)
id weight height date student age
anna 780 58.244369 180.317131 2022-02-01 True 19.0
bob 675 63.680724 185.107711 2022-02-02 True 22.0
carol 164 48.476114 180.108224 2022-02-03 True 25.0
df.groupby("student").get_group(False)
id weight height date student age
dave 531 32.834607 162.617524 2022-02-04 False NaN
id weight height age
student
False 531.000000 32.834607 162.617524 NaN
True 539.666667 56.800402 181.844355 22.0
df.groupby("student").agg([np.mean, np.std])
id weight ... height age
mean std mean ... std mean std
student ...
False 531.000000 NaN 32.834607 ... NaN NaN NaN
True 539.666667 329.545647 56.800402 ... 2.828079 22.0 3.0
[2 rows x 8 columns]