DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. Let’s use pandas to explore this topic!
import pandas as pdimport numpy as np
from numpy.random import randnnp.random.seed(101)
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df
W
X
Y
Z
A
2.706850
0.628133
0.907969
0.503826
B
0.651118
-0.319318
-0.848077
0.605965
C
-2.018168
0.740122
0.528813
-0.589001
D
0.188695
-0.758872
-0.933237
0.955057
E
0.190794
1.978757
2.605967
0.683509
Selection and Indexing
Let’s learn the various methods to grab data from a DataFrame
df['W']
A 2.706850
B 0.651118
C -2.018168
D 0.188695
E 0.190794
Name: W, dtype: float64
# Pass a list of column namesdf[['W','Z']]
W
Z
A
2.706850
0.503826
B
0.651118
0.605965
C
-2.018168
-0.589001
D
0.188695
0.955057
E
0.190794
0.683509
# SQL Syntax (NOT RECOMMENDED!)df.W
A 2.706850
B 0.651118
C -2.018168
D 0.188695
E 0.190794
Name: W, dtype: float64
DataFrame Columns are just Series
type(df['W'])
pandas.core.series.Series
Creating a new column:
df['new'] = df['W'] + df['Y']
df
W
X
Y
Z
new
A
2.706850
0.628133
0.907969
0.503826
3.614819
B
0.651118
-0.319318
-0.848077
0.605965
-0.196959
C
-2.018168
0.740122
0.528813
-0.589001
-1.489355
D
0.188695
-0.758872
-0.933237
0.955057
-0.744542
E
0.190794
1.978757
2.605967
0.683509
2.796762
Removing Columns
df.drop('new',axis=1)
W
X
Y
Z
A
2.706850
0.628133
0.907969
0.503826
B
0.651118
-0.319318
-0.848077
0.605965
C
-2.018168
0.740122
0.528813
-0.589001
D
0.188695
-0.758872
-0.933237
0.955057
E
0.190794
1.978757
2.605967
0.683509
# Not inplace unless specified!df
W
X
Y
Z
new
A
2.706850
0.628133
0.907969
0.503826
3.614819
B
0.651118
-0.319318
-0.848077
0.605965
-0.196959
C
-2.018168
0.740122
0.528813
-0.589001
-1.489355
D
0.188695
-0.758872
-0.933237
0.955057
-0.744542
E
0.190794
1.978757
2.605967
0.683509
2.796762
df.drop('new',axis=1,inplace=True)
df
W
X
Y
Z
A
2.706850
0.628133
0.907969
0.503826
B
0.651118
-0.319318
-0.848077
0.605965
C
-2.018168
0.740122
0.528813
-0.589001
D
0.188695
-0.758872
-0.933237
0.955057
E
0.190794
1.978757
2.605967
0.683509
Can also drop rows this way:
df.drop('E',axis=0)
W
X
Y
Z
A
2.706850
0.628133
0.907969
0.503826
B
0.651118
-0.319318
-0.848077
0.605965
C
-2.018168
0.740122
0.528813
-0.589001
D
0.188695
-0.758872
-0.933237
0.955057
Selecting Rows
df.loc['A']
W 2.706850
X 0.628133
Y 0.907969
Z 0.503826
Name: A, dtype: float64
Or select based off of position instead of label
df.iloc[2]
W -2.018168
X 0.740122
Y 0.528813
Z -0.589001
Name: C, dtype: float64
Selecting subset of rows and columns
df.loc['B','Y']
-0.8480769834036315
df.loc[['A','B'],['W','Y']]
W
Y
A
2.706850
0.907969
B
0.651118
-0.848077
Conditional Selection
An important feature of pandas is conditional selection using bracket notation, very similar to numpy:
df
W
X
Y
Z
A
2.706850
0.628133
0.907969
0.503826
B
0.651118
-0.319318
-0.848077
0.605965
C
-2.018168
0.740122
0.528813
-0.589001
D
0.188695
-0.758872
-0.933237
0.955057
E
0.190794
1.978757
2.605967
0.683509
df>0
W
X
Y
Z
A
True
True
True
True
B
True
False
False
True
C
False
True
True
False
D
True
False
False
True
E
True
True
True
True
df[df>0]
W
X
Y
Z
A
2.706850
0.628133
0.907969
0.503826
B
0.651118
NaN
NaN
0.605965
C
NaN
0.740122
0.528813
NaN
D
0.188695
NaN
NaN
0.955057
E
0.190794
1.978757
2.605967
0.683509
df[df['W']>0]
W
X
Y
Z
A
2.706850
0.628133
0.907969
0.503826
B
0.651118
-0.319318
-0.848077
0.605965
D
0.188695
-0.758872
-0.933237
0.955057
E
0.190794
1.978757
2.605967
0.683509
df[df['W']>0]['Y']
A 0.907969
B -0.848077
D -0.933237
E 2.605967
Name: Y, dtype: float64
df[df['W']>0][['Y','X']]
Y
X
A
0.907969
0.628133
B
-0.848077
-0.319318
D
-0.933237
-0.758872
E
2.605967
1.978757
For two conditions you can use | and & with parenthesis:
df[(df['W']>0) & (df['Y'] >1)]
W
X
Y
Z
E
0.190794
1.978757
2.605967
0.683509
More Index Details
Let’s discuss some more features of indexing, including resetting the index or setting it something else. We’ll also talk about index hierarchy!
df
W
X
Y
Z
A
2.706850
0.628133
0.907969
0.503826
B
0.651118
-0.319318
-0.848077
0.605965
C
-2.018168
0.740122
0.528813
-0.589001
D
0.188695
-0.758872
-0.933237
0.955057
E
0.190794
1.978757
2.605967
0.683509
# Reset to default 0,1...n indexdf.reset_index()
index
W
X
Y
Z
0
A
2.706850
0.628133
0.907969
0.503826
1
B
0.651118
-0.319318
-0.848077
0.605965
2
C
-2.018168
0.740122
0.528813
-0.589001
3
D
0.188695
-0.758872
-0.933237
0.955057
4
E
0.190794
1.978757
2.605967
0.683509
newind ='CA NY WY OR CO'.split()
df['States'] = newind
df
W
X
Y
Z
States
A
2.706850
0.628133
0.907969
0.503826
CA
B
0.651118
-0.319318
-0.848077
0.605965
NY
C
-2.018168
0.740122
0.528813
-0.589001
WY
D
0.188695
-0.758872
-0.933237
0.955057
OR
E
0.190794
1.978757
2.605967
0.683509
CO
df.set_index('States')
W
X
Y
Z
States
CA
2.706850
0.628133
0.907969
0.503826
NY
0.651118
-0.319318
-0.848077
0.605965
WY
-2.018168
0.740122
0.528813
-0.589001
OR
0.188695
-0.758872
-0.933237
0.955057
CO
0.190794
1.978757
2.605967
0.683509
df
W
X
Y
Z
States
A
2.706850
0.628133
0.907969
0.503826
CA
B
0.651118
-0.319318
-0.848077
0.605965
NY
C
-2.018168
0.740122
0.528813
-0.589001
WY
D
0.188695
-0.758872
-0.933237
0.955057
OR
E
0.190794
1.978757
2.605967
0.683509
CO
df.set_index('States',inplace=True)
df
W
X
Y
Z
States
CA
2.706850
0.628133
0.907969
0.503826
NY
0.651118
-0.319318
-0.848077
0.605965
WY
-2.018168
0.740122
0.528813
-0.589001
OR
0.188695
-0.758872
-0.933237
0.955057
CO
0.190794
1.978757
2.605967
0.683509
DataFrame Summaries
There are a couple of ways to obtain summary data on DataFrames. df.describe() provides summary statistics on all numerical columns. df.info and df.dtypes displays the data type of all columns.
df.describe()
W
X
Y
Z
count
5.000000
5.000000
5.000000
5.000000
mean
0.343858
0.453764
0.452287
0.431871
std
1.681131
1.061385
1.454516
0.594708
min
-2.018168
-0.758872
-0.933237
-0.589001
25%
0.188695
-0.319318
-0.848077
0.503826
50%
0.190794
0.628133
0.528813
0.605965
75%
0.651118
0.740122
0.907969
0.683509
max
2.706850
1.978757
2.605967
0.955057
df.dtypes
W float64
X float64
Y float64
Z float64
dtype: object
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, CA to CO
Data columns (total 4 columns):
W 5 non-null float64
X 5 non-null float64
Y 5 non-null float64
Z 5 non-null float64
dtypes: float64(4)
memory usage: 200.0+ bytes