dplyr vs. pandas
Here’s a markdown table comparing dplyr
(R) and pandas
(Python) for various data manipulation tasks:
Data frame verbs
Rows
Feature dplyr (R) pandas (Python) Arrange arrange(df, col)
df.sort_values(by='col', ascending=True)
Distinct distinct(df, col)
df.drop_duplicates(subset='col')
Filter filter(df, condition)
df[df['condition']]
Slice slice(df, rows)
df.iloc[rows]
Columns
Feature dplyr (R) pandas (Python) Glimpse glimpse(df)
df.info()
Mutate mutate(df, new_col = func(old_col))
df['new_col'] = df['old_col'].apply(func)
Pull pull(df, col)
df['col']
Rename rename(df, new_col_name = old_col_name)
df.rename(columns={'old_col_name': 'new_col_name'}, inplace=True)
Select select(df, col1, col2, ...)
df[['col1', 'col2', ...]]
Groups
Feature dplyr (R) pandas (Python) Group By group_by(df, col)
df.groupby('col')
Summarise summarise(df, new_col = func(col))
df.agg({'col': func})
Data frames
Feature dplyr (R) pandas (Python) bind_cols bind_cols(df1, df2)
pd.concat([df1, df2], axis=1)
bind_rows bind_rows(df1, df2)
pd.concat([df1, df2], axis=0)
Inner Join inner_join(df1, df2, by = "key_column")
pd.merge(df1, df2, on='key_column', how='inner')
Left Join left_join(df1, df2, by = "key_column")
pd.merge(df1, df2, on='key_column', how='left')
Right Join right_join(df1, df2, by = "key_column")
pd.merge(df1, df2, on='key_column', how='right')
Full Join full_join(df1, df2, by = "key_column")
pd.merge(df1, df2, on='key_column', how='outer')
Semi Join semi_join(df1, df2, by = "key_column")
Not directly available; can use merge
andisin
together for similar effect.Anti Join anti_join(df1, df2, by = "key_column")
Not directly available; can use merge
andisin
together for similar effect.Vector functions
Feature dplyr (R) pandas (Python) if_else mutate(df, new_col = if_else(condition, true_val, false_val))
df['new_col'] = np.where(condition, true_val, false_val)
na_if mutate(df, col = na_if(col, value))
df['col'].replace(value, np.nan)
n_distinct n_distinct(df, col)
df['col'].nunique()
sample_n sample_n(df, n)
df.sample(n=n)
sample_frac sample_frac(df, fraction)
df.sample(frac=fraction)
case_when mutate(df, new_col = case_when(condition1 ~ value1, condition2 ~ value2, ...))
df['new_col'] = np.select([condition1, condition2, ...], [value1, value2, ...], default=default_value)
cummean mutate(df, new_col = cummean(col))
df['new_col'] = df['col'].expanding().mean()
row_number mutate(df, row_num = row_number())
df['row_num'] = range(1, len(df)+1)
min_rank mutate(df, rank = min_rank(col))
df['rank'] = df['col'].rank(method='min')
dense_rank mutate(df, rank = dense_rank(col))
df['rank'] = df['col'].rank(method='dense')
Note that while many functionalities are directly available in both
dplyr
andpandas
, some might require slight variations or custom functions to achieve the same result.