dplyr vs. pandas

The tables below compare dplyr (R) and pandas (Python) for common data manipulation tasks:

Data frame verbs

Rows

Feature	dplyr (R)	pandas (Python)
Arrange	`arrange(df, col)`	`df.sort_values(by='col', ascending=True)`
Distinct	`distinct(df, col, .keep_all = TRUE)`	`df.drop_duplicates(subset='col')`
Filter	`filter(df, condition)`	`df[condition]`, e.g. `df[df['col'] > 0]` or `df.query('col > 0')`
Slice	`slice(df, rows)` (1-based)	`df.iloc[rows]` (0-based)
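As a rough illustration of the row verbs above, here is a minimal pandas sketch. The frame and its column names (`name`, `score`) are made up for the example; the dplyr calls in the comments are the intended equivalents, not output from R.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"name": ["a", "b", "c", "d"], "score": [3, 1, 3, 2]})

# arrange(df, score)
arranged = df.sort_values(by="score")

# distinct(df, score, .keep_all = TRUE): first row for each distinct score
distinct_rows = df.drop_duplicates(subset="score")

# filter(df, score > 1): boolean mask or query string
filtered = df[df["score"] > 1]
filtered_q = df.query("score > 1")

# slice(df, 1:2): R positions are 1-based and inclusive,
# iloc positions are 0-based and end-exclusive
sliced = df.iloc[0:2]
```

Each call returns a new frame and leaves `df` unchanged, which matches how the dplyr verbs behave.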

Columns

Feature	dplyr (R)	pandas (Python)
Glimpse	`glimpse(df)`	`df.info()`
Mutate	`mutate(df, new_col = func(old_col))`	`df['new_col'] = df['old_col'].apply(func)`
Pull	`pull(df, col)`	`df['col']`
Rename	`rename(df, new_col_name = old_col_name)`	`df = df.rename(columns={'old_col_name': 'new_col_name'})`
Select	`select(df, col1, col2, ...)`	`df[['col1', 'col2', ...]]`
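The column verbs can be chained in pandas in much the same way as in dplyr. The sketch below assumes a small made-up frame with columns `old_col` and `other`; it uses `assign` rather than in-place assignment so that each step returns a new frame, which is one possible style and not the only one.

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"old_col": [1, 2, 3], "other": ["x", "y", "z"]})

# mutate(df, new_col = old_col * 2)
mutated = df.assign(new_col=df["old_col"] * 2)

# pull(df, old_col): a single column as a Series
pulled = mutated["old_col"]

# rename(df, renamed = old_col)
renamed = mutated.rename(columns={"old_col": "renamed"})

# select(df, renamed, new_col)
selected = renamed[["renamed", "new_col"]]

# glimpse(df): prints column names, dtypes and non-null counts
df.info()
```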

Groups

Feature	dplyr (R)	pandas (Python)
Group By	`group_by(df, col)`	`df.groupby('col')`
Summarise	`summarise(df, new_col = func(col))`	`df.agg({'col': func})`
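Grouping and summarising are usually combined. A minimal sketch, assuming a made-up frame with a grouping column `col` and a numeric column `value`, using pandas named aggregation to get dplyr-style output column names:

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"col": ["a", "a", "b"], "value": [1.0, 3.0, 5.0]})

# df |> group_by(col) |> summarise(mean_value = mean(value), n = n())
summary = (
    df.groupby("col", as_index=False)
      .agg(mean_value=("value", "mean"), n=("value", "size"))
)

# summarise(df, mean_value = mean(value)) without grouping:
# agg with a dict returns a Series indexed by column name
overall = df.agg({"value": "mean"})
```

With `as_index=False` the grouping column stays a regular column, which is closer to the tibble that `summarise()` returns than the default indexed result.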

Data frames

Feature	dplyr (R)	pandas (Python)
bind_cols	`bind_cols(df1, df2)`	`pd.concat([df1, df2], axis=1)`
bind_rows	`bind_rows(df1, df2)`	`pd.concat([df1, df2], axis=0)`
Inner Join	`inner_join(df1, df2, by = "key_column")`	`pd.merge(df1, df2, on='key_column', how='inner')`
Left Join	`left_join(df1, df2, by = "key_column")`	`pd.merge(df1, df2, on='key_column', how='left')`
Right Join	`right_join(df1, df2, by = "key_column")`	`pd.merge(df1, df2, on='key_column', how='right')`
Full Join	`full_join(df1, df2, by = "key_column")`	`pd.merge(df1, df2, on='key_column', how='outer')`
Semi Join	`semi_join(df1, df2, by = "key_column")`	No single function; `df1[df1['key_column'].isin(df2['key_column'])]`
Anti Join	`anti_join(df1, df2, by = "key_column")`	No single function; `df1[~df1['key_column'].isin(df2['key_column'])]`
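Since pandas has no dedicated semi or anti join, the following sketch shows two common ways to get the same rows for a single key column. The frames and their contents are invented for illustration; for multi-column keys the `merge` with `indicator=True` variant is usually the easier one to extend.

```python
import pandas as pd

# Hypothetical example data
df1 = pd.DataFrame({"key_column": [1, 2, 3, 4], "x": ["a", "b", "c", "d"]})
df2 = pd.DataFrame({"key_column": [2, 4, 4, 5], "y": [10, 20, 30, 40]})

# Boolean mask: does each df1 key appear in df2?
in_df2 = df1["key_column"].isin(df2["key_column"])

# semi_join(df1, df2, by = "key_column"): rows of df1 with a match, no df2 columns
semi = df1[in_df2]

# anti_join(df1, df2, by = "key_column"): rows of df1 without a match
anti = df1[~in_df2]

# Alternative for the anti join: left merge with an indicator column.
# Deduplicate the keys first so matching rows of df1 are not repeated.
keys = df2[["key_column"]].drop_duplicates()
merged = df1.merge(keys, on="key_column", how="left", indicator=True)
anti_via_merge = merged.loc[merged["_merge"] == "left_only", df1.columns]
```

Unlike an inner join, the `isin` approach never duplicates rows of `df1` when `df2` contains repeated keys.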

Vector functions

Feature	dplyr (R)	pandas (Python)
if_else	`mutate(df, new_col = if_else(condition, true_val, false_val))`	`df['new_col'] = np.where(condition, true_val, false_val)`
na_if	`mutate(df, col = na_if(col, value))`	`df['col'] = df['col'].replace(value, np.nan)`
n_distinct	`summarise(df, n = n_distinct(col))`	`df['col'].nunique()`
sample_n	`sample_n(df, n)`	`df.sample(n=n)`
sample_frac	`sample_frac(df, fraction)`	`df.sample(frac=fraction)`
case_when	`mutate(df, new_col = case_when(condition1 ~ value1, condition2 ~ value2, TRUE ~ default_value))`	`df['new_col'] = np.select([condition1, condition2], [value1, value2], default=default_value)`
cummean	`mutate(df, new_col = cummean(col))`	`df['new_col'] = df['col'].expanding().mean()`
row_number	`mutate(df, row_num = row_number())`	`df['row_num'] = range(1, len(df)+1)`
min_rank	`mutate(df, rank = min_rank(col))`	`df['rank'] = df['col'].rank(method='min')`
dense_rank	`mutate(df, rank = dense_rank(col))`	`df['rank'] = df['col'].rank(method='dense')`
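The vector functions above can be tried on a small frame. This is a sketch on invented data with a single numeric column `col`; the thresholds and labels are arbitrary. Note that `rank()` returns floats, so the ranks are cast to integers here to resemble dplyr's output.

```python
import numpy as np
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"col": [10, 20, 20, 40]})

# if_else(col > 15, "big", "small")
df["size"] = np.where(df["col"] > 15, "big", "small")

# case_when(col < 15 ~ "low", col < 30 ~ "mid", TRUE ~ "high")
# np.select picks the first condition that is true for each row
conditions = [df["col"] < 15, df["col"] < 30]
df["band"] = np.select(conditions, ["low", "mid"], default="high")

# cummean(col)
df["cummean"] = df["col"].expanding().mean()

# row_number()
df["row_num"] = range(1, len(df) + 1)

# min_rank(col) and dense_rank(col)
df["min_rank"] = df["col"].rank(method="min").astype(int)
df["dense_rank"] = df["col"].rank(method="dense").astype(int)

# na_if(col, 40): the result becomes float because NaN is a float
with_na = df["col"].replace(40, np.nan)

# n_distinct(col)
n_unique = df["col"].nunique()
```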

Note that while many operations map one-to-one between dplyr and pandas, some, such as semi and anti joins, have no single pandas function and need a short combination of calls to achieve the same result.