dplyr vs. pandas

Here’s a markdown table comparing dplyr (R) and pandas (Python) for various data manipulation tasks:

Data frame verbs

  1. Rows

    Feature dplyr (R) pandas (Python)
    Arrange arrange(df, col) df.sort_values(by='col', ascending=True)
    Distinct distinct(df, col) df.drop_duplicates(subset='col')
    Filter filter(df, condition) df[df['condition']]
    Slice slice(df, rows) df.iloc[rows]
  2. Columns

    Feature dplyr (R) pandas (Python)
    Glimpse glimpse(df) df.info()
    Mutate mutate(df, new_col = func(old_col)) df['new_col'] = df['old_col'].apply(func)
    Pull pull(df, col) df['col']
    Rename rename(df, new_col_name = old_col_name) df.rename(columns={'old_col_name': 'new_col_name'}, inplace=True)
    Select select(df, col1, col2, ...) df[['col1', 'col2', ...]]
  3. Groups

    Feature dplyr (R) pandas (Python)
    Group By group_by(df, col) df.groupby('col')
    Summarise summarise(df, new_col = func(col)) df.agg({'col': func})
  4. Data frames

    Feature dplyr (R) pandas (Python)
    bind_cols bind_cols(df1, df2) pd.concat([df1, df2], axis=1)
    bind_rows bind_rows(df1, df2) pd.concat([df1, df2], axis=0)
    Inner Join inner_join(df1, df2, by = "key_column") pd.merge(df1, df2, on='key_column', how='inner')
    Left Join left_join(df1, df2, by = "key_column") pd.merge(df1, df2, on='key_column', how='left')
    Right Join right_join(df1, df2, by = "key_column") pd.merge(df1, df2, on='key_column', how='right')
    Full Join full_join(df1, df2, by = "key_column") pd.merge(df1, df2, on='key_column', how='outer')
    Semi Join semi_join(df1, df2, by = "key_column") Not directly available; can use merge and isin together for similar effect.
    Anti Join anti_join(df1, df2, by = "key_column") Not directly available; can use merge and isin together for similar effect.
  5. Vector functions

    Feature dplyr (R) pandas (Python)
    if_else mutate(df, new_col = if_else(condition, true_val, false_val)) df['new_col'] = np.where(condition, true_val, false_val)
    na_if mutate(df, col = na_if(col, value)) df['col'].replace(value, np.nan)
    n_distinct n_distinct(df, col) df['col'].nunique()
    sample_n sample_n(df, n) df.sample(n=n)
    sample_frac sample_frac(df, fraction) df.sample(frac=fraction)
    case_when mutate(df, new_col = case_when(condition1 ~ value1, condition2 ~ value2, ...)) df['new_col'] = np.select([condition1, condition2, ...], [value1, value2, ...], default=default_value)
    cummean mutate(df, new_col = cummean(col)) df['new_col'] = df['col'].expanding().mean()
    row_number mutate(df, row_num = row_number()) df['row_num'] = range(1, len(df)+1)
    min_rank mutate(df, rank = min_rank(col)) df['rank'] = df['col'].rank(method='min')
    dense_rank mutate(df, rank = dense_rank(col)) df['rank'] = df['col'].rank(method='dense')

    Note that while many functionalities are directly available in both dplyr and pandas, some might require slight variations or custom functions to achieve the same result.