Convert the MLR3 LightGBM model from R to Python

Author

Nguyễn Ngọc Bình

A. Example LightGBM model is created by MLR3 in R

Build the LightGBM model with the following steps:

  1. Loads the necessary libraries for mlr3, LightGBM, and data manipulation.
  2. Sets the logging threshold for mlr3 to the warning level.
  3. Loads the German Credit dataset and creates a classification task.
  4. Defines a preprocessing pipeline with specific operations such as imputation, encoding, and feature filtering.
  5. Sets parameter values for the preprocessing steps (e.g., filter fraction).
  6. Defines a LightGBM learner with a specified number of iterations.
  7. Combines the preprocessing and learner into a single pipeline.
  8. Creates a GraphLearner to encapsulate the pipeline.
  9. Trains the model on the classification task.
  10. Makes predictions on the task using the trained model.
  11. Extracts the LightGBM model from the pipeline.
  12. Specifies a filename for saving the LightGBM model.
  13. Saves the LightGBM model to a file.
# Load necessary libraries
library("mlr3verse")
library("mlr3learners")
library("mlr3tuning")
library("data.table")
library("ggplot2")

# Set logging threshold for mlr3 to warning level
lgr::get_logger("mlr3")$set_threshold("warn")

# Load the German Credit dataset from the rchallenge package
# (install rchallenge first if it is not already installed)
data("german", package = "rchallenge")

# Create a classification task with target variable 'credit_risk'
task = as_task_classif(german, id = "GermanCredit", target = "credit_risk")

# Define preprocessing steps as a pipeline
preprocess <- po("imputeoor") %>>% 
  po("encodeimpact", param_vals = list(impute_zero = TRUE)) %>>%  
  po("filter", flt("auc")) %>>%  
  po("filter", flt("find_correlation", method = "spearman", use = "na.or.complete"))

# Set parameter values for preprocessing steps
preprocess$param_set$values$auc.filter.frac <- 0.5
preprocess$param_set$values$find_correlation.filter.frac <- 0.5

# Define the learner (LightGBM)
learner <- lrn("classif.lightgbm", num_iterations = 100)

# Define the pipeline by combining preprocessing and the learner
pipeline <- preprocess %>>% learner

# Create a GraphLearner to encapsulate the pipeline
model <- GraphLearner$new(pipeline)

# Train the model
model$train(task)

# Make predictions
predictions <- model$predict(task)

B. Convert the MLR3 LightGBM model to Python

Step 1: Extract preprocessing

Extract the fitted state of the imputeoor and encodeimpact preprocessing steps from the model:

# Return the out-of-range impute value learned for a given column
f_extract_impute <- function(col) {
  model$state$model$imputeoor$model[[col]]
}

# Return the impact-encoding table learned for a given column
f_extract_encodeimpact <- function(col) {
  as.data.frame(model$state$model$encodeimpact$impact[col])
}
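On the Python side, the impact-encoding tables exported by `f_extract_encodeimpact` can be applied as plain lookup maps. The sketch below uses a hypothetical table for the `status` column (the levels and scores are illustrative, not the real fitted values); unseen levels fall back to 0, mirroring `impute_zero = TRUE` in the R pipeline:

```python
import pandas as pd

# Hypothetical impact table in the shape returned by f_extract_encodeimpact("status");
# the levels and scores below are illustrative only
status_impact = pd.DataFrame({
    "level": ["no checking account", "... < 0 DM", "0 <= ... < 200 DM"],
    "good": [0.41, -0.32, -0.05],
})

def apply_impact_encoding(df, col, impact_table, target_class="good"):
    """Replace a categorical column with its learned impact score."""
    mapping = dict(zip(impact_table["level"], impact_table[target_class]))
    out = df.copy()
    # Unseen levels get 0, matching impute_zero = TRUE in the R encodeimpact step
    out[f"{col}.{target_class}"] = out[col].map(mapping).fillna(0.0)
    return out.drop(columns=[col])

df = pd.DataFrame({"status": ["no checking account", "some unseen level"]})
encoded = apply_impact_encoding(df, "status", status_impact)
# encoded now has a single 'status.good' column with values [0.41, 0.0]
```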

Note: check which features actually reach the final LightGBM model, and implement the Python preprocessing only for those features.

# Access the classif.lightgbm learner model
model$state$model$classif.lightgbm

(1000 x 10)
* Target: credit_risk
* Properties: twoclass
* Features (9):
  - dbl (7): credit_history.good, employment_duration.good, housing.good,
    personal_status_sex.bad, purpose.good, savings.good, status.good
  - int (2): age, amount

Step 2: Save LightGBM model

# Load the lightgbm package, which provides lgb.save()
library(lightgbm)

# Extract the trained LightGBM model from the pipeline
lightgbm_model <- model$state$model$classif.lightgbm$model

# Specify the filename for saving the LightGBM model
model_file <- "lightgbm_model.txt"

# Save the LightGBM model to a file
lgb.save(lightgbm_model, model_file)

Step 3: Create preprocessing function in python

import pandas as pd

# Function to impute missing values and order columns for the model
def f_impute_values(missing_df):
    # Select the desired columns in the specified order
    sel_features = [
          "credit_history.good", "employment_duration.good", "housing.good",
            "personal_status_sex.bad", "purpose.good", "savings.good", "status.good",
            "age", "amount"
    ]
    
    # Out-of-range impute values extracted from the R model via f_extract_impute
    impute_data = pd.DataFrame({
        'featureName': ["age", "amount"],
        'impute_value': [-38, -17925]
    })

    # Filter impute_data to include only feature names that exist in missing_df
    impute_data = impute_data[impute_data['featureName'].isin(missing_df.columns)]

    # Create a dictionary of feature names and their impute values
    impute_dict = dict(zip(impute_data['featureName'], impute_data['impute_value']))

    # Impute missing values in missing_df based on the impute_dict
    missing_df.fillna(impute_dict, inplace=True)

    # Select the desired columns in the specified order
    missing_df = missing_df[sel_features]

    return missing_df
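The heart of `f_impute_values` is `fillna` with a per-column dictionary. A minimal demonstration of that pattern, with made-up input values:

```python
import pandas as pd
import numpy as np

# Two rows, each missing one value
df = pd.DataFrame({"age": [25, np.nan], "amount": [np.nan, 5000]})

# Fill each column's gaps with its out-of-range constant
df = df.fillna({"age": -38, "amount": -17925})
# -> age: [25.0, -38.0], amount: [-17925.0, 5000.0]
```

The out-of-range constants work because tree models simply split them into their own branch, so "missing" becomes a learnable category.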

Step 4: Transfer the saved model file (“lightgbm_model.txt”) from R to your Python environment

In Python, use the LightGBM library to load the model from the saved file and make predictions. You’ll also need to load any necessary libraries and install LightGBM if you haven’t already:

import lightgbm as lgb
import pandas as pd

# Load the saved LightGBM model from the file
model = lgb.Booster(model_file='lightgbm_model.txt')

# Load your new data for prediction as a pandas DataFrame
# Replace 'new_data.csv' with the actual path to your data file
new_data = pd.read_csv('new_data.csv')

# Preprocess the new data (impute missing values, order columns)
imputed_df = f_impute_values(new_data)

# Make predictions on the new data
predictions = model.predict(imputed_df)

# The 'predictions' variable now contains the model's predictions for the new data

Make sure to replace 'new_data.csv' with the actual path to your new data file in the pd.read_csv line.
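Note that the booster returns raw probabilities rather than class labels. Assuming it outputs the probability of the positive class (verify which level of `credit_risk` mlr3 treated as positive; by default it is the first factor level), labels can be recovered with a 0.5 threshold:

```python
import numpy as np

# Illustrative probabilities; 'predictions' from model.predict() has this shape
probs = np.array([0.83, 0.41, 0.97])

# Threshold at 0.5 to recover class labels
labels = np.where(probs >= 0.5, "good", "bad")
# -> ['good', 'bad', 'good']
```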

With these steps, you can load the MLR3 LightGBM model in Python, and then use it to make predictions on new data.