Computing SHAP values with the LightGBM library

Author

Nguyễn Ngọc Bình

Install the required libraries:
- Make sure that shap, lightgbm, pandas, numpy, scikit-learn, and scipy are installed.
```
pip install shap lightgbm pandas numpy scikit-learn scipy
```
Load and prepare the data:
- Use load_breast_cancer from sklearn.datasets to load the data.
- Convert the data to a DataFrame and split it into a training set and a test set.

import pandas as pd
import numpy as np
import shap
import lightgbm as lgbm
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from scipy.special import expit

shap.initjs()
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Train the model:
- Create a LightGBM classifier and fit it on the training set.

model = lgbm.LGBMClassifier(verbose=-1)
model.fit(X_train, y_train)

LGBMClassifier(verbose=-1)

Compute the SHAP values:
- Create a TreeExplainer from the trained model.
- Compute the SHAP values for the training set.

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)

LightGBM binary classifier with TreeExplainer shap values output has changed to a list of ndarray
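
The warning above says that this run returned a list of arrays, one per class, which is why the later cells index `shap_values[class_idx]`. Other shap releases may return a single array instead, so the sketch below shows one way to pick out a class regardless of layout. `select_class` is a hypothetical helper, not part of shap, and the three layouts it handles are assumptions about what an installed version might return.

```python
def select_class(shap_values, expected_value, class_idx=1):
    """Return (per-row SHAP values, base value) for one class of a binary model.

    Hypothetical helper, not part of shap. Assumed layouts:
      * a list with one (n_rows, n_features) array per class,
      * one array shaped (n_rows, n_features, n_classes),
      * one array shaped (n_rows, n_features) for the positive class only.
    """
    if isinstance(shap_values, list):
        return shap_values[class_idx], float(expected_value[class_idx])
    if getattr(shap_values, "ndim", None) == 3:
        return shap_values[:, :, class_idx], float(expected_value[class_idx])
    # Single-output layout: the values are log-odds contributions toward
    # class 1, so the class-0 view is the same numbers with the sign flipped.
    base = expected_value[-1] if hasattr(expected_value, "__len__") else expected_value
    sign = 1.0 if class_idx == 1 else -1.0
    return sign * shap_values, sign * float(base)


# Toy list-layout input: 2 classes, 1 row, 2 features (made-up numbers).
toy_values = [[[-1.0, -2.0]], [[1.0, 2.0]]]
toy_expected = [-0.5, 0.5]
values, base = select_class(toy_values, toy_expected, class_idx=1)
print(values[0], base)
```

With the notebook's objects, the call would be `select_class(shap_values, explainer.expected_value, class_idx)`.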

Create the SHAP plot:
- Take the SHAP values for one specific row and draw a force plot. With link="logit", the tick labels are shown as probabilities rather than log-odds.

class_idx = 1  # Index of the class of interest
row_idx = 0    # Index of the data row of interest
expected_value = explainer.expected_value[class_idx]
shap_value = shap_values[class_idx][row_idx]
shap.force_plot(
    base_value=expected_value,
    shap_values=shap_value,
    features=X_train.iloc[row_idx, :],
    link="logit",
)

Compute the probability from the SHAP values:
- Predict the probability with the model.
- Compute the mean of the raw scores.
- Get the base value and the sum of the SHAP values for the chosen row.
- Apply the sigmoid function (expit) to turn that sum into a probability.
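
These steps rely on SHAP values being additive in the model's raw-score (log-odds) space. As a sketch, writing $\phi_0$ for the base value, $\phi_j(x)$ for the SHAP value of feature $j$ on row $x$, and $M$ for the number of features (notation introduced here for illustration, not taken from the notebook):

```latex
\begin{aligned}
z(x) &= \phi_0 + \sum_{j=1}^{M} \phi_j(x), \\
p(x) &= \sigma\bigl(z(x)\bigr) = \frac{1}{1 + e^{-z(x)}}
\end{aligned}
```

Here $z(x)$ is the raw score that the model assigns to the chosen class and $p(x)$ is the matching predicted probability, so the sigmoid is applied once to the total and not to each SHAP value separately.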

from scipy.special import expit

# Predicted probability for the chosen row
model_proba = model.predict_proba(X_train.iloc[[row_idx]])

# Mean raw score (log-odds) over the training set
mean_raw_score = model.predict(X_train, raw_score=True).mean()

# Base value for the chosen class
bv = explainer.expected_value[class_idx]

# Sum of the SHAP values for the chosen row
sv_0 = shap_values[class_idx][row_idx].sum()

# Probability reconstructed from the SHAP values
shap_proba = expit(bv + sv_0)

print("Model Probability:", model_proba)
print("Mean Raw Score:", mean_raw_score)
print("Base Value:", bv)
print("Summed SHAP Values:", sv_0)
print("SHAP Probability:", shap_proba)

Model Probability: [[0.00275887 0.99724113]]
Mean Raw Score: 2.4839751932445577
Base Value: 2.4839751932445573
Summed SHAP Values: 3.40619583933988
SHAP Probability: 0.9972411289322419
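
The printed numbers can be checked by hand. The sketch below repeats the arithmetic in plain Python, using only values copied from the output above; `sigmoid` is a local stand-in for `scipy.special.expit`, defined here so the snippet needs no third-party packages.

```python
import math


def sigmoid(z):
    """Logistic function, a pure-Python counterpart of scipy.special.expit."""
    return 1.0 / (1.0 + math.exp(-z))


# Values copied from the printed output above.
mean_raw_score = 2.4839751932445577
base_value = 2.4839751932445573
summed_shap = 3.40619583933988
model_prob_class1 = 0.99724113  # rounded by NumPy's array printing

raw_score = base_value + summed_shap  # log-odds of class 1 for this row
shap_prob = sigmoid(raw_score)

print(f"raw score:   {raw_score:.6f}")
print(f"probability: {shap_prob:.8f}")
# The base value equals the mean raw score up to floating-point noise.
print("base == mean raw score:", abs(base_value - mean_raw_score) < 1e-12)
# Agreement up to the 8 decimals shown for the model probability.
print("matches model:", abs(shap_prob - model_prob_class1) < 1e-8)
```

Both comparisons use a tolerance because the two sides are computed along different floating-point paths, and the model probability is only printed to eight decimals.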