AI-Driven Feature Selection in Python!

Deep-dive on ML techniques for feature selection in Python — Part 3

Indraneel Dutta Baruah · Published in TDS Archive · 9 min read · Jul 10, 2022

Photo source: https://unsplash.com/photos/bTRsbY5RLr4

A) BorutaPy

Image by author
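Boruta works by duplicating every feature, shuffling the copies (so-called shadow features), and keeping only the real features that consistently beat the best shadow. Features that pass this test land in the green area, borderline ones in the blue (tentative) area, and the rest are rejected. The sketch below illustrates a single Boruta-style round on toy data (the dataset and the one-shot comparison are my own illustration; BorutaPy repeats this over many iterations with a statistical test):

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data: 5 informative features plus noise
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=5, random_state=101)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])

# Shadow features: independently shuffled copies of every real feature
shadows = X.apply(np.random.permutation)
shadows.columns = ["shadow_" + c for c in X.columns]
X_plus = pd.concat([X, shadows], axis=1)

# A real feature passes this round only if it out-ranks the best shadow
rf = RandomForestClassifier(n_estimators=200, random_state=101).fit(X_plus, y)
importances = pd.Series(rf.feature_importances_, index=X_plus.columns)
threshold = importances[shadows.columns].max()
print(importances[X.columns][importances[X.columns] > threshold].index.tolist())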
#7. Select features based on BorutaPy method

# Imports needed for this step (skip any already loaded in earlier parts of this series)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import lightgbm as lgb
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from boruta import BorutaPy

# BorutaPy settings
borutapy_estimator = "XGBoost"  # base model: "XGBoost", "RandomForest" or "LightGBM"
borutapy_trials = 10            # number of iterations
borutapy_green_blue = "both"    # "both" = confirmed (green) + tentative (blue) features

################################ Functions ################################

def borutapy_feature_selection(data, train_target, borutapy_estimator,
                               borutapy_trials, borutapy_green_blue):
    # Inputs
    # data                - input feature data
    # train_target        - target variable training data
    # borutapy_estimator  - base model (default: XGBoost)
    # borutapy_trials     - number of iterations
    # borutapy_green_blue - keep green (confirmed) features only, or green + blue (tentative)

    ## Initialize the base estimator
    if borutapy_estimator == "RandomForest":
        # Manual change in parameters - RandomForest
        # https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
        estimator_borutapy = RandomForestClassifier(n_jobs=-1,
                                                    random_state=101,
                                                    max_depth=7)
    elif borutapy_estimator == "LightGBM":
        # Manual change in parameters - LightGBM
        # https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html
        estimator_borutapy = lgb.LGBMClassifier(n_jobs=-1,
                                                random_state=101,
                                                max_depth=7)
    else:
        # Manual change in parameters - XGBoost
        # https://xgboost.readthedocs.io/en/stable/parameter.html
        estimator_borutapy = XGBClassifier(n_jobs=-1,
                                           random_state=101,
                                           max_depth=7)

    ## Fit BorutaPy
    # Function parameters: https://github.com/scikit-learn-contrib/boruta_py
    borutapy = BorutaPy(estimator=estimator_borutapy,
                        n_estimators='auto',
                        max_iter=borutapy_trials)
    borutapy.fit(np.array(data), np.array(train_target))

    ## Print results
    green_area = data.columns[borutapy.support_].to_list()
    blue_area = data.columns[borutapy.support_weak_].to_list()
    print('features in the green area:', green_area)
    print('features in the blue area:', blue_area)

    if borutapy_green_blue == "both":
        borutapy_top_features = green_area + blue_area
    else:
        borutapy_top_features = green_area

    borutapy_top_features_df = pd.DataFrame(borutapy_top_features,
                                            columns=['Feature'])
    borutapy_top_features_df['Method'] = 'Borutapy'

    return borutapy_top_features_df, borutapy

################################ Calculate BorutaPy ################################

borutapy_top_features_df, boruta = borutapy_feature_selection(train_features_v2,
                                                              train_target,
                                                              borutapy_estimator,
                                                              borutapy_trials,
                                                              borutapy_green_blue)
borutapy_top_features_df.head(n=20)
Image by author
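Since the function also returns the fitted BorutaPy object, the reduced feature matrix can be pulled out through its scikit-learn-style transform as well. A quick usage sketch (weak=True is BorutaPy's flag for including the tentative "blue" features):

# Confirmed ("green") features only
X_green = boruta.transform(np.array(train_features_v2))

# Confirmed plus tentative ("blue") features
X_green_blue = boruta.transform(np.array(train_features_v2), weak=True)
print(X_green.shape, X_green_blue.shape)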

B) Boruta SHAP
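Boruta SHAP runs the same accept/reject scheme, but ranks features by SHAP values rather than the model's built-in impurity or gain importance, which tends to behave better with correlated features. As a quick illustration of the underlying measure, this sketch computes a global SHAP importance for a tree model (it assumes the shap package and the training data from the earlier parts of this series):

import numpy as np
import shap
from xgboost import XGBClassifier

model = XGBClassifier(n_jobs=-1, random_state=101, max_depth=7)
model.fit(train_features_v2, train_target)

# Global importance = mean absolute SHAP value per feature
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(train_features_v2)
mean_abs_shap = np.abs(shap_values).mean(axis=0)

for name, val in sorted(zip(train_features_v2.columns, mean_abs_shap),
                        key=lambda t: -t[1])[:10]:
    print(f"{name}: {val:.4f}")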

#8. Select features based on BorutaShap method

# Additional import needed for this step
from BorutaShap import BorutaShap

# BorutaShap settings
borutashap_estimator = "XGBoost"  # base model: "XGBoost", "RandomForest" or "LightGBM"
borutashap_trials = 10            # number of trials
borutashap_green_blue = "both"    # "both" = accepted (green) + tentative (blue) features

################################ Functions ################################

def borutashap_feature_selection(data, train_target, borutashap_estimator,
                                 borutashap_trials, borutashap_green_blue):
    # Inputs
    # data                  - input feature data
    # train_target          - target variable training data
    # borutashap_estimator  - base model (default: XGBoost)
    # borutashap_trials     - number of trials
    # borutashap_green_blue - keep green (accepted) features only, or green + blue (tentative)

    ## Initialize the base estimator
    if borutashap_estimator == "RandomForest":
        # Manual change in parameters - RandomForest
        # https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
        estimator_borutashap = RandomForestClassifier(n_jobs=-1,
                                                      random_state=101,
                                                      max_depth=7)
    elif borutashap_estimator == "LightGBM":
        # Manual change in parameters - LightGBM
        # https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html
        estimator_borutashap = lgb.LGBMClassifier(n_jobs=-1,
                                                  random_state=101,
                                                  max_depth=7)
    else:
        # Manual change in parameters - XGBoost
        # https://xgboost.readthedocs.io/en/stable/parameter.html
        estimator_borutashap = XGBClassifier(n_jobs=-1,
                                             random_state=101,
                                             max_depth=7)

    ## Fit BorutaShap
    # Function parameters: https://github.com/Ekeany/Boruta-Shap
    borutashap = BorutaShap(model=estimator_borutashap,
                            importance_measure='shap',
                            classification=True)
    borutashap.fit(X=data, y=train_target,
                   n_trials=borutashap_trials)

    ## Plot results (notebook magic; drop this line outside Jupyter)
    %matplotlib inline
    borutashap.plot(which_features='all')

    ## Print results
    green_area = borutashap.accepted
    blue_area = borutashap.tentative
    print('features in the green area:', green_area)
    print('features in the blue area:', blue_area)

    if borutashap_green_blue == "both":
        borutashap_top_features = green_area + blue_area
    else:
        borutashap_top_features = green_area

    borutashap_top_features_df = pd.DataFrame(borutashap_top_features,
                                              columns=['Feature'])
    borutashap_top_features_df['Method'] = 'Borutashap'

    return borutashap_top_features_df, borutashap

################################ Calculate BorutaShap ################################

borutashap_top_features_df, borutashap = borutashap_feature_selection(train_features_v2,
                                                                      train_target,
                                                                      borutashap_estimator,
                                                                      borutashap_trials,
                                                                      borutashap_green_blue)
borutashap_top_features_df.head(n=20)
Image by author
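If only the accepted features are needed, the fitted BorutaShap object can hand back the reduced DataFrame directly via its Subset() helper (a brief usage sketch based on the BorutaShap repository's API):

# DataFrame restricted to the accepted ("green") features
X_subset = borutashap.Subset()
print(X_subset.columns.tolist())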

C) Bringing it all together

Never put all your eggs in one basket: rather than trusting any single technique, we keep only the features that a majority of the eight methods agree on.

# Methods selected
selected_method = [corr_top_features_df, woe_top_features_df, beta_top_features_df,
                   lasso_top_features_df, rfe_top_features_df, sfs_top_features_df,
                   borutapy_top_features_df, borutashap_top_features_df]

# Combine the features picked by all the methods
master_df_feature_selection = pd.concat(selected_method, axis=0)

# Keep features picked by more than half of the methods
number_of_methods = len(selected_method)
selection_threshold = int(number_of_methods / 2)
print('Selecting features which are picked by more than', selection_threshold, 'methods')

master_df_feature_selection_v2 = pd.DataFrame(
    master_df_feature_selection.groupby('Feature').size()).reset_index()
master_df_feature_selection_v2.columns = ['Features', 'Count_Method']
master_df_feature_selection_v3 = master_df_feature_selection_v2[
    master_df_feature_selection_v2['Count_Method'] > selection_threshold]

final_features = master_df_feature_selection_v3['Features'].tolist()
print('Final Features Selected:', final_features)

train_features_v2[final_features].hist(figsize=(14, 14), xrot=45)
plt.show()
master_df_feature_selection_v3.head(n=30)
Image by author
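As a final sanity check, it is worth comparing a model trained on all features against one trained on the voted subset. The sketch below uses cross-validated AUC with the same XGBoost settings as above (the 5-fold setup and the metric are illustrative choices, not part of the original pipeline):

from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

model = XGBClassifier(n_jobs=-1, random_state=101, max_depth=7)

auc_all = cross_val_score(model, train_features_v2, train_target,
                          cv=5, scoring='roc_auc').mean()
auc_selected = cross_val_score(model, train_features_v2[final_features], train_target,
                               cv=5, scoring='roc_auc').mean()
print(f"AUC, all features:      {auc_all:.4f}")
print(f"AUC, selected features: {auc_selected:.4f}")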

Final Words

Congratulations! We have now walked through an ensemble of feature selection methods, closing with BorutaPy and Boruta SHAP, and combined their votes into a single, final feature list.

