This GitLab repository contains the main material used in the paper [title] by [authors].
The study examines patients undergoing treatment for alcohol disorders, utilizing machine learning techniques to predict clinical success or withdrawal. The main goal is to employ explainability tools to assess the impact of individual versus social factors on treatment outcomes. Additionally, the research explores whether the significance of these factors changed during the pandemic by comparing pre-pandemic and post-pandemic patient groups.
[Impact?]
## About the Dataset
[Origin, Characteristics]
The dataset is not provided, as the authors do not have permission from the data providers to share it.
## Dealing with Class Imbalance
One of the primary challenges we encountered was a significant class imbalance: far more patients withdrew from treatment than stayed.
To address this issue, we implemented four different training approaches or pipelines on both the pre-pandemic and post-pandemic training datasets:
1. **Using the Original Dataset (ORIG)**: The models were trained on the original datasets.
2. **Class Weight Adjustment (ORIG_CW)**: The models were trained on the original datasets but were penalized more heavily for misclassifying the minority class.
3. **Oversampling (OVER)**: Additional samples were generated for the minority class (patients staying) to balance the dataset.
...
These approaches resulted in multiple training datasets. However, to ensure a fair comparison of the models' performance across different pipelines, we utilized a common test dataset for evaluation, irrespective of the training approach followed.
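The pipelines above can be sketched as follows. This is a minimal illustration assuming scikit-learn; the data, model choice, and variable names are placeholders, not the study's actual setup:

```python
# Sketch of the imbalance-handling pipelines (illustrative, not the study's code).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the patient data: class 1 (staying) is the minority.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (rng.random(500) < 0.25).astype(int)

# A single common test set, shared by every pipeline for a fair comparison.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 1. ORIG: train directly on the original (imbalanced) training data.
orig = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# 2. ORIG_CW: weight classes inversely to their frequency, so minority-class
#    misclassifications are penalized more heavily.
orig_cw = RandomForestClassifier(
    class_weight="balanced", random_state=42).fit(X_train, y_train)

# 3. OVER: oversample the minority class (with replacement) until balanced.
X_min, y_min = X_train[y_train == 1], y_train[y_train == 1]
n_extra = int((y_train == 0).sum() - (y_train == 1).sum())
X_extra, y_extra = resample(X_min, y_min, n_samples=n_extra, random_state=42)
over = RandomForestClassifier(random_state=42).fit(
    np.vstack([X_train, X_extra]), np.concatenate([y_train, y_extra]))

# Every model is evaluated on the same held-out test set.
for name, model in [("ORIG", orig), ("ORIG_CW", orig_cw), ("OVER", over)]:
    print(name, round(model.score(X_test, y_test), 3))
```

Keeping the test split out of the resampling step matters: oversampling before splitting would leak duplicated minority samples into the test set and inflate the scores.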
## Repository Structure

* [EDA](./EDA):
  * [EDA.ipynb](./EDA/EDA.ipynb): Exploring and filtering data, handling missing values, encoding variables, building the final pre- and post-pandemic datasets, and generating plots for feature distributions, correlations and importance.
  * [results](./EDA/results): Plots of feature distributions, correlations and importance.
* [gen_train_data](./gen_train_data):
  * [gen_train_data.ipynb](./gen_train_data/gen_train_data.ipynb): Generating training and testing datasets for each of the pipelines.
* [model_selection](./model_selection):
  * [hyperparam_tuning.py](./model_selection/hyperparam_tuning.py): Tuning models through a random search of hyperparameters.
  * [cv_metric_gen.py](./model_selection/cv_metric_gen.py): Generating cross-validation metrics and plots for each of the tuned models.
  * [cv_metrics_distr.py](./model_selection/cv_metrics_distr.py): Generating boxplots for each cross-validation metric and tuned model.
  * [test_models.py](./model_selection/test_models.py): Testing the tuned models on the test dataset.
  * [fit_final_models.py](./model_selection/fit_final_models.py): Saving a fitted model for each selected final model.
  * [results](./model_selection/results):
    * [hyperparam](./model_selection/output/hyperparam): Excel file containing the optimal hyperparameters for each model in each pipeline.
    * [cv_metrics](./model_selection/output/cv_metrics): Material related to the cross-validation results: scores, ROC and Precision-Recall curves, and boxplots for each metric.
    * [testing](./model_selection/output/testing): Material related to the results of testing the tuned models: scores, ROC and Precision-Recall curves, and confusion matrices.
    * [fitted_models](./model_selection/output/fitted_models): Final selected trained models.
* [explainability](./explainability):
  * [compute_shap_vals.py](./explainability/compute_shap_vals.py): Computing SHAP values for the final models.
  * [compute_shap_inter_vals.py](./explainability/compute_shap_inter_vals.py): Computing SHAP interaction values for the final models.
  * [shap_plots.py](./explainability/shap_plots.py): Generating SHAP summary plots for the computed SHAP and SHAP interaction values, and comparing major differences between the pre- and post-pandemic groups.
  * [results](./explainability/results): SHAP and SHAP interaction summary plots.