diff --git a/README.md b/README.md
index 0022e06a5069de67d15411f35489889226b29bcc..b147685a2e4dd7125872e95315f83dce0df81124 100644
--- a/README.md
+++ b/README.md
@@ -37,21 +37,65 @@ Note: It is recommended to use C++11 and pybind11.
 ## Usage
-First the file generate_the_excel.py filters the original dataset and provides a file without repetitions of the same proteins with different names.
+### Data Preprocessing
+(Optional) Run generate_the_excel.py to filter the original dataset and remove repetitions of proteins with different names.
-In case of replicating the similarity matrices, compute_distance_mat.py provides the distance matrices based on the input wanted. To replicate the experiment, if used blosum62, the function generate_nwmodpremade for any other cases is generate_nwmod.
+The data used is available in Output in case the objective is to replicate the experiment from later steps.
-For the pattern search algorithm one have to introduce the data asked in param_file.conf and execute patterns.py
+### Similarity Matrices
-The files summary.py and similarityAllProteins.py provide the summary files of pattern search and similarity matrices.
+To replicate the similarity matrices:
-Once every previous file is generated the code in Code statistical methods provide the graphics and statistical analysis as the ones shown in the publication.
+Use compute_distance_mat.py to generate distance matrices.
-The sandkeys are generated in Pattern found - Sankey plots.ipynb
+1. For BLOSUM62, use the generate_nwmodpremade function.
-Analysis_of_similarities-patterns_significance__Simi_BLOSUM.ipynb, Analysis_of_similarities-patterns_significance__Simi_AA.ipynb and Analysis_of_similarities-patterns_significance__Simi_AA2.ipynb searches the similarities using the similarity matrices previously computed for blosum, mod1(AA) and mod2(AA2).
+2. For other cases, use generate_nwmod.
-Analysis_of_similarities-patterns_significance__DR.ipynb stablish the proposed drug repurposing analysis.
+### Pattern Search
+Configure the param_file.conf file with the required data.
+
+Execute patterns.py to run the pattern search algorithm using the data in param_file.conf.
+
+### Statistical Analysis
+
+Use summary.py and similarityAllProteins.py to generate summary files for pattern search and similarity matrices.
+
+The code in Code statistical methods provides the graphics and statistical analysis shown in the publication.
+
+In case of replicating the sandkeys plots use the notebook: **Pattern found - Sankey plots.ipynb**.
+
+The analysis is divided in the 3 following parts:
+
+1. **Analysis_of_similarities-patterns_significance__Simi_BLOSUM.ipynb** (for BLOSUM).
+
+2. **Analysis_of_similarities-patterns_significance__Simi_AA.ipynb** (for mod1/AA).
+
+3. **Analysis_of_similarities-patterns_significance__Simi_AA2.ipynb** (for mod2/AA2).
+
+Each of the files provide the statistical analysis for each of the inputs provided to the similarity algorithm.
+
+Perform drug repurposing analysis is reflected in: **Analysis_of_similarities-patterns_significance__DR.ipynb**.
+
+
+## Step-by-Step Experiment Reproduction
+
+To reproduce the steps followed during the experiment:
+
+###Preprocess the Data
+Begin by preprocessing the data to ensure it is clean and ready for analysis.
+
+####Calculate Protein Similarities
+Compute the similarity between proteins for each dataset. Use different matrices (e.g., BLOSUM62, AA, AA2) as input for each case.
+
+###Identify Patterns
+Search for patterns in the treatment_lung_cancer.csv file. Then, locate these patterns in the data_cancers_disease.csv file.
+
+###Perform Statistical Analysis
+Conduct the statistical analysis to validate the findings.
+
+###Generate Visualizations
+Plot the graphics to visualize the results and support the conclusions.
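
Note on the script order described in the new Usage section: the steps could be chained with a small driver like the sketch below. This is not part of the repository. The script names are taken from the README text, but `build_commands`, `run_pipeline`, and the assumption that every script runs from the repository root without command-line arguments are hypothetical. In particular, compute_distance_mat.py may need its `generate_nwmodpremade` or `generate_nwmod` call selected by hand before it is run.

```python
import subprocess
import sys
from pathlib import Path

# Script names as listed in the README; the order follows its Usage section.
# Assumption: each script can be run as-is, without arguments.
PIPELINE = [
    "generate_the_excel.py",     # optional preprocessing step
    "compute_distance_mat.py",   # distance / similarity matrices
    "patterns.py",               # pattern search, configured via param_file.conf
    "summary.py",                # summary files for the pattern search
    "similarityAllProteins.py",  # summary files for the similarity matrices
]


def build_commands(repo_root, skip_preprocessing=False):
    """Return one command (a list of arguments) per pipeline script."""
    root = Path(repo_root)
    scripts = PIPELINE[1:] if skip_preprocessing else PIPELINE
    return [[sys.executable, str(root / name)] for name in scripts]


def run_pipeline(repo_root, skip_preprocessing=False, dry_run=True):
    """Print the commands (dry run) or execute them in order."""
    for cmd in build_commands(repo_root, skip_preprocessing):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the pipeline at the first failing script.
            subprocess.run(cmd, check=True, cwd=repo_root)


if __name__ == "__main__":
    # Dry run by default: only shows what would be executed.
    run_pipeline(".", dry_run=True)
```

Passing `skip_preprocessing=True` mirrors the README's remark that the preprocessed data is already available in Output, so the optional first step can be left out. The notebooks for the statistical analysis and Sankey plots are not covered here and would still be run separately.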