Update README.md

52007cea · Rafael Artinano · 49a34461 · 52007cea
Commit 52007cea authored Mar 04, 2025 by Rafael Artinano

Showing with 52 additions and 8 deletions

README.md +52 -8

--- a/README.md
+++ b/README.md
@@ -37,21 +37,65 @@ Note: It is recommended to use C++11 and pybind11.
 ## Usage
-First the file generate_the_excel.py filters the original dataset and provides a file without repetitions of the same proteins with different names.
+### Data Preprocessing
+(Optional) Run generate_the_excel.py to filter the original dataset and remove duplicate entries of the same protein listed under different names.
-In case of replicating the similarity matrices, compute_distance_mat.py provides the distance matrices based on the input wanted. To replicate the experiment, if used blosum62, the function generate_nwmodpremade for any other cases is generate_nwmod.
+The data used is available in Output, so the experiment can also be replicated starting from a later step.
-For the pattern search algorithm one have to introduce the data asked in param_file.conf and execute patterns.py 
+### Similarity Matrices
-The files summary.py and similarityAllProteins.py provide the summary files of pattern search and similarity matrices. 
+To replicate the similarity matrices:
-Once every previous file is generated the code in Code statistical methods provide the graphics and statistical analysis as the ones shown in the publication.
+Use compute_distance_mat.py to generate distance matrices.
-The sandkeys are generated in Pattern found - Sankey plots.ipynb
+1. For BLOSUM62, use the generate_nwmodpremade function.
-Analysis_of_similarities-patterns_significance__Simi_BLOSUM.ipynb, Analysis_of_similarities-patterns_significance__Simi_AA.ipynb and Analysis_of_similarities-patterns_significance__Simi_AA2.ipynb searches the similarities using the similarity matrices previously computed for blosum, mod1(AA) and mod2(AA2). 
+2. For other cases, use generate_nwmod.
-Analysis_of_similarities-patterns_significance__DR.ipynb stablish the proposed drug repurposing analysis. 
+### Pattern Search
+Configure the param_file.conf file with the required data.
+Execute patterns.py to run the pattern search algorithm using the data in param_file.conf.
+### Statistical Analysis
+Use summary.py and similarityAllProteins.py to generate summary files for pattern search and similarity matrices.
+The code in Code statistical methods provides the graphics and statistical analysis shown in the publication.
+To replicate the Sankey plots, use the notebook: **Pattern found - Sankey plots.ipynb**.
+The analysis is divided into the following three parts:
+1. **Analysis_of_similarities-patterns_significance__Simi_BLOSUM.ipynb** (for BLOSUM).
+2. **Analysis_of_similarities-patterns_significance__Simi_AA.ipynb** (for mod1/AA).
+3. **Analysis_of_similarities-patterns_significance__Simi_AA2.ipynb** (for mod2/AA2).
+Each file provides the statistical analysis for the corresponding input given to the similarity algorithm.
+The drug repurposing analysis is in: **Analysis_of_similarities-patterns_significance__DR.ipynb**.
+## Step-by-Step Experiment Reproduction
+To reproduce the steps followed during the experiment:
+### Preprocess the Data
+Begin by preprocessing the data to ensure it is clean and ready for analysis.
+### Calculate Protein Similarities
+Compute the similarity between proteins for each dataset. Use different matrices (e.g., BLOSUM62, AA, AA2) as input for each case.
+### Identify Patterns
+Search for patterns in the treatment_lung_cancer.csv file. Then, locate these patterns in the data_cancers_disease.csv file.
+### Perform Statistical Analysis
+Conduct the statistical analysis to validate the findings.
+### Generate Visualizations
+Plot the graphics to visualize the results and support the conclusions.
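
The script order described in the updated README can be sketched as a small Python driver. This is only an illustration and not part of the commit: the script names are taken from the README, but the `PIPELINE` list, the `build_commands` and `run_pipeline` helpers, and the assumption that each script runs without command-line arguments from the repository root are all hypothetical. The choice between `generate_nwmodpremade` and `generate_nwmod`, the contents of param_file.conf, and the notebooks are not modeled here.

```python
import subprocess
import sys
from pathlib import Path

# Script names come from the README; the order follows its Usage section.
# Assumption: every script can be run without arguments from the repo root.
PIPELINE = [
    ("preprocess", "generate_the_excel.py"),    # optional filtering step
    ("similarity", "compute_distance_mat.py"),  # distance matrices
    ("patterns", "patterns.py"),                # reads param_file.conf
    ("summary", "summary.py"),                  # pattern search summary
    ("summary", "similarityAllProteins.py"),    # similarity summary
]


def build_commands(repo_root, skip=()):
    """Return one interpreter command per script, omitting skipped stages."""
    root = Path(repo_root)
    return [
        [sys.executable, str(root / script)]
        for stage, script in PIPELINE
        if stage not in skip
    ]


def run_pipeline(repo_root, skip=(), dry_run=True):
    """Print the commands (dry run) or execute them in order."""
    commands = build_commands(repo_root, skip)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the pipeline at the first failing script.
            subprocess.run(cmd, check=True, cwd=repo_root)
    return commands


if __name__ == "__main__":
    # Skip the optional preprocessing step, as the README allows.
    run_pipeline(".", skip=("preprocess",), dry_run=True)
```

Keeping `dry_run=True` as the default only prints the commands, which makes it possible to check the order before running anything against the real data.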