Commit 52007cea authored by Rafael Artinano

Update README.md

parent 49a34461
@@ -37,21 +37,65 @@ Note: It is recommended to use C++11 and pybind11.
## Usage
-First the file generate_the_excel.py filters the original dataset and provides a file without repetitions of the same proteins with different names.
-In case of replicating the similarity matrices, compute_distance_mat.py provides the distance matrices based on the input wanted. To replicate the experiment, if used blosum62, the function generate_nwmodpremade for any other cases is generate_nwmod. The data used is available in Output in case the objective is to replicate the experiment from later steps.
-For the pattern search algorithm one have to introduce the data asked in param_file.conf and execute patterns.py
-The files summary.py and similarityAllProteins.py provide the summary files of pattern search and similarity matrices.
-Once every previous file is generated the code in Code statistical methods provide the graphics and statistical analysis as the ones shown in the publication.
-The sandkeys are generated in Pattern found - Sankey plots.ipynb
-Analysis_of_similarities-patterns_significance__Simi_BLOSUM.ipynb, Analysis_of_similarities-patterns_significance__Simi_AA.ipynb and Analysis_of_similarities-patterns_significance__Simi_AA2.ipynb searches the similarities using the similarity matrices previously computed for blosum, mod1(AA) and mod2(AA2).
-Analysis_of_similarities-patterns_significance__DR.ipynb stablish the proposed drug repurposing analysis.
### Data Preprocessing
(Optional) Run generate_the_excel.py to filter the original dataset and remove repetitions of proteins with different names.
### Similarity Matrices
To replicate the similarity matrices:
Use compute_distance_mat.py to generate distance matrices.
1. For BLOSUM62, use the generate_nwmodpremade function.
2. For other cases, use generate_nwmod.
### Pattern Search
Configure the param_file.conf file with the required data.
Execute patterns.py to run the pattern search algorithm using the data in param_file.conf.
### Statistical Analysis
Use summary.py and similarityAllProteins.py to generate summary files for pattern search and similarity matrices.
The code in Code statistical methods provides the graphics and statistical analysis shown in the publication.
To replicate the Sankey plots, use the notebook **Pattern found - Sankey plots.ipynb**.
The analysis is divided into the following three parts:
1. **Analysis_of_similarities-patterns_significance__Simi_BLOSUM.ipynb** (for BLOSUM).
2. **Analysis_of_similarities-patterns_significance__Simi_AA.ipynb** (for mod1/AA).
3. **Analysis_of_similarities-patterns_significance__Simi_AA2.ipynb** (for mod2/AA2).
Each notebook provides the statistical analysis for one of the inputs given to the similarity algorithm.
The drug repurposing analysis is in **Analysis_of_similarities-patterns_significance__DR.ipynb**.
## Step-by-Step Experiment Reproduction
To reproduce the steps followed during the experiment:
### Preprocess the Data
Begin by preprocessing the data to ensure it is clean and ready for analysis.
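The deduplication described for generate_the_excel.py (removing proteins that appear under different names) can be sketched roughly as follows. This is an illustration only: it assumes duplicates are detected by identical sequences, and the function name, the `(name, sequence)` record shape, and the toy data are all hypothetical rather than taken from the script.

```python
from collections import OrderedDict


def merge_duplicate_proteins(records):
    """Collapse records that share a sequence, keeping every name seen for it.

    `records` is an iterable of (name, sequence) pairs. Both this input shape
    and the sequence-identity criterion are assumptions for illustration.
    """
    merged = OrderedDict()
    for name, sequence in records:
        key = sequence.strip().upper()  # normalise case and stray whitespace
        names = merged.setdefault(key, [])
        if name not in names:
            names.append(name)
    # One entry per unique sequence: (first name seen, all names, sequence)
    return [(names[0], names, seq) for seq, names in merged.items()]


# Toy records, not real data: the first two share a sequence under two names.
records = [
    ("protein_A", "MKTAYIAKQR"),
    ("protein_A_alias", "mktayiakqr"),
    ("protein_B", "GGSLLPWAVE"),
]
unique = merge_duplicate_proteins(records)
```

Here `unique` holds two entries, and the first one lists both names recorded for the shared sequence.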
### Calculate Protein Similarities
Compute the similarity between proteins for each dataset. Use different matrices (e.g., BLOSUM62, AA, AA2) as input for each case.
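To make the role of the input matrix concrete, here is a minimal sketch of a textbook Needleman-Wunsch global alignment score with a pluggable substitution function. It is not the repository's generate_nwmod or generate_nwmodpremade, whose signatures and modifications are not shown here. The linear gap penalty, the toy scoring function, and all names below are assumptions.

```python
def global_alignment_score(a, b, substitution, gap=-4):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    `substitution(x, y)` returns the score for aligning residues x and y;
    swapping this function is how a different matrix would be plugged in.
    """
    prev = [j * gap for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + substitution(a[i - 1], b[j - 1])
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


def toy_substitution(x, y):
    """Stand-in for a real matrix such as BLOSUM62 (hypothetical scores)."""
    return 2 if x == y else -1


def pairwise_scores(sequences, substitution, gap=-4):
    """All-against-all score matrix as a list of lists."""
    return [
        [global_alignment_score(s, t, substitution, gap) for t in sequences]
        for s in sequences
    ]


scores = pairwise_scores(["ACD", "AD", "ACD"], toy_substitution)
```

With a real substitution matrix, `toy_substitution` would be replaced by a lookup into that matrix. Turning scores into distances is a separate step that this sketch does not cover.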
### Identify Patterns
Search for patterns in the treatment_lung_cancer.csv file. Then, locate these patterns in the data_cancers_disease.csv file.
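The two-stage flow (find patterns in one dataset, then locate them in another) can be illustrated with a simple stand-in. The sketch below treats a pattern as a fixed-length substring shared by several sequences. That is an assumption made for illustration, not the algorithm implemented in patterns.py, and the function names, parameters, and toy sequences are hypothetical.

```python
from collections import defaultdict


def frequent_kmers(sequences, k, min_support):
    """Return the length-k substrings found in at least `min_support` sequences."""
    support = defaultdict(set)
    for idx, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            support[seq[start:start + k]].add(idx)
    return {kmer for kmer, owners in support.items() if len(owners) >= min_support}


def locate_patterns(patterns, sequences):
    """Map each pattern to its (sequence index, start offset) occurrences."""
    hits = defaultdict(list)
    for idx, seq in enumerate(sequences):
        for pattern in patterns:
            start = seq.find(pattern)
            while start != -1:
                hits[pattern].append((idx, start))
                start = seq.find(pattern, start + 1)  # allow overlapping hits
    return dict(hits)


# Toy sequences standing in for the two CSV files.
source_set = ["MKTAYIAK", "GGTAYIPP", "TAYQQQ"]
target_set = ["AATAYIAA", "CCCC", "TAYITAYI"]

patterns = frequent_kmers(source_set, k=4, min_support=2)
hits = locate_patterns(patterns, target_set)
```

In this toy case the only substring shared by two source sequences is `TAYI`, which then appears once in the first target sequence and twice in the third.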
### Perform Statistical Analysis
Conduct the statistical analysis to validate the findings.
### Generate Visualizations
Plot the graphics to visualize the results and support the conclusions.