@@ -37,21 +37,65 @@ Note: It is recommended to use C++11 and pybind11.
## Usage
First, generate_the_excel.py filters the original dataset and produces a file in which proteins listed under different names are not repeated.
### Data Preprocessing
(Optional) Run generate_the_excel.py to filter the original dataset and remove duplicate entries of the same protein listed under different names.
To replicate the similarity matrices, compute_distance_mat.py produces the distance matrices for the chosen input. To replicate the experiment, use the function generate_nwmodpremade with BLOSUM62 and generate_nwmod in any other case.
The data used is available in Output, for replicating the experiment starting from a later step.
For the pattern search algorithm, fill in the data requested in param_file.conf and execute patterns.py.
### Similarity Matrices
The files summary.py and similarityAllProteins.py provide the summary files of pattern search and similarity matrices.
To replicate the similarity matrices:
Once all the previous files have been generated, the code in Code statistical methods provides the graphics and statistical analysis shown in the publication.
Use compute_distance_mat.py to generate distance matrices.
The Sankey plots are generated in Pattern found - Sankey plots.ipynb.
1. For BLOSUM62, use the generate_nwmodpremade function.
Analysis_of_similarities-patterns_significance__Simi_BLOSUM.ipynb, Analysis_of_similarities-patterns_significance__Simi_AA.ipynb and Analysis_of_similarities-patterns_significance__Simi_AA2.ipynb search for similarities using the similarity matrices previously computed for BLOSUM, mod1 (AA) and mod2 (AA2).
2. For other cases, use generate_nwmod.
Analysis_of_similarities-patterns_significance__DR.ipynb establishes the proposed drug repurposing analysis.
### Pattern Search
Configure the param_file.conf file with the required data.
Execute patterns.py to run the pattern search algorithm using the data in param_file.conf.
### Statistical Analysis
Use summary.py and similarityAllProteins.py to generate summary files for pattern search and similarity matrices.
The code in Code statistical methods provides the graphics and statistical analysis shown in the publication.
To replicate the Sankey plots, use the notebook **Pattern found - Sankey plots.ipynb**.
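
The script steps above can be chained in a small driver. This is a minimal sketch, not part of the repository: the helper names `build_commands` and `run_pipeline` are hypothetical, and it assumes each script is run from the repository root without command-line arguments and that param_file.conf has already been filled in. Adjust it if the scripts expect arguments or a different working directory.

```python
import subprocess
import sys

# Script order taken from the Usage section. Running each script with no
# arguments from the repository root is an assumption, not documented behaviour.
PIPELINE = [
    "generate_the_excel.py",      # optional: filter the original dataset
    "compute_distance_mat.py",    # distance matrices
    "patterns.py",                # pattern search, configured via param_file.conf
    "summary.py",                 # summary files
    "similarityAllProteins.py",   # summary files
]


def build_commands(scripts=PIPELINE, skip_preprocessing=False):
    """Return one argv list per script, optionally skipping the preprocessing step."""
    selected = [
        s for s in scripts
        if not (skip_preprocessing and s == "generate_the_excel.py")
    ]
    # sys.executable keeps the scripts on the same interpreter as the driver.
    return [[sys.executable, s] for s in selected]


def run_pipeline(skip_preprocessing=False):
    """Run the scripts in order, stopping at the first one that fails."""
    for cmd in build_commands(skip_preprocessing=skip_preprocessing):
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline(skip_preprocessing=True)` would start from compute_distance_mat.py. The notebooks are left out of the sketch because they are meant to be opened and run interactively.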