@@ -7,11 +7,19 @@ Repository of the paper: Finding patterns in lung cancer protein sequences for d
## Requirements
### Python Dependencies
All Python requirements are listed in `requirements.txt`. Install them using:
```bash
pip install -r requirements.txt
```
### C++ Code
The project also uses C++ code. To compile it:
1. Navigate to the Code Approach 1 and 2 (2.1 - 2.2) directory.
2. Use the provided CMAKELISTS.txt file.
3. It is recommended to use C++11 and pybind11.
## Contents
...
...
@@ -29,17 +37,17 @@ Code Approach 1 and 2 (2.1 - 2.2) contain the pattern searching algorithms as we
First, `generate_the_excel.py` filters the original dataset and produces a file in which proteins that appear under several names are listed only once.
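The deduplication step can be pictured with a minimal sketch. It is not the code in `generate_the_excel.py`. It assumes, for illustration only, that each record is a (name, sequence) pair and that two records describe the same protein when their sequences are identical. The function name `drop_repeated_proteins` and the sample records are hypothetical.

```python
def drop_repeated_proteins(records):
    """Keep the first record for each distinct sequence.

    `records` is an iterable of (name, sequence) pairs. Treating identical
    sequences as the same protein is an assumption of this sketch.
    """
    seen = set()
    unique = []
    for name, sequence in records:
        key = sequence.strip().upper()  # normalise case and surrounding whitespace
        if key in seen:
            continue  # this sequence was already kept under another name
        seen.add(key)
        unique.append((name, sequence))
    return unique


# Hypothetical records: the first two share one sequence under different names.
records = [
    ("protein_A", "MKTAYIAK"),
    ("protein_A_alias", "mktayiak"),
    ("protein_B", "MSLLTEVE"),
]
unique_records = drop_repeated_proteins(records)
print(unique_records)
```

Keeping the first occurrence preserves the input order, which makes the filtered file easy to compare against the original dataset.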
To replicate the similarity matrices, `compute_distance_mat.py` computes the distance matrices for the chosen input. To replicate the experiment, use the function `generate_nwmodpremade` when working with BLOSUM62 and `generate_nwmod` in any other case.
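The function names suggest a Needleman-Wunsch style global alignment, although that is an inference from the names and not something this README states. Under that assumption, the sketch below shows how a matrix of pairwise alignment scores can be built. It uses a toy match/mismatch scoring function in place of BLOSUM62 and a linear gap penalty. Every name in it (`nw_score`, `toy_substitution`, `score_matrix`) is hypothetical and unrelated to the functions in `compute_distance_mat.py`.

```python
def nw_score(a, b, substitution, gap=-2):
    """Global alignment score of a and b (Needleman-Wunsch, linear gap penalty)."""
    # previous[j] is the best score for aligning the processed prefix of a with b[:j]
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + substitution(a[i - 1], b[j - 1])
            up = previous[j] + gap        # gap inserted in b
            left = current[j - 1] + gap   # gap inserted in a
            current.append(max(diagonal, up, left))
        previous = current
    return previous[-1]


def toy_substitution(x, y):
    """Toy scoring (assumption): +1 for a match, -1 for a mismatch."""
    return 1 if x == y else -1


def score_matrix(sequences, substitution, gap=-2):
    """Symmetric matrix of pairwise global alignment scores."""
    n = len(sequences)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            score = nw_score(sequences[i], sequences[j], substitution, gap)
            matrix[i][j] = matrix[j][i] = score
    return matrix


# Hypothetical input sequences, for illustration only.
sequences = ["GATTACA", "GCATGCU", "GATTACA"]
matrix = score_matrix(sequences, toy_substitution)
print(matrix)
```

The sketch returns similarity scores. How the repository turns scores into distances, and how its modified scoring schemes differ from this toy one, is not covered here.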
For the pattern search algorithm, enter the requested values in `param_file.conf` and run `patterns.py`.
The files `summary.py` and `similarityAllProteins.py` produce the summary files for the pattern search and the similarity matrices.
Once all the previous files have been generated, the code in `Code statistical methods` produces the plots and statistical analyses shown in the publication.
The Sankey plots are generated in `Pattern found - Sankey plots.ipynb`.
`Analysis_of_similarities-patterns_significance__Simi_BLOSUM.ipynb`, `Analysis_of_similarities-patterns_significance__Simi_AA.ipynb` and `Analysis_of_similarities-patterns_significance__Simi_AA2.ipynb` search for similarities using the similarity matrices previously computed for BLOSUM, mod1 (AA) and mod2 (AA2).
`Analysis_of_similarities-patterns_significance__DR.ipynb` carries out the proposed drug repurposing analysis.