Update README.md

parent 826e98e3
# Embeddings in DR
1. Supplementary `data.xlsx` description: Excel file, available in the present GitLab repository, with different sheets that contain the data used in the analysis.
- **DR cases – repoDB**
...
The distance value for each protein pair is included for every embedding method.
- **PP by class – Distances**
The distance value for each protein pair, filtered by PANTHERdb class, is included for every embedding method. We indicate whether each protein pair belongs to repoDB, Literature, or both datasets.
2. Supplementary files with the protein data used in the study, the sequence embeddings from the four reviewed methods, and the protein pair distances, available at <https://drive.upm.es/s/egBAv71on4AgBdn?path=%2Fdata>
**File folder structure**
embeddings/
├── Global_embedding.ipynb
│   └── **Script to execute protein sequence embedding retrieval for all methods**
├── OneHot.tsv
│   └── **Final embeddings in One-Hot encoding**
├── SGT.tsv
│   └── **Final embeddings in Sequence Graph Transform encoding**
├── ProtBERT.tsv
│   └── **Final embeddings in Pretrained BERT (ProtTrans model) Transformer encoding**
└── SeqVec.tsv
    └── **Final embeddings in Pretrained CNN + biLSTM (SeqVec model) encoding**
distances/
├── OneHot.npy
│   └── **Cosine distance matrix from all protein pair embeddings retrieved from One-Hot encoding**
├── SGT.npy
│   └── **Cosine distance matrix from all protein pair embeddings retrieved from SGT encoding**
├── ProtBERT.npy
│   └── **Cosine distance matrix from all protein pair embeddings retrieved from BERT encoding**
└── SeqVec.npy
    └── **Cosine distance matrix from all protein pair embeddings retrieved from SeqVec encoding**
proteins/
├── protein_list/
│   ├── protein_names.csv
│   │   └── **List of all protein IDs (UniProt ID), names and descriptions.**
│   ├── proteins.csv
│   │   └── **All protein IDs used for embedding generation in every method except SGT, after length filtering**
│   └── proteins_sgt.csv
│       └── **All protein IDs used for embedding generation in SGT**
├── proteins_DR/
│   ├── CSBJ-ND.tsv
│   │   └── **List of all protein pairs linked to disease pairs that share the same repurposed pharmaceutical indication, from repurposing data gathered from Literature [1]**
│   └── repoDB.tsv
│       └── **List of all protein pairs linked to disease pairs that share the same repurposed pharmaceutical indication, from repurposing data gathered from repoDB [2]**
└── proteins_DR_different_class/
    ├── CSBJ-ND.tsv
    │   └── **Protein pair list from proteins_DR/CSBJ-ND.tsv, limited to protein pairs that do not share the same protein functional class (filtered subset)**
    └── repoDB.tsv
        └── **Protein pair list from proteins_DR/repoDB.tsv, limited to protein pairs that do not share the same protein functional class (filtered subset)**
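
The `.npy` files under `distances/` are described above as cosine distance matrices over protein embeddings. The following is a minimal sketch of that computation, assuming each embedding is a fixed-length numeric vector. The function name `cosine_distance_matrix` and the toy vectors are illustrative only and are not taken from `Global_embedding.ipynb`.

```python
import numpy as np


def cosine_distance_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Return the pairwise cosine distance (1 - cosine similarity) matrix.

    `embeddings` is assumed to have shape (n_proteins, embedding_dim).
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Guard against all-zero vectors to avoid division by zero.
    unit = embeddings / np.clip(norms, 1e-12, None)
    similarity = unit @ unit.T
    # Clip to absorb floating-point drift outside [-1, 1].
    return 1.0 - np.clip(similarity, -1.0, 1.0)


# Toy stand-in for real embeddings (hypothetical values, three "proteins").
toy_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_distances = cosine_distance_matrix(toy_embeddings)
```

If the shared files are plain NumPy arrays, a precomputed matrix could presumably be read with `np.load("distances/SeqVec.npy")`. The row and column order relative to `proteins.csv` is not stated here and should be checked before indexing by protein ID.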