# Embeddings in DR

1. Supplementary `data.xlsx` description

   Excel file, available in the present GitLab repository, whose sheets contain the data used in the analysis.

   - **DR cases – repoDB**: Drug repurposing cases extracted from the repoDB database. Cases where the disease and the drug shared the drug target protein were excluded. The GDA score is also reported.
   - **DR cases – Literature**: Drug repurposing cases selected from the Literature [1]. Cases where the disease and the drug shared the drug target protein were excluded. Moreover, only the new disease for which the drug was repositioned was considered, not the original one for which it was indicated. The GDA score is also reported.
   - **PP – repoDB**: Unique protein pairs from the drug repurposing repoDB cases.
   - **PP – Literature**: Unique protein pairs from the drug repurposing literature cases.
   - **PP by class – repoDB**: Protein pairs filtered by PANTHERdb class from the drug repurposing repoDB cases. Pairs that share one or more classes were excluded.
   - **PP by class – Literature**: Protein pairs filtered by PANTHERdb class from the drug repurposing literature cases. Pairs that share one or more classes were excluded.
   - **PP – Distances**: The distance value of each protein pair for every embedding method. Each pair is marked as belonging to repoDB, Literature or both datasets.
   - **PP by class – Distances**: The distance value of each protein pair filtered by PANTHERdb class for every embedding method. Each pair is marked as belonging to repoDB, Literature or both datasets.

2. Supplementary files for the protein data used in the study, the sequence embeddings from the four reviewed methods and the protein pair distances, available in [this shared folder](https://drive.upm.es/s/egBAv71on4AgBdn?path=%2Fdata)

   **File folder structure**

   ```
   embeddings/
   ├── Global_embedding.ipynb
   │   └── **Script to execute protein sequence embedding retrieval for all methods**
   ├── OneHot.tsv
   │   └── **Final embeddings in One-Hot encoding**
   ├── SGT.tsv
   │   └── **Final embeddings in Sequence Graph Transform encoding**
   ├── ProtBERT.tsv
   │   └── **Final embeddings in Pretrained BERT (ProtTrans model) Transformer encoding**
   └── SeqVec.tsv
       └── **Final embeddings in Pretrained CNN + biLSTM (SeqVec model) encoding**
   distances/
   ├── OneHot.npy
   │   └── **Cosine distance matrix from all protein pair embeddings retrieved from One-Hot encoding**
   ├── SGT.npy
   │   └── **Cosine distance matrix from all protein pair embeddings retrieved from SGT encoding**
   ├── ProtBERT.npy
   │   └── **Cosine distance matrix from all protein pair embeddings retrieved from ProtBERT encoding**
   └── SeqVec.npy
       └── **Cosine distance matrix from all protein pair embeddings retrieved from SeqVec encoding**
   proteins/
   ├── protein_list/
   │   ├── protein_names.csv
   │   │   └── **List of all protein IDs (UniProt IDs), names and descriptions**
   │   ├── proteins.csv
   │   │   └── **All protein IDs retrieved for embedding generation in every method except SGT, after length filtering**
   │   └── proteins_sgt.csv
   │       └── **All protein IDs retrieved for embedding generation in SGT**
   ├── proteins_DR/
   │   ├── CSBJ-ND.tsv
   │   │   └── **List of all protein pairs linked to disease pairs that share the same repurposed pharmaceutical indication, from repurposing data gathered in Literature [1]**
   │   └── repoDB.tsv
   │       └── **List of all protein pairs linked to disease pairs that share the same repurposed pharmaceutical indication, from repurposing data gathered from repoDB [2]**
   └── proteins_DR_different_class/
       ├── CSBJ-ND.tsv
       │   └── **Protein pair list from proteins_DR/CSBJ-ND.tsv, limited to protein pairs that do not share the same protein functional class (filtered subset)**
       └── repoDB.tsv
           └── **Protein pair list from proteins_DR/repoDB.tsv, limited to protein pairs that do not share the same protein functional class (filtered subset)**
   ```

---

**[1].-** Prieto Santamaría L, Ugarte Carro E, Díaz Uzquiano M, Menasalvas Ruiz E, Pérez Gallardo Y, Rodríguez-González A. A data-driven methodology towards evaluating the potential of drug repurposing hypotheses. Comput Struct Biotechnol J. 2021 Aug 9;19:4559-4573. doi: 10.1016/j.csbj.2021.08.003. PMID: 34471499; PMCID: PMC8387760.

**[2].-** Brown AS, Patel CJ. A standard database for drug repositioning. Sci Data. 2017 Mar 14;4:170029. doi: 10.1038/sdata.2017.29. PMID: 28291243; PMCID: PMC5349249.
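
**Example: cosine distance between protein embeddings (illustrative sketch)**

The files under `distances/` are described as cosine distance matrices computed from the embeddings of each method. The sketch below shows one way such a matrix could be computed and queried for a protein pair. It is not the code used in the study: the protein IDs and vectors are toy values, `cosine_distance_matrix` and `pair_distance` are hypothetical helper names, and the row order of the real `.npy` matrices relative to the protein lists is not stated here, so it should be checked before applying the same lookup to the published files.

```python
import numpy as np


def cosine_distance_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Return pairwise cosine distances (1 - cosine similarity) between rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    if np.any(norms == 0):
        raise ValueError("cosine distance is undefined for a zero-norm embedding")
    unit = embeddings / norms
    # Clip to guard against floating-point values slightly outside [-1, 1].
    similarity = np.clip(unit @ unit.T, -1.0, 1.0)
    return 1.0 - similarity


def pair_distance(distances: np.ndarray, ids: list, a: str, b: str) -> float:
    """Look up the distance of one protein pair, assuming row i matches ids[i]."""
    index = {protein_id: i for i, protein_id in enumerate(ids)}
    return float(distances[index[a], index[b]])


# Toy stand-in for an embedding table: three hypothetical proteins, 4 dimensions.
protein_ids = ["P00001", "P00002", "P00003"]  # hypothetical IDs, not from the dataset
toy_embeddings = np.array(
    [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [1.0, 1.0, 0.0, 0.0],
    ]
)

distances = cosine_distance_matrix(toy_embeddings)
print(pair_distance(distances, protein_ids, "P00001", "P00003"))
```

A matrix built this way can be written with `np.save` and read back with `np.load`, which matches the `.npy` extension of the distance files.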