# Embeddings in DR


1.  Supplementary `data.xlsx` description: Excel file, available in this GitLab repository, with several sheets that contain the data used in the analysis.
- **DR cases – repoDB**

Drug repurposing cases extracted from the repoDB database. We excluded those cases where the disease and the drug shared the drug target protein. The GDA score is also shown.

- **DR cases – Literature**

Drug repurposing cases selected from the literature [1]. We excluded those cases where the disease and the drug shared the drug target protein. Moreover, we only considered the new disease for which the drug was repositioned, not the original one for which it was indicated. The GDA score is also shown.

- **PP – repoDB**

Unique protein pairs from the drug repurposing repoDB cases.

- **PP – Literature**

Unique protein pairs from the drug repurposing literature cases.

- **PP by class – repoDB**

Protein pairs filtered by PANTHERdb class from the drug repurposing repoDB cases. We made sure the pairs did not share class or classes.

- **PP by class – Literature**

Protein pairs filtered by PANTHERdb class from the drug repurposing literature cases. We made sure the pairs did not share class or classes.

- **PP – Distances**

The distance value for each protein pair is included for every embedding method. We indicated if the protein pair belonged to repoDB, Literature or both datasets.

- **PP by class – Distances**

The distance value for each protein pair filtered by PANTHERdb class is included for every embedding method. We indicated if the protein pair belonged to repoDB, Literature or both datasets.
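
The class filter and the dataset label described in the sheets above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names, the protein IDs and the class names are hypothetical, pairs are treated as unordered, and dropping pairs with an unannotated protein is an assumption.

```python
def pairs_with_disjoint_classes(pairs, classes_by_protein):
    """Keep protein pairs whose PANTHER class sets share no class.

    Assumption: a pair is dropped when either protein has no class
    annotation, since it cannot be shown that the classes differ.
    """
    kept = []
    for protein_a, protein_b in pairs:
        classes_a = classes_by_protein.get(protein_a, set())
        classes_b = classes_by_protein.get(protein_b, set())
        if classes_a and classes_b and classes_a.isdisjoint(classes_b):
            kept.append((protein_a, protein_b))
    return kept


def dataset_label(pair, repodb_pairs, literature_pairs):
    """Return 'repoDB', 'Literature', 'both' or None for an unordered pair.

    Both collections are expected to hold frozensets of two protein IDs.
    """
    key = frozenset(pair)
    in_repodb = key in repodb_pairs
    in_literature = key in literature_pairs
    if in_repodb and in_literature:
        return "both"
    if in_repodb:
        return "repoDB"
    if in_literature:
        return "Literature"
    return None


# Hypothetical toy data: IDs and class names are placeholders.
classes = {"P1": {"kinase"}, "P2": {"protease"}, "P3": {"kinase", "transporter"}}
pairs = [("P1", "P2"), ("P1", "P3"), ("P2", "P3")]
filtered = pairs_with_disjoint_classes(pairs, classes)
```

With the toy data, `("P1", "P3")` is removed because both proteins carry the `kinase` class, while the other two pairs are kept.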


2. Supplementary files for the protein data used in the study, the sequence embeddings from the four reviewed methods and the protein pair distances, available in [this shared folder](https://drive.upm.es/s/egBAv71on4AgBdn?path=%2Fdata).

**File folder structure**

```
embeddings/
├── Global_embedding.ipynb 
│   └── **Script to execute protein sequence embedding retrieval for all methods**
├── OneHot.tsv
│   └── **Final embeddings in One Hot encoding**
├── SGT.tsv 
│   └── **Final embeddings in Sequence Graph Transform encoding**
├── ProtBERT.tsv
│   └── **Final embeddings in Pretrained BERT (ProtTrans model) Transformer encoding**
└── SeqVec.tsv
    └── **Final embeddings in Pretrained CNN + biLSTM (SeqVec model) encoding**
    
    
distances/
├── OneHot.npy
│   └── **Cosine distance matrix from all protein pair embeddings retrieved from One-Hot encoding**
├── SGT.npy
│   └── **Cosine distance matrix from all protein pair embeddings retrieved from SGT encoding**
├── ProtBERT.npy
│   └── **Cosine distance matrix from all protein pair embeddings retrieved from BERT encoding**
└── SeqVec.npy
    └── **Cosine distance matrix from all protein pair embeddings retrieved from SeqVec encoding**
    
    
proteins/
├── protein_list/
│   ├── protein_names.csv
│   │   └── **List of total protein IDs (Uniprot ID), name and descriptions.**
│   ├── proteins.csv
│   │   └── **Total protein IDs retrieved for embeddings in all embedding methods except SGT after length filtering**
│   └── proteins_sgt.csv
│       └── **Total protein IDs retrieved for embeddings generation in SGT**
├── proteins_DR/
│   ├── CSBJ-ND.tsv
│   │   └── **List of total protein pairs linked to disease pairs that share the same repurposed pharmaceutical indication, from repurposing data gathered in Literature[1]**
│   └── repoDB.tsv
│       └── **List of total protein pairs linked to disease pairs that share the same repurposed pharmaceutical indication, from repurposing data gathered from repoDB[2]**
└── proteins_DR_different_class/
    ├── CSBJ-ND.tsv
    │   └── **Protein pair lists from proteins_DR/CSBJ-ND.tsv, limited to protein pairs that do not share the same protein functional class (Filtered subset)**
    └── repoDB.tsv
        └── **Protein pair lists from proteins_DR/repoDB.tsv, limited to protein pairs that do not share the same protein functional class (Filtered subset)**
```
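
The files in `distances/` are described as cosine distance matrices over the protein embeddings. The sketch below shows how such a matrix can be computed, assuming the common definition of cosine distance as one minus cosine similarity. It uses plain Python lists, and the function names are hypothetical. The README does not state the row order or layout of the stored `.npy` matrices, so the full symmetric matrix built here is an assumption.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two equal-length vectors: 1 - cos(u, v)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def cosine_distance_matrix(embeddings):
    """Symmetric matrix of pairwise cosine distances, zero on the diagonal."""
    n = len(embeddings)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            distance = cosine_distance(embeddings[i], embeddings[j])
            matrix[i][j] = distance
            matrix[j][i] = distance
    return matrix
```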
---


**[1].-** Prieto Santamaría L, Ugarte Carro E, Díaz Uzquiano M, Menasalvas Ruiz E, Pérez Gallardo Y, Rodríguez-González A. A data-driven methodology towards evaluating the potential of drug repurposing hypotheses. Comput Struct Biotechnol J. 2021 Aug 9;19:4559-4573. doi: 10.1016/j.csbj.2021.08.003. PMID: 34471499; PMCID: PMC8387760.


**[2].-** Brown AS, Patel CJ. A standard database for drug repositioning. Sci Data. 2017 Mar 14;4:170029. doi: 10.1038/sdata.2017.29. PMID: 28291243; PMCID: PMC5349249.