# Embeddings in DR

1.  Supplementary `data.xlsx` description: Excel file, available in this GitLab repository, whose sheets contain the data used in the analysis.

- **DR cases – repoDB**

Drug repurposing cases extracted from the repoDB database. We excluded cases where the disease and the drug shared the drug target protein. The GDA score is also included.

- **DR cases – Literature**

Drug repurposing cases selected from the literature [1]. We excluded cases where the disease and the drug shared the drug target protein. Moreover, we only considered the new disease for which the drug was repositioned, not the original one for which it was indicated. The GDA score is also included.

- **PP – repoDB**

Unique protein pairs from the drug repurposing repoDB cases.

- **PP – Literature**

Unique protein pairs from the drug repurposing literature cases.

- **PP by class – repoDB**

Protein pairs filtered by PANTHERdb class from the drug repurposing repoDB cases. We kept only pairs whose two proteins do not share any class.

- **PP by class – Literature**

Protein pairs filtered by PANTHERdb class from the drug repurposing literature cases. We kept only pairs whose two proteins do not share any class.

- **PP – Distances**

The distance value for each protein pair is included for every embedding method. We indicated if the protein pair belonged to repoDB, Literature or both datasets.

- **PP by class – Distances**

The distance value for each protein pair filtered by PANTHERdb class is included for every embedding method. We indicated if the protein pair belonged to repoDB, Literature or both datasets.
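
The sheet descriptions above imply three small operations on protein pairs: deduplicating them, keeping pairs whose proteins share no PANTHERdb class, and labelling each pair by source dataset. A minimal Python sketch follows, under stated assumptions: pairs are treated as unordered, proteins without any class annotation are treated as sharing no class, and all function names, protein identifiers and class names are hypothetical. This is an illustration only, not the code used to build `data.xlsx`.

```python
def unique_pairs(pairs):
    """Deduplicate protein pairs, assuming (A, B) and (B, A) are the same pair."""
    return {tuple(sorted(pair)) for pair in pairs}


def different_class(pair, classes):
    """Return True when the two proteins share no class.

    `classes` maps a protein ID to a set of class names (hypothetical layout).
    A protein missing from `classes` is assumed to have no class at all.
    """
    first, second = pair
    return classes.get(first, set()).isdisjoint(classes.get(second, set()))


def dataset_label(pair, repodb_pairs, literature_pairs):
    """Label a pair as 'repoDB', 'Literature' or 'both'; None if in neither."""
    in_repodb = pair in repodb_pairs
    in_literature = pair in literature_pairs
    if in_repodb and in_literature:
        return "both"
    if in_repodb:
        return "repoDB"
    if in_literature:
        return "Literature"
    return None


# Toy data with placeholder identifiers (not real UniProt IDs or classes).
repodb_pairs = unique_pairs([("A", "B"), ("B", "A"), ("A", "C")])
literature_pairs = unique_pairs([("A", "C"), ("C", "D")])
classes = {"A": {"class_1"}, "B": {"class_1", "class_2"},
           "C": {"class_3"}, "D": {"class_3"}}

all_pairs = sorted(repodb_pairs | literature_pairs)
filtered = [pair for pair in all_pairs if different_class(pair, classes)]
labels = {pair: dataset_label(pair, repodb_pairs, literature_pairs)
          for pair in all_pairs}
```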

---

2. Supplementary files for protein data used in the study, sequence embeddings from the four reviewed methods and protein pair distances, available at <https://drive.upm.es/s/egBAv71on4AgBdn?path=%2Fdata>

*  **File folder structure**

```
embeddings/
├── Global_embedding.ipynb 
│   └── **Script to execute protein sequence embedding retrieval for all methods**
├── OneHot.tsv
│   └── **Final embeddings in One Hot encoding**
├── SGT.tsv 
│   └── **Final embeddings in Sequence Graph Transform encoding**
├── ProtBERT.tsv
│   └── **Final embeddings in Pretrained BERT (ProtTrans model) Transformer encoding**
└── SeqVec.tsv
    └── **Final embeddings in Pretrained CNN + biLSTM (SeqVec model) encoding**
    
    
distances/
├── OneHot.npy
│   └── **Cosine distance matrix from all protein pair embeddings retrieved from One-Hot encoding**
├── SGT.npy
│   └── **Cosine distance matrix from all protein pair embeddings retrieved from SGT encoding**
├── ProtBERT.npy
│   └── **Cosine distance matrix from all protein pair embeddings retrieved from BERT encoding**
└── SeqVec.npy
    └── **Cosine distance matrix from all protein pair embeddings retrieved from SeqVec encoding**
    
    
proteins/
├── protein_list/
│   ├── protein_names.csv
│   │   └── **List of total protein IDs (Uniprot ID), name and descriptions.**
│   ├── proteins.csv
│   │   └── **Total protein IDs retrieved for embeddings in all embedding methods except SGT after length filtering**
│   └── proteins_sgt.csv
│       └── **Total protein IDs retrieved for embeddings generation in SGT**
├── proteins_DR/
│   ├── CSBJ-ND.tsv
│   │   └── **List of total protein pairs linked to disease pairs that share the same repurposed pharmaceutical indication, from repurposing data gathered from the literature [1]**
│   └── repoDB.tsv
│       └── **List of total protein pairs linked to disease pairs that share the same repurposed pharmaceutical indication, from repurposing data gathered from repoDB [2]**
└── proteins_DR_different_class/
    ├── CSBJ-ND.tsv
    │   └── **Protein pair lists from proteins_DR/CSBJ-ND.tsv, limited to protein pairs that do not share the same protein functional class (filtered subset)**
    └── repoDB.tsv
        └── **Protein pair lists from proteins_DR/repoDB.tsv, limited to protein pairs that do not share the same protein functional class (filtered subset)**
```
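
The files under `distances/` are described as cosine distance matrices over protein embeddings. As a rough illustration of what such a matrix contains, the sketch below computes pairwise cosine distances (1 minus cosine similarity) in plain Python. The function names, the dictionary layout and the toy vectors are assumptions made for this example, and the notebook in `embeddings/` may compute distances differently. The `.npy` extension suggests NumPy arrays, which can normally be read with `numpy.load`, although the exact array layout is not documented here.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(embeddings):
    """Build a symmetric distance matrix from {protein_id: vector}.

    Returns the list of IDs (row/column order) and a nested-list matrix
    with zeros on the diagonal.
    """
    ids = list(embeddings)
    size = len(ids)
    matrix = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1, size):
            d = cosine_distance(embeddings[ids[i]], embeddings[ids[j]])
            matrix[i][j] = d
            matrix[j][i] = d
    return ids, matrix


# Toy embeddings with placeholder protein IDs (not real data).
toy_embeddings = {
    "prot_1": [1.0, 0.0],
    "prot_2": [2.0, 0.0],   # same direction as prot_1
    "prot_3": [0.0, 1.0],   # orthogonal to prot_1
}
ids, matrix = distance_matrix(toy_embeddings)
```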
---

**[1].-** *Prieto Santamaría L, Ugarte Carro E, Díaz Uzquiano M, Menasalvas Ruiz E, Pérez Gallardo Y, Rodríguez-González A. A data-driven methodology towards evaluating the potential of drug repurposing hypotheses. Comput Struct Biotechnol J. 2021 Aug 9;19:4559-4573. doi: 10.1016/j.csbj.2021.08.003. PMID: 34471499; PMCID: PMC8387760.*

**[2].-** *Brown AS, Patel CJ. A standard database for drug repositioning. Sci Data. 2017 Mar 14;4:170029. doi: 10.1038/sdata.2017.29. PMID: 28291243; PMCID: PMC5349249.*