# Embeddings in DR

1. Supplementary `data.xlsx` description: Excel file, available in this GitLab repository, whose sheets contain the data used in the analysis.

- **DR cases – repoDB**

Drug repurposing cases extracted from the repoDB database. We excluded those cases where the disease and the drug shared the drug target protein. The GDA score is also reported.

- **DR cases – Literature**

Drug repurposing cases selected from the literature [1]. We excluded those cases where the disease and the drug shared the drug target protein. Moreover, we only considered the new disease for which the drug was repositioned, not the original one for which it was indicated. The GDA score is also reported.

- **PP – repoDB**

Unique protein pairs from the drug repurposing repoDB cases.

- **PP – Literature**

Unique protein pairs from the drug repurposing literature cases.

- **PP by class – repoDB**

Protein pairs filtered by PANTHERdb class from the drug repurposing repoDB cases. We kept only pairs whose two proteins do not share any class.

- **PP by class – Literature**

Protein pairs filtered by PANTHERdb class from the drug repurposing literature cases. We kept only pairs whose two proteins do not share any class.

- **PP – Distances**

The distance value for each protein pair is included for every embedding method. We indicated if the protein pair belonged to repoDB, Literature or both datasets.

- **PP by class – Distances**

The distance value for each protein pair filtered by PANTHERdb class is included for every embedding method. We indicated whether the protein pair belonged to repoDB, Literature or both datasets.
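
As a rough illustration of the class filter behind the two *PP by class* sheets and of the repoDB / Literature / both label in the distance sheets, the sketch below uses plain Python. The protein IDs, class names, function names, and the choice to drop proteins without a class annotation are all assumptions made for the example; it is not the pipeline used in the study.

```python
def share_any_class(classes_a, classes_b):
    """Return True if the two proteins have at least one class in common."""
    return bool(set(classes_a) & set(classes_b))


def filter_pairs_by_class(pairs, protein_classes):
    """Keep only pairs whose two proteins are annotated and share no class."""
    kept = []
    for a, b in pairs:
        classes_a = protein_classes.get(a)
        classes_b = protein_classes.get(b)
        if not classes_a or not classes_b:
            continue  # assumption: pairs with an unannotated protein are dropped
        if not share_any_class(classes_a, classes_b):
            kept.append((a, b))
    return kept


def dataset_label(pair, repodb_pairs, literature_pairs):
    """Label a pair as 'repoDB', 'Literature', 'both', or None if in neither."""
    key = frozenset(pair)  # pairs are treated as unordered
    in_repodb = key in repodb_pairs
    in_literature = key in literature_pairs
    if in_repodb and in_literature:
        return "both"
    if in_repodb:
        return "repoDB"
    if in_literature:
        return "Literature"
    return None


# Hypothetical toy data: IDs and class names are placeholders.
protein_classes = {
    "P1": {"kinase"},
    "P2": {"kinase", "transferase"},
    "P3": {"transporter"},
}
pairs = [("P1", "P2"), ("P1", "P3"), ("P2", "P3")]
filtered = filter_pairs_by_class(pairs, protein_classes)

repodb_pairs = {frozenset(("P1", "P3"))}
literature_pairs = {frozenset(("P1", "P3")), frozenset(("P2", "P3"))}
labels = [dataset_label(p, repodb_pairs, literature_pairs) for p in filtered]
```

Here `("P1", "P2")` is discarded because both proteins carry the `kinase` class, while the two remaining pairs are kept and labelled by the dataset or datasets they appear in.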


---

2. Supplementary files with the protein data used in the study, the sequence embeddings from the four reviewed methods, and the protein pair distances, available at [https://drive.upm.es/s/egBAv71on4AgBdn?path=%2Fdata](https://drive.upm.es/s/egBAv71on4AgBdn?path=%2Fdata)



*  **File folder structure**


```
embeddings/
├── Global_embedding.ipynb 
│   └── **Script to execute protein sequence embedding retrieval for all methods**
├── OneHot.tsv
│   └── **Final embeddings in One Hot encoding**
├── SGT.tsv 
│   └── **Final embeddings in Sequence Graph Transform encoding**
├── ProtBERT.tsv
│   └── **Final embeddings in Pretrained BERT (ProtTrans model) Transformer encoding**
└── SeqVec.tsv
    └── **Final embeddings in Pretrained CNN + biLSTM (SeqVec model) encoding**
    
    
distances/
├── OneHot.npy
│   └── **Cosine distance matrix from all protein pair embeddings retrieved from One-Hot encoding**
├── SGT.npy
│   └── **Cosine distance matrix from all protein pair embeddings retrieved from SGT encoding**
├── ProtBERT.npy
│   └── **Cosine distance matrix from all protein pair embeddings retrieved from BERT encoding**
└── SeqVec.npy
    └── **Cosine distance matrix from all protein pair embeddings retrieved from SeqVec encoding**
    
    
proteins/
├── protein_list/
│   ├── protein_names.csv
│   │   └── **List of total protein IDs (UniProt ID), names and descriptions.**
│   ├── proteins.csv
│   │   └── **Total protein IDs retrieved for embeddings in all embedding methods except SGT after length filtering**
│   └── proteins_sgt.csv
│       └── **Total protein IDs retrieved for embeddings generation in SGT**
├── proteins_DR/
│   ├── CSBJ-ND.tsv
│   │   └── **List of total protein pairs linked to disease pairs that share the same repurposed pharmaceutical indication, from repurposing data gathered in Literature[1]**
│   └── repoDB.tsv
│       └── **List of total protein pairs linked to disease pairs that share the same repurposed pharmaceutical indication, from repurposing data gathered from repoDB[2]**
└── proteins_DR_different_class/
    ├── CSBJ-ND.tsv
    │   └── **Protein pair lists from proteins_DR/CSBJ-ND.tsv, limited to protein pairs that do not share the same protein functional class (Filtered subset)**
    └── repoDB.tsv
        └── **Protein pair lists from proteins_DR/repoDB.tsv, limited to protein pairs that do not share the same protein functional class (Filtered subset)**
```
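
The files under `distances/` are described above as cosine distance matrices built from the embeddings. The sketch below shows how such a matrix can be computed with the standard library only. The toy vectors, the function names, and the square, symmetric layout are assumptions for illustration; the row order and exact layout of the published `.npy` files are not documented here. With NumPy installed, a `.npy` file is normally read with `numpy.load`.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two equal-length vectors: 1 - cos(u, v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(embeddings):
    """Square, symmetric matrix of pairwise cosine distances (zero diagonal)."""
    n = len(embeddings)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = cosine_distance(embeddings[i], embeddings[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


# Hypothetical 2-dimensional embeddings for three proteins.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
toy_matrix = distance_matrix(toy_embeddings)
```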
---

**[1].-** *Prieto Santamaría L, Ugarte Carro E, Díaz Uzquiano M, Menasalvas Ruiz E, Pérez Gallardo Y, Rodríguez-González A. A data-driven methodology towards evaluating the potential of drug repurposing hypotheses. Comput Struct Biotechnol J. 2021 Aug 9;19:4559-4573. doi: 10.1016/j.csbj.2021.08.003. PMID: 34471499; PMCID: PMC8387760.*

**[2].-** *Brown AS, Patel CJ. A standard database for drug repositioning. Sci Data. 2017 Mar 14;4:170029. doi: 10.1038/sdata.2017.29. PMID: 28291243; PMCID: PMC5349249.*