Commit 06284d0c authored by Lucia Prieto

Update README.md

parent b3d17a5c
# Embeddings in DR
1. Supplementary `data.xlsx` description: Excel file, available in the present GitLab repository, with different sheets that contain the data used in the analysis.
1. Supplementary [`data.xlsx`](https://medal.ctb.upm.es/internal/gitlab/disnet/sequences/embeddings-in-dr/blob/master/data.xlsx) description: Excel file, available in the present GitLab repository, with different sheets that contain the data used in the analysis.
......@@ -66,47 +66,49 @@ The distance value for each protein pair filtered by PANTHERdb class is included
```
embeddings/
├── Global_embedding.ipynb
│ └── **Script to execute protein sequence embedding retrieval for all methods**
├── OneHot.tsv
│ └── **Final embeddings in One Hot encoding**
│ └── Final embeddings in One Hot encoding.
├── SGT.tsv
│ └── **Final embeddings in Sequence Graph Transform encoding**
│ └── Final embeddings in Sequence Graph Transform encoding.
├── ProtBERT.tsv
│ └── **Final embeddings in Pretrained BERT (Pro-Trans model) Transformer encoding**
│ └── Final embeddings in Pretrained BERT (Pro-Trans model) Transformer encoding.
└── SeqVec.tsv
└── **Final embeddings in Pretrained CNN + biLSTM (SeqVec model) encoding**
└── Final embeddings in Pretrained CNN + biLSTM (SeqVec model) encoding.
distances/
├── OneHot.npy
│ └── **Cosine distance matrix from all protein pair embeddings retrieved from One-Hot encoding**
│ └── Cosine distance matrix from all protein pair embeddings retrieved from One-Hot encoding. Rows and columns follow the order of the protein list in proteins/protein_list/proteins.csv.
├── SGT.npy
│ └── **Cosine distance matrix from all protein pair embeddings retrieved from SGT encoding**
│ └── Cosine distance matrix from all protein pair embeddings retrieved from SGT encoding. Rows and columns follow the order of the protein list in proteins/protein_list/proteins_sgt.csv.
├── ProtBERT.npy
│ └── **Cosine distance matrix from all protein pair embeddings retrieved from BERT encoding**
│ └── Cosine distance matrix from all protein pair embeddings retrieved from BERT encoding. Rows and columns follow the order of the protein list in proteins/protein_list/proteins.csv.
└── SeqVec.npy
└── **Cosine distance matrix from all protein pair embeddings retrieved from SeqVec encoding**
└── Cosine distance matrix from all protein pair embeddings retrieved from SeqVec encoding. Rows and columns follow the order of the protein list in proteins/protein_list/proteins.csv.
proteins/
├── protein_list/
│ ├── protein_names.csv
│ │ └── **List of all protein IDs (UniProt ID), names and descriptions.**
│ │ └── List of all protein IDs (UniProt ID), names and descriptions.
│ ├── proteins.csv
│ │ └── **Total protein IDs retrieved for embeddings in all embedding methods except SGT after length filtering**
│ │ └── Total protein IDs retrieved for embeddings in all embedding methods except SGT after length filtering.
│ └── proteins_sgt.csv
│ └── **Total protein IDs retrieved for embeddings generation in SGT**
│ └── Total protein IDs retrieved for embeddings generation in SGT.
├── proteins_DR/
│ ├── CSBJ-ND.tsv
│ │ └── **List of total protein pairs linked to disease pairs that share the same repurposed pharmaceutical indication, from repurposing data gathered in Literature[1]**
│ │ └── List of total protein pairs linked to disease pairs that share the same repurposed pharmaceutical indication, from repurposing data gathered from the literature [1].
│ └── repoDB.tsv
│ └── **List of total protein pairs linked to disease pairs that share the same repurposed pharmaceutical indication, from repurposing data gathered from repoDB[2]**
│ └── List of total protein pairs linked to disease pairs that share the same repurposed pharmaceutical indication, from repurposing data gathered from RepoDB [2].
└── proteins_DR_different_class/
├── CSBJ-ND.tsv
│ └── **Protein pair lists from proteins_DR/CSBJ-ND.tsv, limited to protein pairs that do not share the same protein functional class (Filtered subset)**
│ └── Protein pair lists from proteins_DR/CSBJ-ND.tsv, limited to protein pairs that do not share the same protein functional class (Filtered subset).
└── repoDB.tsv
└── **Protein pair lists from proteins_DR/repoDB.tsv, limited to protein pairs that do not share the same protein functional class (Filtered subset
└── Protein pair lists from proteins_DR/repoDB.tsv, limited to protein pairs that do not share the same protein functional class (Filtered subset).
DR/
└── disease_protein.csv
└── List of disease-protein associations to be used in the search for new repurposing hypotheses. These data come from DisGeNET.
```
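Since the rows and columns of each distance matrix are ordered according to a protein list, looking up the distance for a protein pair means mapping each ID to its position in that list. The following is a minimal sketch of that lookup, not the repository's own code. The helper names (`load_protein_ids`, `load_distance_matrix`, `pair_distance`) are hypothetical, and it assumes that the protein list CSV holds one ID per row in its first column with no header row, which should be checked against the actual file.

```python
import csv


def load_protein_ids(path, column=0, has_header=False):
    """Read protein IDs from one column of a CSV file.

    The column index and the absence of a header row are assumptions;
    adjust them to the real layout of proteins.csv / proteins_sgt.csv.
    """
    with open(path, newline="") as handle:
        rows = [row for row in csv.reader(handle) if row]
    if has_header:
        rows = rows[1:]
    return [row[column] for row in rows]


def load_distance_matrix(path):
    """Load a square distance matrix stored as a .npy file (requires NumPy)."""
    import numpy as np  # imported here so the lookup helper works without NumPy

    return np.load(path)


def pair_distance(matrix, protein_ids, protein_a, protein_b):
    """Return the stored distance between two proteins.

    `protein_ids` must be in the same order as the matrix rows and columns.
    Raises KeyError if an ID is not in the list.
    """
    if len(matrix) != len(protein_ids):
        raise ValueError("matrix size does not match the protein list length")
    position = {protein_id: i for i, protein_id in enumerate(protein_ids)}
    return float(matrix[position[protein_a]][position[protein_b]])
```

With the repository files in place, a call could look like `pair_distance(load_distance_matrix("distances/SGT.npy"), load_protein_ids("proteins/protein_list/proteins_sgt.csv"), id_a, id_b)`, where `id_a` and `id_b` are UniProt IDs present in the list. Note that the SGT matrix must be paired with `proteins_sgt.csv`, while the other three matrices use `proteins.csv`.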
---
......