Commit 06284d0c authored by Lucia Prieto

Update README.md

parent b3d17a5c
 # Embeddings in DR
-1. Supplementary `data.xlsx` descriptionExcel file, available in the present Gitlab repository, with different sheets that contain the data used in the analysis.
+1. Supplementary [`data.xlsx](https://medal.ctb.upm.es/internal/gitlab/disnet/sequences/embeddings-in-dr/blob/master/data.xlsx)` descriptionExcel file, available in the present Gitlab repository, with different sheets that contain the data used in the analysis.
 <n>
@@ -66,47 +66,49 @@ The distance value for each protein pair filtered by PANTHERdb class is included
 ```
 embeddings/
-├── Global_embedding.ipynb
-│   └── **Script to execute protein sequence embedding retrieval for all methods**
 ├── OneHot.tsv
-│   └── **Final embeddings in One Hot encoding**
+│   └── Final embeddings in One Hot encoding.
 ├── SGT.tsv
-│   └── **Final embeddings in Sequence Graph Transform encoding**
+│   └── Final embeddings in Sequence Graph Transform encoding.
 ├── ProtBERT.tsv
-│   └── **Final embeddings in Pretrained BERT (Pro-Trans model) Transformer encoding**
+│   └── Final embeddings in Pretrained BERT (Pro-Trans model) Transformer encoding.
 └── SeqVec.tsv
-    └── **Final embeddings in Pretrained CNN + biLSTM (SeqVec model) encoding**
+    └── Final embeddings in Pretrained CNN + biLSTM (SeqVec model) encoding.
 distances/
 ├── OneHot.npy
-│   └── **Cosine distance matrix from all protein pair embeddings retrieved from One-Hot encoding**
+│   └── Cosine distance matrix from all protein pair embeddings retrieved from One-Hot encoding. Rows and columns are ordered according to the protein lists provided in the next directory (proteins.csv).
 ├── SGT.npy
-│   └── **Cosine distance matrix from all protein pair embeddings retrieved from SGT encoding**
+│   └── Cosine distance matrix from all protein pair embeddings retrieved from SGT encoding. Rows and columns are ordered according to the protein lists provided in the next directory (proteins_sgt.csv).
 ├── ProtBERT.npy
-│   └── **Cosine distance matrix from all protein pair embeddings retrieved from BERT encoding**
+│   └── Cosine distance matrix from all protein pair embeddings retrieved from BERT encoding. Rows and columns are ordered according to the protein lists provided in the next directory (proteins.csv).
 └── SeqVec.npy
-    └── **Cosine distance matrix from all protein pair embeddings retrieved from SeqVec encoding**
+    └── Cosine distance matrix from all protein pair embeddings retrieved from SeqVec encoding. Rows and columns are ordered according to the protein lists provided in the next directory (proteins.csv).
 proteins/
 ├── protein_list/
 │   ├── protein_names.csv
-│   │   └── **List of total protein IDs (Uniprot ID), name and descriptions.**
+│   │   └── List of total protein IDs (Uniprot ID), name and descriptions.
 │   ├── proteins.csv
-│   │   └── **Total protein IDs retrieved for embeddings in all embedding methods except SGT after length filtering**
+│   │   └── Total protein IDs retrieved for embeddings in all embedding methods except SGT after length filtering.
 │   └── proteins_sgt.csv
-│       └── **Total protein IDs retrieved for embeddings generation in SGT**
+│       └── Total protein IDs retrieved for embeddings generation in SGT.
 ├── proteins_DR/
 │   ├── CSBJ-ND.tsv
-│   │   └── **List of total protein pairs linked to disease pairs that share the same repurposed pharmaceutical indication, from repurpousing data gathered in Literature[1]**
+│   │   └── List of total protein pairs linked to disease pairs that share the same repurposed pharmaceutical indication, from repurpousing data gathered in Literature [1].
 │   └── repoDB.tsv
-│       └── **List of total protein pairs linked to disease pairs that share the same repurposed pharmaceutical indication, from repurpousing data gathered from repoDB[2]**
+│       └── List of total protein pairs linked to disease pairs that share the same repurposed pharmaceutical indication, from repurpousing data gathered from RepoDB [2].
 └── proteins_DR_different_class/
     ├── CSBJ-ND.tsv
-    │   └── **Protein pair lists from protein_DR/CSBJ-ND.tsv, limited to protein pairs that do not share the same protein functional class (Filtered subset)**
+    │   └── Protein pair lists from protein_DR/CSBJ-ND.tsv, limited to protein pairs that do not share the same protein functional class (Filtered subset).
     └── repoDB.tsv
-        └── **Protein pair lists from protein_DR/repoDB.tsv, limited to protein pairs that do not share the same protein functional class (Filtered subset
+        └── Protein pair lists from protein_DR/repoDB.tsv, limited to protein pairs that do not share the same protein functional class (Filtered subset).
+DR/
+└── disease_protein.csv
+    └── List of disease - protein associations to be used in the search for new repurposing hypotheses. These data comes from DisGeNET.
 ```
 ---
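
The updated README states that the rows and columns of each distance matrix follow the order of the matching protein list (`proteins.csv`, or `proteins_sgt.csv` for SGT). A minimal sketch of how a pairwise distance could be looked up under that ordering is given below. The helper name `pair_distance`, the toy IDs, and the toy values are hypothetical, and the commented loading step assumes one UniProt ID per line in the list file, which the README does not confirm.

```python
def pair_distance(matrix, protein_ids, protein_a, protein_b):
    """Return the distance between two proteins.

    Assumes row/column i of `matrix` corresponds to protein_ids[i],
    as the README describes for the files in distances/.
    Works with nested lists or a NumPy array.
    """
    index = {pid: i for i, pid in enumerate(protein_ids)}
    return float(matrix[index[protein_a]][index[protein_b]])


# Toy stand-in data (hypothetical IDs and values, not taken from the repository).
toy_ids = ["P00001", "P00002", "P00003"]
toy_matrix = [
    [0.0, 0.2, 0.7],
    [0.2, 0.0, 0.5],
    [0.7, 0.5, 0.0],
]

print(pair_distance(toy_matrix, toy_ids, "P00001", "P00003"))

# With the real files, loading might look like this (file layout is an assumption):
#   import numpy as np
#   matrix = np.load("distances/OneHot.npy")
#   with open("proteins/protein_list/proteins.csv") as fh:
#       protein_ids = [line.strip() for line in fh if line.strip()]
```

For `SGT.npy`, the ID list would come from `proteins_sgt.csv` instead, since that matrix is ordered by a different protein list.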