From 34d0481cb17112e661965d1a5962cb231e92a6df Mon Sep 17 00:00:00 2001
From: NATALIA GARCIA SANCHEZ
Date: Sun, 25 Feb 2024 23:26:52 +0000
Subject: [PATCH] Update README.md

---
 README.md | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 51 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index d4d1755..30f2df0 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,7 @@
 # Embeddings in DR
 
-Additional `data.xlsx` descriptionExcel file with different sheets that contain the data used in the analysis.
+
+1. Supplementary `data.xlsx` description: Excel file, available in the present GitLab repository, with different sheets that contain the data used in the analysis.
 
 - **DR cases – repoDB**
 
@@ -46,4 +47,52 @@ The distance value for each protein pair is included for every embedding method.
 
 - **PP by class – Distances**
 
-The distance value for each protein pair filtered by PANTHERdb class is included forevery embedding method. We indicated if the protein pair belongede to repoDB, Literature or both datasets.
\ No newline at end of file
+The distance value for each protein pair filtered by PANTHERdb class is included for every embedding method. We indicate whether each protein pair belongs to repoDB, Literature or both datasets.
+
+
+2. Supplementary files with the protein data used in the study, the sequence embeddings from the four reviewed methods and the protein pair distances, available at <https://drive.upm.es/s/egBAv71on4AgBdn?path=%2Fdata>
+
+**File folder structure**
+
+embeddings/
+├── Global_embedding.ipynb
+│   └── **Script to execute protein sequence embedding retrieval for all methods**
+├── OneHot.tsv
+│   └── **Final embeddings in One-Hot encoding**
+├── SGT.tsv
+│   └── **Final embeddings in Sequence Graph Transform encoding**
+├── ProtBERT.tsv
+│   └── **Final embeddings in Pretrained BERT (ProtTrans model) Transformer encoding**
+└── SeqVec.tsv
+    └── **Final embeddings in Pretrained CNN + biLSTM (SeqVec model) encoding**
+
+
+distances/
+├── OneHot.npy
+│   └── **Cosine distance matrix for all protein pairs, computed from the One-Hot embeddings**
+├── SGT.npy
+│   └── **Cosine distance matrix for all protein pairs, computed from the SGT embeddings**
+├── ProtBERT.npy
+│   └── **Cosine distance matrix for all protein pairs, computed from the BERT embeddings**
+└── SeqVec.npy
+    └── **Cosine distance matrix for all protein pairs, computed from the SeqVec embeddings**
+
+
+proteins/
+├── protein_list/
+│   ├── protein_names.csv
+│   │   └── **List of all protein IDs (UniProt ID), names and descriptions.**
+│   ├── proteins.csv
+│   │   └── **All protein IDs retrieved for embedding generation in every method except SGT, after length filtering**
+│   └── proteins_sgt.csv
+│       └── **All protein IDs retrieved for embedding generation in SGT**
+├── proteins_DR/
+│   ├── CSBJ-ND.tsv
+│   │   └── **List of all protein pairs linked to disease pairs that share the same repurposed pharmaceutical indication, from repurposing data gathered from Literature[1]**
+│   └── repoDB.tsv
+│       └── **List of all protein pairs linked to disease pairs that share the same repurposed pharmaceutical indication, from repurposing data gathered from repoDB[2]**
+└── proteins_DR_different_class/
+    ├── CSBJ-ND.tsv
+    │   └── **Protein pair list from proteins_DR/CSBJ-ND.tsv, limited to protein pairs that do not share the same protein functional class (filtered subset)**
+    └── repoDB.tsv
+        └── **Protein pair list from proteins_DR/repoDB.tsv, limited to protein pairs that do not share the same protein functional class (filtered subset)**
\ No newline at end of file
-- 
2.24.1