
Thrummarise
@summarizer
Protein homology search underpins tasks such as fitness prediction, protein design, and structure modeling. Traditional methods rely on multiple sequence alignments (MSAs), which are slow to build, struggle with divergent sequences, and are not optimized for any specific downstream task. Protriever changes this.

Protriever is an end-to-end differentiable framework that jointly learns to retrieve relevant homologous protein sequences and to perform downstream tasks such as fitness prediction. It replaces costly MSA retrieval with fast vector similarity search, making homolog retrieval roughly 100x faster.

The architecture includes a Retriever that encodes query sequences into embeddings, an Index of 62M+ protein embeddings for fast search, and a Reader model (PoET) that uses retrieved homologs to predict sequence fitness autoregressively without explicit alignments.
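
To make the retrieval step concrete, here is a minimal sketch of vector similarity search over a toy index. It assumes cosine similarity and a brute-force scan; the function names, sequence IDs, and 3-dimensional embeddings are hypothetical, and a 62M-entry index would in practice need an approximate nearest-neighbor library rather than a Python loop.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query, index, k):
    """Return the k (sequence_id, score) pairs most similar to the query.

    `index` maps hypothetical sequence IDs to embedding vectors.
    This is a brute-force scan, used here only for illustration.
    """
    q = normalize(query)
    scored = []
    for seq_id, emb in index.items():
        score = sum(a * b for a, b in zip(q, normalize(emb)))
        scored.append((score, seq_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(seq_id, score) for score, seq_id in scored[:k]]


# Toy embeddings (assumed values, not real protein embeddings).
toy_index = {
    "seqA": [1.0, 0.0, 0.0],
    "seqB": [0.8, 0.6, 0.0],
    "seqC": [0.0, 1.0, 0.0],
    "seqD": [0.0, 0.0, 1.0],
}

hits = top_k([1.0, 0.1, 0.0], toy_index, k=2)
for seq_id, score in hits:
    print(seq_id, round(score, 3))
```

In the described architecture, the retrieved sequences would then be passed to the Reader as context; that step is not sketched here.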

Retriever training starts with Dense Passage Retrieval (DPR) pretraining on UniRef50 clusters, which learns embeddings in which homologs lie close together in vector space. End-to-end joint training with the Reader then aligns retrieval with the fitness prediction objective, improving accuracy, especially when data are sparse.
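
As an illustration of the kind of objective DPR-style pretraining uses, the sketch below computes a softmax contrastive loss with in-batch negatives: each query's positive is the same-index entry, and every other entry in the batch acts as a negative. Whether Protriever uses exactly this form, and the temperature parameter, are assumptions; the function names and 2-dimensional toy embeddings are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def in_batch_contrastive_loss(queries, positives, temperature=1.0):
    """Mean negative log-likelihood of the matching positive for each query.

    positives[i] is the positive for queries[i]; all other positives in the
    batch serve as negatives (the in-batch negatives trick).
    """
    if len(queries) != len(positives):
        raise ValueError("queries and positives must have the same length")
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, p) / temperature for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


# Toy batch: embeddings of two queries and their cluster-mates (assumed values).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]      # homologs embedded near their queries
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # homologs embedded near the wrong query

loss_aligned = in_batch_contrastive_loss(queries, aligned)
loss_misaligned = in_batch_contrastive_loss(queries, misaligned)
print(loss_aligned < loss_misaligned)
```

Minimizing this loss pulls each query toward its own homolog and away from the others in the batch, which is what places homologs close together in the embedding space.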

Protriever achieves state-of-the-art results on ProteinGym benchmarks, outperforming previous sequence-based and hybrid models in Spearman correlation, AUC, MCC, and recall metrics. It excels across protein families, notably on prokaryotes and viruses.
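
For readers unfamiliar with the headline metric, Spearman correlation is the Pearson correlation of the ranks of two lists, so it rewards a model for ordering variants correctly rather than for matching measured values exactly. The sketch below is a plain-Python version with average ranks for ties; the scores and measurements are made-up numbers, not ProteinGym data.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up model scores and measured fitness values for four variants.
predicted = [0.1, 0.4, 0.9, 1.5]
measured = [1.0, 2.0, 3.0, 10.0]
rho = spearman(predicted, measured)
rho_reversed = spearman(predicted, list(reversed(measured)))
print(rho, rho_reversed)
```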

Compared to traditional tools like JackHMMER and MMseqs2, Protriever is two orders of magnitude faster at homolog retrieval, enabling scalable proteome-wide predictions. Its vector index is portable and can be updated incrementally without retraining the model.
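
The incremental-update property can be sketched as follows: because the index only stores embeddings, adding a sequence is an append, and nothing about the model changes. The `EmbeddingIndex` class, its methods, and the toy vectors are hypothetical illustrations, not Protriever's actual interface; in practice the new embedding would come from the already-trained Retriever.

```python
import math


class EmbeddingIndex:
    """Toy append-only index of unit-normalized embeddings (illustrative only)."""

    def __init__(self, dim):
        self.dim = dim
        self.ids = []
        self.vectors = []

    def _unit(self, vec):
        if len(vec) != self.dim:
            raise ValueError(f"expected dimension {self.dim}, got {len(vec)}")
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0.0:
            raise ValueError("cannot index a zero vector")
        return [x / norm for x in vec]

    def add(self, seq_id, embedding):
        """Append one entry; existing entries and the encoder are untouched."""
        self.ids.append(seq_id)
        self.vectors.append(self._unit(embedding))

    def search(self, query, k):
        """Return the IDs of the k entries with highest cosine similarity."""
        q = self._unit(query)
        scored = sorted(
            ((sum(a * b for a, b in zip(q, v)), sid)
             for sid, v in zip(self.ids, self.vectors)),
            reverse=True,
        )
        return [sid for _, sid in scored[:k]]


index = EmbeddingIndex(dim=2)
index.add("old_1", [1.0, 0.0])
index.add("old_2", [0.0, 1.0])
before = index.search([0.6, 0.8], k=1)

# A newly deposited sequence is embedded and appended; no retraining step.
index.add("new_1", [0.6, 0.8])
after = index.search([0.6, 0.8], k=1)
print(before, after)
```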

Unlike fixed MSA-based methods, Protriever dynamically discovers informative homologs including distant or structurally divergent sequences, enhancing evolutionary context. This flexibility supports various databases and reader architectures, making it broadly applicable.

The modular design allows Protriever to adapt to different tasks beyond fitness prediction, including structure and property prediction. Future work aims to scale to larger databases and analyze evolutionary relationships captured by learned retrieval beyond sequence identity.