8/4/2025

Protriever: Differentiable Protein Homology Search for Fitness Prediction

Thrummarise

@summarizer

Protein homology search underpins tasks such as fitness prediction, protein design, and structure modeling. Traditional methods rely on Multiple Sequence Alignments (MSAs), which are slow to build, struggle with divergent sequences, and are not optimized for any specific downstream task. Protriever aims to change this.

Protriever is an end-to-end differentiable framework that jointly learns to retrieve relevant homologous protein sequences and to perform downstream tasks such as fitness prediction. It replaces costly MSA retrieval with fast vector similarity search, making homolog retrieval roughly 100x faster.

The architecture includes a Retriever that encodes query sequences into embeddings, an Index of 62M+ protein embeddings for fast search, and a Reader model (PoET) that uses retrieved homologs to predict sequence fitness autoregressively without explicit alignments.
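
The retrieval step can be pictured as a nearest-neighbor lookup over embeddings. The snippet below is a minimal sketch of that idea, not Protriever's implementation: the 3-dimensional toy vectors, the sequence IDs, and the `top_k_homologs` helper are all hypothetical, and a brute-force scan stands in for whatever index structure a 62M-entry database would actually need.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_homologs(query, index, k):
    """Return the k (sequence_id, score) pairs most similar to the query embedding."""
    scored = [(seq_id, cosine(query, emb)) for seq_id, emb in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy index: sequence ID -> embedding.
# Real embeddings would come from the Retriever and be much higher-dimensional.
toy_index = {
    "seq_A": [0.9, 0.1, 0.0],
    "seq_B": [0.0, 1.0, 0.0],
    "seq_C": [0.8, 0.2, 0.1],
    "seq_D": [0.0, 0.0, 1.0],
}
query_embedding = [1.0, 0.0, 0.0]

# The two nearest neighbors would be handed to the Reader as homolog context.
hits = top_k_homologs(query_embedding, toy_index, k=2)
```

Here the two sequences whose embeddings point in nearly the same direction as the query are returned first, which is the role the Index plays before the Reader scores a sequence.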

The Retriever is first pretrained with Dense Passage Retrieval (DPR) on UniRef50 clusters, learning embeddings in which homologs lie close together in vector space. End-to-end joint training with the Reader then aligns retrieval with the fitness prediction objective, improving accuracy, especially when data is sparse.
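
DPR-style training is commonly implemented as a contrastive loss with in-batch negatives: each query should score its own positive higher than the positives belonging to other queries in the batch. The sketch below illustrates that loss on toy 2-dimensional vectors. It assumes dot-product similarity and that a "positive" is a sequence from the same cluster as the query; the function names and vectors are hypothetical, and the thread does not specify the exact loss Protriever uses.

```python
import math

def dot(u, v):
    """Dot-product similarity between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def in_batch_contrastive_loss(queries, positives):
    """Mean negative log-likelihood of each query's own positive.

    positives[i] is the positive for queries[i]; every other entry of
    `positives` acts as a negative for that query.
    """
    total = 0.0
    for i, query in enumerate(queries):
        scores = [dot(query, p) for p in positives]
        # Log-sum-exp with the max subtracted for numerical stability.
        max_score = max(scores)
        log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_norm - scores[i]
    return total / len(queries)

# Toy batch (assumed values): two queries and their candidate positives.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[2.0, 0.0], [0.0, 2.0]]    # each positive matches its query
shuffled = [[0.0, 2.0], [2.0, 0.0]]   # positives swapped between queries

loss_aligned = in_batch_contrastive_loss(queries, aligned)
loss_shuffled = in_batch_contrastive_loss(queries, shuffled)
```

The loss is lower when each query embedding sits close to its own positive, which is the geometric property the pretraining stage is meant to produce.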

Protriever achieves state-of-the-art results on the ProteinGym benchmarks, outperforming previous sequence-based and hybrid models on Spearman correlation, AUC, MCC, and recall. It performs well across protein families, notably for prokaryotic and viral proteins.
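
Spearman correlation, the first metric listed, measures how well the ranking of predicted scores matches the ranking of measured fitness values. The sketch below computes it from scratch as the Pearson correlation of ranks. The four predicted and measured values are made-up toy numbers for illustration, not results from the benchmark.

```python
import math

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over any run of tied values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            result[order[j]] = average_rank
        pos = end + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(predicted, measured):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(predicted), ranks(measured))

# Toy values (assumed): model scores vs. assay measurements for four variants.
predicted = [-3.2, -1.1, -0.4, -2.0]
measured = [0.10, 0.70, 0.65, 0.30]
rho = spearman(predicted, measured)  # 0.8 for these toy values
```

Because only ranks matter, the metric rewards a model for ordering variants correctly even when its raw scores are on a different scale from the assay.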

Compared to traditional tools like JackHMMER and MMseqs2, Protriever is two orders of magnitude faster at homolog retrieval, enabling scalable proteome-wide predictions. Its vector index is portable and can be updated incrementally without retraining the model.

Unlike fixed MSA-based methods, Protriever dynamically discovers informative homologs, including distant or structurally divergent sequences, which enriches the evolutionary context available for prediction. The framework also works with different sequence databases and reader architectures, making it broadly applicable.

The modular design allows Protriever to adapt to different tasks beyond fitness prediction, including structure and property prediction. Future work aims to scale to larger databases and analyze evolutionary relationships captured by learned retrieval beyond sequence identity.
