AIDO.StructureTokenizer: Bridging Protein Structure and Sequence Modeling
Type | research |
---|---|
Area | AI Digital Twin |
Published(YearMonth) | 2412 |
Source | https://www.biorxiv.org/content/10.1101/2024.12.02.626366v1 |
Tag | newsletter |
AIDO.StructureTokenizer (AIDO.St) is a protein structure tokenizer for indexing, searching, and generating protein structures at the scale of resources such as the AlphaFold Protein Structure Database. The 300M-parameter, VQ-VAE-based model pairs an equivariant encoder, which discretizes a protein structure into tokens, with an invariant decoder, which reconstructs the structure from those tokens. Its design balances the encoder's ability to localize and retrieve structural information against the decoder's reconstruction precision. In comparisons with Foldseek, ProToken, and ESM3, the authors report better protein structure retrieval and reconstruction. Aligning the structural tokens with protein sequence language models also improves structure prediction accuracy, which makes AIDO.St a tool for integrating the structure and sequence modalities. The models and code are openly available.
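The discretization step of a VQ-VAE can be illustrated with a nearest-codebook lookup. The sketch below is not the AIDO.St implementation: the two-dimensional embeddings, the four-entry codebook, and the function names `quantize` and `lookup` are all hypothetical, and the encoder and decoder networks are omitted. It only shows how continuous per-residue embeddings could be mapped to integer tokens and back to codebook vectors.

```python
# Minimal vector-quantization sketch (illustrative only; not AIDO.St code).
# Assumption: an encoder has already produced one embedding per residue.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(embeddings, codebook):
    """Return, for each embedding, the index of its nearest codebook vector."""
    tokens = []
    for vec in embeddings:
        distances = [squared_distance(vec, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens

def lookup(tokens, codebook):
    """Map tokens back to codebook vectors (what a decoder would receive)."""
    return [codebook[t] for t in tokens]

# Hypothetical toy codebook with 4 entries of dimension 2.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Stand-in for encoder output: one embedding per residue.
embeddings = [[0.1, -0.05], [0.9, 0.2], [0.2, 0.8], [0.95, 1.1], [0.9, 0.1]]

tokens = quantize(embeddings, codebook)
quantized = lookup(tokens, codebook)
print(tokens)  # [0, 1, 2, 3, 1]
```

In this toy setup, residues with similar embeddings (the second and fifth) receive the same token, which is the property that lets a token sequence be indexed and searched like text.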