NatureLM: Deciphering the Language of Nature for Scientific Discovery

NatureLM Team
Microsoft Research AI for Science

NatureLM Highlights

NatureLM, developed by Microsoft Research AI for Science, is a groundbreaking sequence-based science foundation model designed to unify multiple scientific domains, including small molecules, materials, proteins, DNA and RNA. This innovative model leverages the "language of nature" to enable scientific discovery through text-based instructions.

Key Features:

  • Cross-Domain Integration: NatureLM excels at generating scientific entities across domains, bridging tasks that require multi-domain knowledge, such as designing ligands for protein receptors, generating RNAs tailored to RNA-binding proteins, and engineering guide RNAs for targeted CRISPR-based DNA editing.
  • Text-Driven Generation and Optimization: It generates and optimizes biological, chemical and material entities directly from text instructions.
  • Top Performance: NatureLM demonstrates top performance on many scientific tasks, often matching or surpassing state-of-the-art specialist models.
  • Scalability: NatureLM is available in different sizes (1 billion, 8 billion, and 46.7 billion parameters), with performance improving as the model size increases.
  • Versatile Applications: From drug discovery and material design to therapeutic protein development, NatureLM offers a unified approach to tackle complex scientific challenges.

NatureLM represents a significant advancement in the field of artificial intelligence, providing researchers with a powerful tool to drive innovation and accelerate scientific breakthroughs. By integrating knowledge across multiple scientific domains, NatureLM paves the way for new discoveries and advancements in various fields of science.

Examples

Researchers interact with NatureLM through text instructions such as the following.

Molecule generation from text-specified properties (Section 3.2)

Instruction: Generate a molecule with four hydrogen bond donors.

Response: C(C[C@@H](C(=O)O)N)CN=C(N)N

One can replace `four hydrogen bond donors` with other properties such as `QED 0.5`, `TPSA 90`, or `three hydrogen bond donors`.
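The property substitution can be scripted. The following is a minimal sketch, assuming a hypothetical helper `property_instruction` (not part of NatureLM) that fills a property phrase into the instruction wording used in this example:

```python
def property_instruction(prop: str) -> str:
    """Build a property-conditioned instruction (hypothetical helper)."""
    return f"Generate a molecule with {prop}."

# Property phrases taken from the example above.
for prop in ["four hydrogen bond donors", "QED 0.5", "TPSA 90"]:
    print(property_instruction(prop))
```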

Protein sequence to full molecule design (Section 3.4)

Instruction: Produce a compound guided by the target. <protein>LALSLTADQMVSALL...SYDLLLEMLDAH</protein>

Response: <mol>CC1=C(c2cccc(O)c2)C(c2ccc(I)cc2)Oc2ccc(O)cc21</mol>
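Instructions and responses in these examples wrap each entity in a paired tag such as `<protein>` or `<mol>`. The sketch below shows one way to build and parse such strings with a regular expression. The helper names `wrap` and `extract` are assumptions made for illustration, not part of any NatureLM API.

```python
import re

# Matches <tag>payload</tag>; the backreference \1 requires the closing
# tag name to equal the opening one. Surrounding whitespace is trimmed.
TAG_RE = re.compile(r"<(\w+)>\s*(.*?)\s*</\1>", re.DOTALL)

def wrap(tag: str, payload: str) -> str:
    """Wrap a sequence or SMILES string in a paired tag (hypothetical helper)."""
    return f"<{tag}>{payload}</{tag}>"

def extract(text: str) -> list[tuple[str, str]]:
    """Return (tag, payload) pairs found in a model response."""
    return TAG_RE.findall(text)

# The protein sequence is truncated in the source, so the ellipsis is kept as-is.
instruction = "Produce a compound guided by the target. " + wrap(
    "protein", "LALSLTADQMVSALL...SYDLLLEMLDAH"
)
response = "<mol>CC1=C(c2cccc(O)c2)C(c2ccc(I)cc2)Oc2ccc(O)cc21</mol>"
print(extract(response))
```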

Protein sequence to molecular fragment generation (Section 3.4)

Instruction: Design a compound with reference to the target <protein>DTKEQRILR….EKAIYQGP</protein> and the fragment <fragA> O=c1[nH]cnc2c(O)cc([*:1])c([*:2])c12 </fragA>.

Response: <fragB>Fc1ccc([*:1])cc1.Fc1ccc([*:2])cc1</fragB>

Here, `[*:digit]` marks a connection point of the molecular fragment.
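A quick sanity check on such a response is that the generated fragments carry the same connection-point labels as the input fragment. This is a minimal sketch using only string matching; it does not validate chemistry, and `attachment_labels` is a hypothetical helper name.

```python
import re

# Matches labelled connection points such as [*:1] and captures the digit(s).
ATTACH_RE = re.compile(r"\[\*:(\d+)\]")

def attachment_labels(smiles: str) -> set[int]:
    """Collect the numeric labels of all [*:n] connection points."""
    return {int(n) for n in ATTACH_RE.findall(smiles)}

frag_a = "O=c1[nH]cnc2c(O)cc([*:1])c([*:2])c12"
frag_b = "Fc1ccc([*:1])cc1.Fc1ccc([*:2])cc1"

# Both sides of the example expose connection points 1 and 2.
print(attachment_labels(frag_a) == attachment_labels(frag_b))
```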

Binding affinity improvement (Section 3.5)

Instruction: Improve the binding affinity on Uridine-cytidine kinase 2 of <mol>Cc1ccc(-c2nc3c(c(SCC(=O)Nc4ccccc4)n2)Cc2cccc(C)c2O3)cc1</mol>.

Response: <mol>Cc1ccc(-c2nc3c(c(SCC(=O)Nc4cccc(C(=O)O)c4)n2)Cc2cccc(C)c2O3)cc1</mol>

Metabolism optimization (Section 3.6)

Instruction: Transform <mol>COCCn1c(=O)c(-c2ccccc2)nc2cnc(N3CCOCC3)nc21</mol> to yield lower CYP1A2 inhibition and present a better metabolic safety record.

Response: <mol>COc1ccc(-c2nc3cnc(N4CCOCC4)nc3n(Cc3cccc(OC)c3)c2=O)cc1</mol>

You can replace CYP1A2 with another enzyme of interest.

Reaction prediction (Section 3.7)

Instruction: Please suggest possible reactants for the given product

<product>CC(=O)c1ccc2c(ccn2C(=O)OC(C)(C)C)c1</product>.

Response: <reactant>CC(=O)c1ccc2[nH]ccc2c1.CC(C)(C)OC(=O)OC(=O)OC(C)(C)C</reactant>
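In SMILES notation a `.` separates disconnected components, so the reactant string above contains two molecules. A minimal sketch of splitting it into individual reactants follows; `split_components` is a hypothetical helper, and the input is assumed to be plain SMILES with the `<reactant>` tags already stripped.

```python
def split_components(smiles: str) -> list[str]:
    """Split a dot-separated SMILES string into its components."""
    return [part for part in smiles.split(".") if part]

reactants = "CC(=O)c1ccc2[nH]ccc2c1.CC(C)(C)OC(=O)OC(=O)OC(C)(C)C"
for component in split_components(reactants):
    print(component)
```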

Protein generation (Section 4.2)

Instruction: I require a stable protein sequence, kindly generate one.

Response: <protein>GSVEVKIGTDRTTASSQDEAEERARRAGARRVRIDGNEVEVEM</protein>


Instruction: Generate a soluble protein sequence.

Response: <protein>MSLPNIFVEELVHTPIEKQVIEIVERKGIGHPDSLADGMAEAMSRELSREYIRRFGAVLHHNTDETQIVAGRSNPQFGGGEVIEPIYVLLVGRATKFFNGEYIPTDKIALKAARDYIRQHMQNLDPELDVVFNVRLGEGSTDLQDVFRRKSGNVALANDTSFGIGFAPLSETERLVFNVERRIYEEFRKKNPAIGEDVKVMGLREKDRISLTIAAAFVDRYVANIKEYDAIKEELENFVKEISSEYTEREVEVFVNTADDYETGCVYLTVTGTSAENGDDGSVGRGNRCNGLITPGRPMSMEASSGKNPINHVGKIYNLLANQIAARIAEEVEGVEEVYVRILSQIGKPINEPKALSVQVIPKSGYDISKLERPARDIAEEMIANVGKITDMVIEGKVRTFEGHHHHHH</protein>


Instruction: Produce a protein sequence that is not soluble.

Response: <protein>MAFLSQLQGGFAQLKQKAEAAAPSIADSLNQLKEAPKFMESAFKTGLNKVKLDSLGKVLNVPSLGRTQSPIDIVPVITAFGEHLQNAHFEVTYESTGEFKAVNDGNSIWLMREGNSSELAISFLPEEQYHLDAVNFHWATEPMNGSEHTIGGVGYAGEMHLIHRNTRFATMADALKQPNGVIAIAVFLNESHDDNAVFSPLINLLPQVIYKGSECKLCSFDFQTFFPVAEKTKEFWMYEGSETTDPFRETVNWIVIRAALPISSHQLDKLREVRAGRYDEEFSDKVPMKPLRPIQNPSSRTIQSSFRSVAGAPDLGFRQ</protein>

Antibody design and optimization (Section 4.3)

Instruction: Using antigen <protein>TQVCTGTDMKLR…GESSEDCQS</protein> and antibody frameworks <antibody> IVLTQTPS…LAVYYC</antibody> and <antibody>FGGGTRLEIEVQ</antibody>, create the CDR3 regions.

Response: <antibody>QQYSNYPWT</antibody>

Composition to material generation (Section 5.2)

Instruction: Build a material that has Li, Ti, Mn, Fe, O.

Response: <material> Li Li Li Li Ti Ti Ti Mn Mn Fe Fe Fe O O O O O O O O O O O O O O O O <sg8> </material>

Material generation with given bulk modulus (Section 5.3)

Instruction: Construct the composition for a material with a specified bulk modulus of 86.39 GPa.

Response: <material> Se Se Pd Sc <sg164> </material>
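Material responses list one element symbol per atom, followed by an `<sgN>` token. The sketch below parses such a response into element counts, assuming that `<sgN>` denotes the space group number; `parse_material` is a hypothetical helper, not part of NatureLM.

```python
import re
from collections import Counter
from typing import Optional

def parse_material(response: str) -> tuple[Counter, Optional[int]]:
    """Return (element counts, space group number or None)."""
    body = re.search(r"<material>(.*?)</material>", response, re.DOTALL)
    if body is None:
        raise ValueError("no <material> block found")
    tokens = body.group(1).split()
    # Element symbols are every token that is not the <sgN> marker.
    counts = Counter(t for t in tokens if not t.startswith("<sg"))
    # Assumption: N in <sgN> is the space group number.
    sg = next((int(t[3:-1]) for t in tokens if re.fullmatch(r"<sg\d+>", t)), None)
    return counts, sg

counts, sg = parse_material("<material> Se Se Pd Sc <sg164> </material>")
print(dict(counts), sg)
```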


Guide RNA design (Section 6.2)

Instruction: Generate a guide RNA for targeting the DNA sequence.

<dna>GACTGGCACCAG…CCCTCGC</dna>

Response: <rna>AGACACAGCGGGTGCTCTGC</rna>

Protein binding RNA design (Section 6.3)

Instruction: Given <protein>MSEY…SSGWGM</protein>, create an RNA molecule that binds to it.

Response: <rna>AAACAGG…CGTACGACA</rna>

BibTeX

@misc{naturelm2025,
    title={NatureLM: Deciphering the Language of Nature for Scientific Discovery}, 
    author={Yingce Xia and Peiran Jin and Shufang Xie and Liang He and Chuan Cao and Renqian Luo and Guoqing Liu and Yue Wang and Zequn Liu and Yuan-Jyue Chen and Zekun Guo and Yeqi Bai and Pan Deng and Yaosen Min and Ziheng Lu and Hongxia Hao and Han Yang and Jielan Li and Chang Liu and Jia Zhang and Jianwei Zhu and Kehan Wu and Wei Zhang and Kaiyuan Gao and Qizhi Pei and Qian Wang and Xixian Liu and Yanting Li and Houtian Zhu and Yeqing Lu and Mingqian Ma and Zun Wang and Tian Xie and Krzysztof Maziarz and Marwin Segler and Zhao Yang and Zilong Chen and Yu Shi and Shuxin Zheng and Lijun Wu and Chen Hu and Peggy Dai and Tie-Yan Liu and Haiguang Liu and Tao Qin},
    year={2025},
    eprint={2502.07527},
    archivePrefix={arXiv},
    primaryClass={cs.AI},
    url={https://arxiv.org/abs/2502.07527}, 
}