Nature Language Model: Deciphering the Language of Nature for Scientific Discovery

NatureLM Team
Microsoft Research AI for Science

NatureLM Highlights

Nature Language Model (NatureLM), developed by Microsoft Research AI for Science, is a groundbreaking sequence-based science foundation model designed to unify multiple scientific domains, including small molecules, materials, proteins, DNA and RNA. This innovative model leverages the "language of nature" to enable scientific discovery through text-based instructions.

Key Features:

  • Cross-Domain Integration: NatureLM excels at generating scientific entities across domains, bridging tasks that require multi-domain knowledge, such as designing ligands for protein receptors, generating RNAs tailored to RNA-binding proteins, and engineering guide RNAs for targeted CRISPR-based DNA editing.
  • Text-Driven Generation and Optimization: It generates and optimizes biological, chemical and material entities directly from text instructions.
  • Top Performance: NatureLM demonstrates top performance on many scientific tasks, often matching or surpassing state-of-the-art specialist models.
  • Scalability: NatureLM is available in different sizes (1 billion, 8 billion, and 46.7 billion parameters), with performance improving as the model size increases.
  • Versatile Applications: From drug discovery and material design to therapeutic protein development, NatureLM offers a unified approach to tackle complex scientific challenges.

NatureLM represents a significant advancement in the field of artificial intelligence, providing researchers with a powerful tool to drive innovation and accelerate scientific breakthroughs. By integrating knowledge across multiple scientific domains, NatureLM paves the way for new discoveries and advancements in various fields of science.

Method

Built within a GPT-style framework, NatureLM extends the power of conventional large language models by:

  • Pre-training on multi-scientific-domain data: NatureLM extends the pre-training of existing large language models using hundreds of billions of curated tokens from biology, chemistry, and materials science, building a robust foundation of cross-disciplinary knowledge while preserving the capabilities of the original models.
  • Instruction tuning for task adaptability: The model is fine-tuned on millions of question–answer pairs to better understand text-based prompts and generate relevant, contextually appropriate responses.
  • Custom fine-tuning for specialized tasks: Researchers can further tailor NatureLM to specific goals using techniques like reinforcement learning or domain-specific fine-tuning for optimal performance.

NatureLM comes in three different sizes—1B, 8B, and 46.7B (an 8x7B mixture of expert model) parameters—providing flexibility based on computational resources and the scope of a given project. Across 22 categories of scientific tasks, the larger models generally demonstrate higher performance, demonstrating the potential of continuously improved performance from more data and larger model.

Users can interact with NatureLM to facilitate scientific discovery and exploration. We offer several applications, including drug discovery, protein design, material design, and RNA design, enabling breakthroughs across diverse fields.

Applications

Drug Discovery

Protein to compound

Designing small molecules based on an input protein sequence is an exciting and promising challenge in drug discovery, especially when structural information is unavailable or incomplete. NatureLM addresses this task by enabling both full molecule generation and molecular fragment generation, using the protein target sequence as input. This process, known as "sequence-to-compound," aims to directly design potential drug candidates by leveraging the protein's amino acid sequence to generate molecules with strong binding affinity and favorable properties.


# Scenario 1: Complete molecule generation
Instruction: Produce a compound guided by the target <protein>LALSLTADQMVSALL...SYDLLLEMLDAH</protein>
Response: <mol>CC1=C(c2cccc(O)c2)C(c2ccc(I)cc2)Oc2ccc(O)cc21</mol>

# Scenario 2: Molecular fragment generation
Instruction: Design a compound with reference to the target  <protein>DTKEQRILR...EKAIYQGP</protein>  and the fragment <fragA>O=c1[nH]cnc2c(O)cc([*:1])c([*:2])c12</fragA>
Response: <fragB>Fc1ccc([*:1])cc1.Fc1ccc([*:2])cc1</fragB>

We evaluated NatureLM on a held-out test set by generating an equal number of small-molecule ligands for each protein target. The generated compounds were assessed using:

  • Docking Score: Calculated by AutoDock Vina, which estimates the binding affinity of the ligand to the protein. Lower scores indicate better binding affinity.
  • Synthetic Accessibility Score (SAS): A measure of the synthetic feasibility of the generated compounds, calculated using RDKit. Lower scores indicate easier synthesis.

From the full-compound generation results, we observed that NatureLM outperforms previous classification-based methods in generating compounds with better docking scores. Additionally, larger versions of NatureLM consistently produce compounds with improved binding affinity, as reflected in the docking scores. The synthetic accessibility score of NatureLM-generated compounds is generally favorable, demonstrating its ability to balance potency and synthesizability. A brief case study is presented in the accompanying video.

Vina SAS
TargetDiff -6.08 0.67
TamGen -6.66 0.76
NatureLM (1B) -6.8 0.82
NatureLM (8B) -6.92 0.81
NatureLM (8x7B) -6.95 0.82
Video 1: Design compounds using the protein sequence as input.

For molecular fragment generation, we selected three PDB files published after May 2024 to ensure they were not included in our training data. Using the target sequence and the fragment we aimed to retain, we generated molecular fragments. The results, demonstrated in the figure below, show that larger versions of NatureLM consistently yield better outcomes.

Fragment Docking
Fragment Docking
Figure 1: Designing compounds based on the target protein sequence and specific fragments to retain.

Text-Driven Generation and Optimization

Optimizing the binding affinity of a drug compound is a critical step in drug discovery, as it directly impacts the effectiveness and specificity of the compound for its biological target. Higher binding affinity often translates to improved efficacy and lower required dosages, which can reduce potential side effects. This process, however, is traditionally resource-intensive and time-consuming.

With NatureLM, researchers can streamline this process by interacting with the model using natural language. For example:


Instruction: Improve the binding affinity on Uridine-cytidine kinase 2 of <mol>Cc1ccc(-c2nc3c(c(SCC(=O)Nc4ccccc4)n2)Cc2cccc(C)c2O3)cc1</mol>
Response: <mol>Cc1ccc(-c2nc3c(c(SCC(=O)Nc4cccc(C(=O)O)c4)n2)Cc2cccc(C)c2O3)cc1</mol>
                        

NatureLM recommends the SMILES representations of candidate compounds with potentially improved binding affinity. Users can specify the desired number of candidates for generation and further evaluate them using methods such as docking, Free Energy Perturbation (FEP), Structure-Activity Relationship (SAR) analysis, or expert review. In two demonstrated cases, NatureLM successfully identified compounds that aligned with previously validated results from wet-lab experiments, highlighting its potential to propose viable candidates. Additionally, NatureLM can design entirely new compounds with superior docking scores, with further findings currently awaiting experimental validation.

Binding affinity optimzation
Figure 2: Binding affinity optimization. We explore two targets: Catechol O-methyltransferase and Beta-lactamase class C. "Input molecule" refers to the molecule being optimized. "Wet-lab verified" output denotes the output molecule validated with improved binding affinity through wet-lab experiments. "Improved docking score" represents the results with better docking scores compared to the previous two compounds.

Protein Design

(In)Soluble and stable protein design

We can chat with NatureLM to design proteins:


Instruction: I require a stable protein sequence, kindly generate one.
Instruction: Generate a soluble protein sequence.
Instruction: Produce a protein sequence that is not soluble.
Several showcases demonstrating these capabilities are provided below.
Protein Case
Figure 3: Generating stable, soluble, and insoluble proteins using text as input. The sequences are folded using OmegaFold.
Additionally, we utilize SFM-Protein to train a predictor that determines whether an input protein is stable. The success rates for the 1B, 8B, and 8x7B models are 64.4%, 65.7%, and 81.2%, respectively. These results demonstrate that larger models can lead to better performance.

Heme-binding protein design

Heme is an essential cofactor with critical roles in oxygen transport, electron transfer, and enzymatic activity. We use NatureLM to design proteins that can bind to heme:


Instruction: Design a heme-binding protein sequence.  
Response: <protein>MSAAEGAVVFSEEKEALVLK...</protein>
Instruction: Generate a protein that can bind to <mol>[Fe+2].C=CC1=C(C)c2cc3[n-]c(cc4[n-]c(cc5nc(cc1n2)C(C)...</mol>
Response: <protein>ETIGKRVFVHYCHGCHSQNALGI...</protein>

After generation, we use NatureLM's protein-to-description function to annotate the generated sequences. Only the sequences containing keywords related to heme are retained for further analysis. Subsequently, we employ Phoenix to predict the complex structure of the generated protein with the heme group. Following this, we evaluate the remaining candidate proteins.

Among the generated proteins, we identify a promising candidate with a sequence similarity of 0.55 to the NR database. Interestingly, while the sequence exhibits moderate similarity, the predicted structure closely resembles existing proteins. Notably, two histidine residues are positioned near the central iron atom of the heme group, facilitating coordinated bonding and potential functional activity.

Video 2: Designing a heme-binding protein from textual input.
Video 3: Designing a heme-binding protein from SMILES input.

Antibody design

Antigen-to-antibody design is a pivotal task in modern biomedical research, underpinning the development of targeted therapeutics and vaccines. By leveraging computational methods to predict and engineer the complementarity-determining regions (CDRs) of antibodies, researchers can create molecules that bind antigens with high specificity and affinity. NatureLM, for instance, enables the precise engineering of the CDR3 region of an antibody using the antigen sequence and the antibody’s framework. The example input below illustrates this application:


Instruction: Using antigen <protein>RDFNNLTKGLCTINSWHIYGKDNAVRIGEDSDVLVTREPYVSCDPDECRFYALSQGTTIRGKHSNGTIHDRSQYRAL...</protein> and antibody frameworks <antibody>QVQLQQSGAELVKPGASVRMSCKASGYTFTNYNMYWVKQSPGQGLEWIGI...</antibody>, <antibody>WGQGTTVTV</antibody>, create the CDR3 regions.
Response: <antibody>ARSEYGGYYAYAMDY</antibody>

NatureLM outputs the CDR3 regions. The figure visualized antigen–antibody complex confirms that the designed CDR region interacts effectively with the antigen.

Antigen to Antibody
Figure 4: Visualization of the antigen-antibody complex structure. The interaction is depicted using our internal tool to highlight the detailed binding relationship.

Material Design

NatureLM enables researchers to generate materials with specified properties by using natural language instructions. This text-to-material approach is particularly promising as it streamlines the discovery process, allowing researchers to explore vast design spaces more efficiently than traditional trial-and-error methods. Below, we present two examples that demonstrate its capabilities in material composition and property

Generating a material with specified elements

In the first example, we prompt NatureLM to draft a material that includes specific elements.


Instruction:  Draft a material that includes Sr, Nd, Bi, O. 
Response: <material> Sr Sr Sr Sr Nd Nd Bi Bi O O O O O O O O O O O O <sg14> </material>

This output specifies both the composition of the material (4 Sr, 2 Nd, 2 Bi, and 12 O per unit cell) and its space group (14 for this example). To further enhance its capabilities, we fine-tuned the 1B version of NatureLM, transforming it into a material structure generator referred to as NatureLM-Mat3D. NatureLM-Mat3D takes the composition and space group as input and autoregressively outputs the lattice parameters, followed by the atomic coordinates within that lattice.

Using MatterSim to evaluate the properties of the generated structure, we found that its energy above hull (e_hull) is 0.00, confirming that it is a stable material.

3D Material Representation
Figure 5. 3D representation of the predicted material.

Generating a Material with a Specified Bulk Modulus

Materials with an ultra-high bulk modulus are highly valued for their exceptional stiffness and incompressibility, making them critical for applications in extreme environments, industrial tools, and advanced engineering. To test NatureLM's ability to generate such materials, we tasked it with designing a material with a desired bulk modulus of 400 GPa, a value comparable to the stiffness of diamond:


Instruction:  Provide the composition for a material designed to have a desired bulk modulus of 400 GPa. 

Similar to the evaluation pipeline used in the first user application, we begin by using NatureLM-Mat3D to predict the 3D atomic structure of the material. We then employ MatterSim to estimate the e_hull and bulk modulus. We manually identified cases where MatterSim predicted bulk modulus values within a 5 GPa range of the target (400Gpa here). Two such cases were selected for further validation using density functional theory (DFT) calculations.


Response: <material> Os Os Os Re <sg187> </material>
Response: <material> Re Re Re C <sg187> </material>

The DFT results revealed bulk modulus values of 390 GPa and 394 GPa, which closely align with the target value of 400 GPa. Beyond achieving the bulk modulus target, the two generated structures were confirmed to be novel compared to those available in the Materials Project database. This novelty underscores NatureLM's potential for discovering new materials with exceptional mechanical properties, thereby broadening the scope of material design and innovation.

High Bulk Material Diagram
Figure 6. High Bulk Material Diagram

While NatureLM demonstrates significant potential in generating material compositions and predicting properties, its space group predictions are not always accurate, which can impact the precision of the resulting 3D structures and property evaluations. Despite this limitation, the method remains highly valuable as it drastically reduces the time and computational cost associated with exploring and identifying new material compositions.

RNA Design

Guide RNA design

Guide RNA (gRNA) plays a critical role in CRISPR-Cas9 gene editing by directing the Cas9 nuclease to a specific genomic target site. A gRNA consists of two components: a crispr RNA (crRNA), which is approximately 20 nucleotides long and complementary to the target DNA sequence, and a tracrRNA, which binds to Cas9. The crRNA garners more attention as it features a variable sequence critical for targeting specific DNA, whereas the tracrRNA has a more fixed role, primarily aiding in Cas9 binding. As an initial step in applying NatureLM to RNA design tasks, we can efficiently generate the crRNA component of gRNA sequences, as demonstrated in the following example:


Instruction: Generate a guide RNA for targeting the DNA sequence <dna>GACTGGCACCAG...CCCTCGC</dna>
Response: <rna>AGATATCGGGCCCGCGACGG</rna>

The NatureLM's response is a sequence that targets the DNA region and includes the PAM sequence (Protospacer Adjacent Motif). To obtain the final crRNA, remove the 'NGG' PAM sequence, reverse complement the DNA sequence, and apply the central dogma principle by converting DNA into RNA (replacing all 'T' with 'U'). These quick and easy steps ensure the resulting crRNA is ready for use.

A valid crRNA must meet the following criteria:

  1. Be 17 to 24 nucleotides long.
  2. Match a specific region in the provided DNA sequence.
  3. Be followed by an "NGG" Protospacer Adjacent Motif (PAM) in the DNA sequence, where "N" represents any of the four standard nucleotide bases: adenine (A), cytosine (C), guanine (G), or thymine (T).
The example figure below demonstrates that NatureLM can identify the target sequence for guide RNA, and the resulting guide RNA satisfies all the necessary criteria. With an impressive accuracy of 95.7% in designing valid gRNAs, NatureLM demonstrates that we are already very close to fully automated RNA design.

Guide RNA Design
Figure 7: Visualization of the designed guide RNA (gRNA) structure.

More Examples

We summarize all the cases presented in the paper in the following charts. These charts provide an overview of the generated tasks we have worked on.

Generate molecule with given text compound properties (Section 3.2)

Instruction: Generate a molecule with four hydrogen bond donors.

Response: C(C[C@@H](C(=O)O)N)CN=C(N)N

One can replace `four hydrogen bonds ` to other properties like `QED 0.5`, `TPSA 90`, `three hydrogen bond donors`, etc.

Protein sequence to full molecule design (Section 3.4)

Instruction: Produce a compound guided by the target. <protein>LALSLTADQMVSALL...SYDLLLEMLDAH</protein>

Response: <mol>CC1=C(c2cccc(O)c2)C(c2ccc(I)cc2)Oc2ccc(O)cc21</mol>

Protein sequence to molecular fragment generation (Section 3.4)

Instruction: Design a compound with reference to the target <protein>DTKEQRILR….EKAIYQGP</protein> and the fragment <fragA> O=c1[nH]cnc2c(O)cc([*:1])c([*:2])c12 </fragA>.

Response: <fragB>Fc1ccc([*:1])cc1.Fc1ccc([*:2])cc1</fragB>

Here, ''[*:digit]'' refers to the connection point of the molecular fragment.

Binding affinity improvement (Section 3.5)

Instruction: Improve the binding affinity on Uridine-cytidine kinase 2 of <mol>Cc1ccc(‑c2nc3c(c(SCC(=O)Nc4ccccc4)n2)Cc2cccc(C)c2O3)cc1</mol>.

Response: <mol>Cc1ccc(‑c2nc3c(c(SCC(=O)Nc4cccc(C(=O)O)c4)n2)Cc2cccc(C)c2O3)cc1</mol>

Metabolism optimization (Section 3.6)

Instruction: Transform <mol>COCCn1c(=O)c(-c2ccccc2)nc2cnc(N3CCOCC3)nc21</mol> to yield lower CYP1A2 inhibition and present a better metabolic safety record.

Response: <mol>COc1ccc(‑c2nc3cnc(N4CCOCC4)nc3n(Cc3cccc(OC)c3)c2=O)cc1</mol>

You can replace CYP to the enzymes you wish as input.

Reaction prediction (Section 3.7)

Instruction: Please suggest possible reactants for the given product

<product>CC(=O)c1ccc2c(ccn2C(=O)OC(C)(C)C)c1</product>.

Response: <reactant>CC(=O)c1ccc2[nH]ccc2c1.CC(C)(C)OC(=O)OC(=O)OC(C)(C)C</reactant>

Protein generation (Section 4.2)

Instruction: I require a stable protein sequence, kindly generate one.

Response:<protein>GSVEVKIGTDRTTASSQDEAEERARRAGARRVRIDGNEVEVEM</protein>


Instruction: Generate a soluble protein sequence.

Response:<protein>MSLPNIFVEELVHTPIEKQVIEIVERKGIGHPDSLADGMAEAMSRELSREYIRRFGAVLHHNTDETQIVAGRSNPQFGGGEVIEPIYVLLVGRATKFFNGEYIPTDKIALKAARDYIRQHMQNLDPELDVVFNVRLGEGSTDLQDVFRRKSGNVALANDTSFGIGFAPLSETERLVFNVERRIYEEFRKKNPAIGEDVKVMGLREKDRISLTIAAAFVDRYVANIKEYDAIKEELENFVKEISSEYTEREVEVFVNTADDYETGCVYLTVTGTSAENGDDGSVGRGNRCNGLITPGRPMSMEASSGKNPINHVGKIYNLLANQIAARIAEEVEGVEEVYVRILSQIGKPINEPKALSVQVIPKSGYDISKLERPARDIAEEMIANVGKITDMVIEGKVRTFEGHHHHHH</protein>


Instruction: Produce a protein sequence that is not soluble.

Response:<protein>MAFLSQLQGGFAQLKQKAEAAAPSIADSLNQLKEAPKFMESAFKTGLNKVKLDSLGKVLNVPSLGRTQSPIDIVPVITAFGEHLQNAHFEVTYESTGEFKAVNDGNSIWLMREGNSSELAISFLPEEQYHLDAVNFHWATEPMNGSEHTIGGVGYAGEMHLIHRNTRFATMADALKQPNGVIAIAVFLNESHDDNAVFSPLINLLPQVIYKGSECKLCSFDFQTFFPVAEKTKEFWMYEGSETTDPFRETVNWIVIRAALPISSHQLDKLREVRAGRYDEEFSDKVPMKPLRPIQNPSSRTIQSSFRSVAGAPDLGFRQ</protein>

Antibody design and optimization (Section 4.3)

Instruction: Using antigen <protein>TQVCTGTDMKLR…GESSEDCQS</protein> and antibody frameworks <antibody> IVLTQTPS…LAVYYC</antibody> and <antibody>FGGGTRLEIEVQ</antibody>, create the CDR3 regions.

Response: <antibody>QQYSNYPWT</antibody>

Composition to material generation (Section 5.2)

Instruction: Build a material that has Li, Ti, Mn, Fe, O.

Response: <material> Li Li Li Li Ti Ti Ti Mn Mn Fe Fe Fe O O O O O O O O O O O O O O O O <sg8> </material>

Material generation with given bulk modulus (Section 5.3)

Instruction: Construct the composition for a material with a specified bulk modulus of 86.39 GPa.

Response: <material> Se Se Pd Sc <sg164> </material>

Design material with given bulk modulus.

Guide RNA design (Section 6.2)

Instruction: Generate a guide RNA for targeting the DNA sequence.

<dna>GACTGGCACCAG...CCCTCGC</dna>

Response: <rna>AGACACAGCGGGTGCTCTGC</rna>

Protein binding RNA design (Section 6.3)

Instructions: Given <protein>MSEY…SSGWGM</protein>, create an RNA molecule that binds to it.

Response: <rna>AAACAGG…CGTACGACA</rna>

BibTeX

@misc{NatureLM2025,
    title={Nature Language Model: Deciphering the Language of Nature for Scientific Discovery}, 
    author={Yingce Xia and Peiran Jin and Shufang Xie and Liang He and Chuan Cao and Renqian Luo and Guoqing Liu and Yue Wang and Zequn Liu and Yuan-Jyue Chen and Zekun Guo and Yeqi Bai and Pan Deng and Yaosen Min and Ziheng Lu and Hongxia Hao and Han Yang and Jielan Li and Chang Liu and Jia Zhang and Jianwei Zhu and Ran Bi and Kehan Wu and Wei Zhang and Kaiyuan Gao and Qizhi Pei and Qian Wang and Xixian Liu and Yanting Li and Houtian Zhu and Yeqing Lu and Mingqian Ma and Zun Wang and Tian Xie and Krzysztof Maziarz and Marwin Segler and Zhao Yang and Zilong Chen and Yu Shi and Shuxin Zheng and Lijun Wu and Chen Hu and Peggy Dai and Tie-Yan Liu and Haiguang Liu and Tao Qin},
    year={2025},
    eprint={2502.07527},
    archivePrefix={arXiv},
    primaryClass={cs.AI},
    url={https://arxiv.org/abs/2502.07527}, 
}