CATNIP for chemists: Data-driven tool broadens access to green chemistry

Emily Kagey and Lauren Smith

Oct 1, 2025

Image: Abstract representation of the connections between chemical space and protein space that can be achieved using CATNIP, a new data-driven tool. Source: Rajani Arora, University of Michigan Life Sciences Institute.

Carnegie Mellon University and University of Michigan researchers have developed a new tool that makes greener chemistry more accessible. Described in Nature, the tool removes a major barrier to wider adoption of biocatalysis.

Biocatalysts, also called enzymes, are proteins that have evolved to perform complex chemistry with remarkable efficiency, typically in water and at room temperature, removing the need for toxic or expensive chemical reagents. But they are also highly selective: they are specialized to work with the specific starting compounds, or substrates, that they encounter in their natural environment.

To capitalize on the power of biocatalysts in the lab, though, chemists need to know what other substrates a protein can work with and, more precisely, which enzymes will work with their desired substrate.

"Biocatalysis offers a more sustainable way to build molecules, and it can also give us access to molecules that we couldn't build using traditional chemical methods," explains Alison Narayan. "But most of the known substrates for these biocatalysts come from nature, which is just a very small subset of the molecules that chemists work with."

Narayan, a chemistry professor at the University of Michigan, and Gabe Gomes, an assistant professor of chemical engineering and chemistry at Carnegie Mellon, created CATNIP, a tool that bridges the longstanding gap between the starting compounds chemists are working with and the enzymes that could potentially react with those compounds.

The project began with an effort to match proteins with substrates on a large scale. Focusing on one family of enzymes, Alexandra Paton designed a high-throughput reaction platform that allowed the team to test more than 100 substrates against each protein across the entire protein family.

"We discovered hundreds of new connections between chemical space and protein space and built this diverse dataset," says Paton, a former postdoctoral fellow in Narayan's lab. "That is when we began to think more broadly about what we could build with all this data."

Gomes and Daniil Boiko ('25), a Ph.D. student at Carnegie Mellon at the time, envisioned an enzyme recommender system. The two teams leveraged the dataset and machine learning to create a predictive model that can navigate between the protein landscape and the chemical landscape.

Their model maps a substrate in chemical space, selects similar (or neighboring) substrates, looks up known reactivity for the substrate and neighbors, and constructs a list of compatible enzymes and their neighbors. A separate model re-ranks the list, putting promising enzyme candidates on top. Gomes and Boiko's two-step approach is unique in the field.
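The retrieve-then-re-rank idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the CATNIP implementation: the feature sets, the Tanimoto similarity measure, the summed-similarity scoring rule, and every name (`sub_A`, `enz_1`, `recommend`, and so on) are assumptions made for the example. The real tool uses a separate learned model for re-ranking, and the expansion to neighboring enzymes is omitted here.

```python
from collections import defaultdict


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def nearest_substrates(query, library, k):
    """Return the k library substrates most similar to the query."""
    scored = [(tanimoto(query, feats), name) for name, feats in library.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by name
    return [(name, sim) for sim, name in scored[:k]]


def candidate_enzymes(neighbors, known_reactivity):
    """Step 1: gather enzymes with known activity on neighboring substrates."""
    support = defaultdict(list)
    for name, sim in neighbors:
        for enzyme in known_reactivity.get(name, ()):
            support[enzyme].append(sim)
    return dict(support)


def rerank(candidates):
    """Step 2 (stand-in): order enzymes by summed neighbor similarity.

    Assumption: a simple heuristic replaces the learned re-ranking model.
    """
    return sorted(candidates, key=lambda e: (-sum(candidates[e]), e))


def recommend(query, library, known_reactivity, k=2):
    neighbors = nearest_substrates(query, library, k)
    return rerank(candidate_enzymes(neighbors, known_reactivity))


# Hypothetical toy data: substrates described by sets of structural features.
library = {
    "sub_A": {"aryl", "ketone", "methyl"},
    "sub_B": {"aryl", "ketone", "hydroxyl"},
    "sub_C": {"alkene", "ester"},
}
known_reactivity = {
    "sub_A": ["enz_1", "enz_2"],
    "sub_B": ["enz_2"],
    "sub_C": ["enz_3"],
}
query = {"aryl", "ketone", "methyl", "hydroxyl"}

print(recommend(query, library, known_reactivity, k=2))
```

In this toy case the enzyme supported by two similar substrates is listed ahead of the enzyme supported by only one, which is the kind of ordering a re-ranking step is meant to produce.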

The predictive capability of the model is analogous to web search. Web search results are optimized so that we get good results on the first page, though there's no guarantee that the first result listed will be the best answer to the query. "But if it's not the first, it's probably the second. If it's not the second, probably the third. That's basically what we do here," says Boiko.
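One common way to quantify "if it's not the first, it's probably the second" is a top-k hit rate: the fraction of queries for which at least one truly active enzyme appears among the first k recommendations. The sketch below is illustrative only. The article does not say how the authors evaluated the model, and the function name and the toy rankings are hypothetical.

```python
def top_k_hit_rate(ranked_lists, true_hits, k):
    """Fraction of queries with at least one true hit among the top k results."""
    if not ranked_lists:
        return 0.0
    hits = sum(
        1
        for query, ranking in ranked_lists.items()
        if any(enzyme in true_hits.get(query, set()) for enzyme in ranking[:k])
    )
    return hits / len(ranked_lists)


# Hypothetical rankings for three query substrates and their known active enzymes.
ranked = {
    "q1": ["e1", "e2", "e3"],
    "q2": ["e4", "e5", "e6"],
    "q3": ["e7", "e8", "e9"],
}
active = {"q1": {"e1"}, "q2": {"e5"}, "q3": {"e9"}}

for k in (1, 2, 3):
    print(k, round(top_k_hit_rate(ranked, active, k), 3))
```

A ranking model that behaves like a good search engine shows a hit rate that climbs quickly as k grows from 1 to a handful of candidates.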

The resulting open-access CATNIP online platform lets chemists enter a starting compound and receive a ranked list of the biocatalysts from this protein family most likely to carry out the desired transformation. Going in the other direction, they can start with an enzyme of interest and identify its potential substrates.

The ranked list lets scientists focus their experiments on a smaller set of candidates when determining the best-performing enzyme. "Our model offers scientists a way to derisk their experimental planning when choosing the enzyme to perform a transformation," says Gomes.

"It is a great starting model to enable synthetic campaigns using biocatalysts," says Paton. "And there is already work underway to begin expanding the database beyond this one enzyme family."

This research is part of the National Science Foundation Center for Chemoenzymatic Synthesis.


For media inquiries, please contact Lauren Smith at lsmith2@andrew.cmu.edu.