AI Revolutionizes Cell Biology Research
Massachusetts Institute of Technology
Studying gene expression in cancer cells can provide valuable insights into the disease's origin and treatment outcomes. However, cells are intricate, multilayered systems, which poses a challenge for biologists. Measurements of proteins, gene expression, and cell morphology each offer a distinct perspective on cancer's impact. Where in the cell a piece of information originates is crucial, but capturing a comprehensive view requires scientists to employ several measurement techniques and analyze the results one after another.
Machine learning has the potential to expedite this process, but existing methods combine all data from different measurement modalities, making it challenging to trace the origin of specific information. To address this issue, researchers from the Broad Institute of MIT and Harvard, along with ETH Zurich/Paul Scherrer Institute (PSI), developed an AI-driven framework.
This innovative framework learns to distinguish between shared and unique information across various measurement modalities, providing a more holistic understanding of the cell's state. By identifying the source of each data point, biologists can gain a clearer picture of cellular interactions, aiding in the comprehension of disease mechanisms and the progression of various conditions, including cancer, Alzheimer's, and diabetes.
Xinyi Zhang, a former graduate student at MIT and now a group leader at AITHYRA in Vienna, Austria, emphasizes the importance of this approach. She states, 'Studying cells often requires multiple measurements, and our goal is to integrate this information effectively. By combining data from different modalities, we can achieve a more comprehensive understanding of the cell's state.'
The research team, including G.V. Shivashankar and senior author Caroline Uhler, developed a machine-learning framework that specifically identifies overlapping and unique information between measurement modalities. This framework enables users to input cell data and automatically determine the source of each data point.
The researchers rethought the typical design of autoencoders, the machine-learning models commonly used for multimodal cellular measurements. These models typically learn a separate representation for each modality, but the MIT method introduces a shared representation space for overlapping data and distinct spaces for the data unique to each modality, resembling a Venn diagram.
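To make the Venn-diagram layout concrete, the following is a minimal sketch, not the researchers' implementation. It assumes two hypothetical modalities labeled "rna" and "atac", untrained random linear encoders, and simple averaging to merge the per-modality shared estimates; all class names, function names, and dimensions are illustrative assumptions. It shows only how a latent code can be split into one shared part and one private part per modality.

```python
import random


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of random weights (stand-in for trained weights)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def linear(weights, vector):
    """Multiply a weight matrix (list of rows) by a feature vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


class ModalityEncoder:
    """Hypothetical linear encoder whose output is split into shared and private parts."""

    def __init__(self, input_dim, shared_dim, private_dim, rng):
        self.shared_dim = shared_dim
        self.weights = random_matrix(shared_dim + private_dim, input_dim, rng)

    def encode(self, features):
        latent = linear(self.weights, features)
        # First shared_dim entries: overlap region; the rest: modality-specific region.
        return latent[:self.shared_dim], latent[self.shared_dim:]


def encode_cell(encoders, measurements):
    """Return one shared code plus one private code per modality for a single cell."""
    shared_parts, private = [], {}
    for name, features in measurements.items():
        shared_estimate, unique = encoders[name].encode(features)
        shared_parts.append(shared_estimate)
        private[name] = unique
    # Assumption: merge the per-modality shared estimates by averaging them.
    shared = [sum(values) / len(values) for values in zip(*shared_parts)]
    return shared, private


rng = random.Random(0)
encoders = {
    "rna": ModalityEncoder(input_dim=6, shared_dim=2, private_dim=3, rng=rng),
    "atac": ModalityEncoder(input_dim=4, shared_dim=2, private_dim=1, rng=rng),
}
# Made-up feature vectors for one cell, one per modality.
cell = {"rna": [0.5, 1.0, 0.0, 2.0, 0.1, 0.3], "atac": [1.0, 0.0, 0.0, 1.0]}

shared, private = encode_cell(encoders, cell)
print(len(shared), {name: len(code) for name, code in private.items()})
```

In a trained model, the shared code would be expected to hold information present in both measurements, while each private code would hold what only that modality captures. Here the weights are random, so the sketch illustrates the structure rather than any learned separation.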
A two-step training procedure was employed to enhance the model's ability to handle complex data relationships. After training, the model can accurately identify shared and unique data when presented with new cell data.
The framework demonstrated its effectiveness on both synthetic datasets and real-world single-cell datasets. It successfully separated gene activity captured jointly by two measurement modalities, such as transcriptomics and chromatin accessibility, from information specific to a single modality. Additionally, the researchers used the method to determine which measurement modality captured a protein marker indicating DNA damage in cancer patients, which can help clinical scientists choose the appropriate measurement technique.
Looking ahead, the researchers aim to enhance the model's interpretability and conduct further experiments to ensure accurate disentanglement of cellular information. They also plan to apply the model to a broader range of clinical questions. As Uhler notes, 'Our goal is to understand how different cellular components interact, which requires a careful comparison of modalities.'
This research is supported by various organizations, including the Eric and Wendy Schmidt Center at the Broad Institute, the Swiss National Science Foundation, the U.S. National Institutes of Health, the U.S. Office of Naval Research, AstraZeneca, the MIT-IBM Watson AI Lab, the MIT J-Clinic for Machine Learning and Health, and a Simons Investigator Award.