Madrid. Topazium has been recognized in the big data and artificial intelligence category at the fourth edition of the Ennova Health Awards, organized by Diario Médico and Correo Farmacéutico.
MACHINE LEARNING SYSTEMS FOR CANCER THERAPY RESEARCH
Two machine learning systems developed by Topazium, GFPrint™ and PredLung™, have been used to extract information from cancer patient data that may be relevant for therapy design. Three case studies are presented, focusing on:
i) The discovery of potential new targets and biomarkers for squamous non-small cell lung cancer.
ii) The identification of responders to lung cancer therapies.
iii) The creation of molecular signatures in cancer patients that can be leveraged for personalized medicine.
Introduction
According to the latest data from the Global Cancer Observatory of the World Health Organization (https://gco.iarc.fr/en), lung cancer was the most frequently diagnosed cancer worldwide in 2022, with nearly 2.5 million new cases, ahead of breast cancer (almost 2.3 million cases) and colorectal cancer (1.9 million cases). Lung cancer also caused the most deaths that year, 1.8 million, ahead of colorectal cancer (0.9 million deaths) and liver cancer (0.7 million deaths). These figures show that lung cancer remains an unmet medical need, requiring better early detection and diagnosis as well as more effective treatments.
Personalized medicine can improve outcomes by tailoring treatment to each patient’s condition. Designing such strategies and predicting the likely course of the disease requires a detailed understanding of patient characteristics, so it is critical to use all available clinical data. Genetic information is particularly important: as tumor genetic data become more widely available, tools are needed that can efficiently extract and analyze the large volumes generated in clinical settings. Artificial intelligence-based systems are well suited to this task.
At Topazium, we have designed and developed new machine learning (ML) systems that leverage patients’ biological information to uncover hidden characteristics that can be therapeutically exploited. One such system is GFPrint™, an ML tool designed to use patients’ genetic data for prognostic and therapeutic purposes. GFPrint™ analyzes tumor DNA sequence data to create a virtual representation of each patient, known as a synthetic state representation (SSR), within a three-dimensional space referred to as the latent space. The location of the SSRs within the latent space allows similar patients to be clustered into groups. Combined with clinical data from the same patients, this grouping yields insights that can be applied in several ways: proposing new biomarkers, predicting potential treatment responses, identifying new therapeutic targets, or uncovering patterns that help select the most suitable treatment for each patient.
Case 1: Identification of Targets and Biomarkers
GFPrint™ was utilized to analyze data from The Cancer Genome Atlas (TCGA), the largest publicly available repository of cancer patient data. Whole-exome sequencing data from tumor samples of approximately 12,000 patients, representing nearly 150 histotypes, were used; these samples were collected at the time of diagnosis.
GFPrint™ used the sequencing data to construct a latent space in which each patient’s SSR appears as a single point. This approach identified three distinct patient clusters (Figure 2A) and revealed significant differences in overall survival (OS) among them.
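The grouping step can be illustrated with a minimal sketch. It assumes each patient has already been reduced to a three-dimensional point and applies plain k-means with a deterministic farthest-point initialization. The source does not state which clustering algorithm GFPrint™ uses, so k-means is a stand-in here, and the points, the function names, and the choice of k = 3 are all illustrative assumptions.

```python
import math
import random


def farthest_point_init(points, k):
    """Pick k starting centroids: the first point, then repeatedly the point
    farthest from all centroids chosen so far (deterministic)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's k-means on tuples of coordinates.
    Returns (labels, centroids)."""
    centroids = farthest_point_init(points, k)
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return labels, centroids


def make_blob(center, n, rng, spread=0.3):
    """Synthetic 3-D points scattered around a center (not patient data)."""
    return [tuple(rng.gauss(c, spread) for c in center) for _ in range(n)]


# Hypothetical latent-space coordinates: three well-separated groups.
rng = random.Random(42)
ssr_points = (
    make_blob((0, 0, 0), 50, rng)
    + make_blob((5, 5, 0), 50, rng)
    + make_blob((0, 5, 5), 50, rng)
)

labels, centroids = kmeans(ssr_points, k=3)
cluster_sizes = [labels.count(c) for c in range(3)]
print(cluster_sizes)
```

With real data the clusters would rarely be this cleanly separated, and the number of clusters would need to be chosen and justified rather than fixed in advance.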
Case 2: Focusing the Analysis on Lung Cancer Patients
Among the lung cancer patients in this dataset, those diagnosed with stage I epidermoid (squamous) non-small cell lung cancer (eNSCLC), i.e., those with the best expected prognosis, showed significantly different overall survival (OS) depending on whether they belonged to cluster 1 or cluster 0 (p = 0.003). Strikingly, patients in cluster 0 had the poorest outcomes, with a median OS comparable to that of stage IV eNSCLC patients, who typically have the worst prognosis.
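A survival comparison of this kind is commonly made with a log-rank test. The source does not name the test behind the reported p-value, so the following is only a sketch of one standard option, written with the Python standard library. The follow-up times and event flags are invented for illustration and are not TCGA data.

```python
import math


def logrank_test(group_a, group_b):
    """Two-group log-rank test.

    Each group is a list of (time, event) pairs, where event is 1 for an
    observed death and 0 for a censored observation.
    Returns (chi2, p_value) with one degree of freedom.
    """
    everyone = group_a + group_b
    event_times = sorted({t for t, e in everyone if e == 1})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        # Patients still at risk just before time t.
        n_a = sum(1 for time, _ in group_a if time >= t)
        n_b = sum(1 for time, _ in group_b if time >= t)
        # Deaths occurring exactly at time t.
        d_a = sum(1 for time, e in group_a if time == t and e == 1)
        d_b = sum(1 for time, e in group_b if time == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0  # degenerate case: nothing to compare
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical follow-up in months: (time, event).
cluster_0 = [(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]
cluster_1 = [(30, 1), (31, 0), (32, 1), (33, 0), (34, 1), (35, 0), (36, 1), (37, 0)]

chi2, p_value = logrank_test(cluster_0, cluster_1)
print(f"chi2 = {chi2:.2f}, p = {p_value:.5f}")
```

In practice a maintained survival-analysis library would be preferable to a hand-written test, and the comparison would usually be accompanied by Kaplan-Meier curves.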
An analysis of the mutational burden in cluster 0 revealed 167 mutations, each affecting a different gene, that were unique to patients in this group, suggesting that these mutations may be responsible for the poorer prognosis (Figure 3). Of these, 25 mutations significantly influenced survival: patients carrying any one of them had worse outcomes. Further investigation showed that the genes affected by these 25 mutations play a key role in vesicle-mediated transport pathways involving Rab GTPases. This finding suggests that the pathway and its associated genes may serve as novel therapeutic targets, warranting a search for modulators that could become innovative drugs.
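The idea of mutations "unique to" a cluster can be made concrete with a small set-based sketch: collect the genes mutated in the target cluster and subtract those mutated in any other patient. The patient identifiers, gene names, and the function name below are placeholders, not data or code from the study.

```python
def cluster_exclusive_genes(mutations_by_patient, cluster_by_patient, target_cluster):
    """Return genes mutated in at least one patient of the target cluster
    and in no patient outside it."""
    in_target, elsewhere = set(), set()
    for patient, genes in mutations_by_patient.items():
        if cluster_by_patient[patient] == target_cluster:
            in_target |= genes
        else:
            elsewhere |= genes
    return in_target - elsewhere


# Placeholder data: patient -> set of mutated genes, patient -> cluster label.
mutations_by_patient = {
    "P1": {"GENE_A", "GENE_B"},
    "P2": {"GENE_B", "GENE_C"},
    "P3": {"GENE_C", "GENE_D"},
    "P4": {"GENE_D"},
}
cluster_by_patient = {"P1": 0, "P2": 0, "P3": 1, "P4": 1}

exclusive = cluster_exclusive_genes(mutations_by_patient, cluster_by_patient, target_cluster=0)
print(sorted(exclusive))
```

Each gene in the resulting set could then be tested individually for association with survival, which is one way the step from 167 to 25 mutations described above might be approached.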
Additionally, these 25 genes could form a biomarker panel to predict worse prognosis in eNSCLC patients, regardless of disease stage. While extensive experimental work is needed to validate these hypotheses, it is clear that GFPrint™ has successfully leveraged the genetic data of eNSCLC patients to provide valuable insights and generate actionable hypotheses worthy of further investigation.
Case 3: Using in Vitro Data to Design Personalized Therapies Through Molecular Signatures
One of the most promising applications of GFPrint™ lies in its potential for personalized medicine by aiding in the creation of drug molecular signatures tailored to different patient groups. These signatures are inferred from in vitro experimental data obtained with drugs tested on laboratory cell lines that share the genetic characteristics of the patients.
A recent study provided genome sequencing data for 125 cell lines derived from various pediatric tumors. These DNA sequences were processed with GFPrint™ using the previously described methodology and assigned to groups matching those identified in the TCGA samples: 104 cell lines were assigned to group 1, and 21 to group 0. The in vitro pharmacological activity of 1,766 chemical compounds, tested in the same study to evaluate their ability to inhibit cell growth, was then analyzed.
A binomial proportion test revealed non-random associations between the compounds’ effects on the 125 cell lines and their respective groups. Of the 1,766 compounds, 30 were significantly more potent in inhibiting the growth of cell lines in group 0 compared to those in group 1 (p < 0.05). Interestingly, an analysis of the mechanisms of action of these 30 compounds showed that one-third interfere with the cell cycle. This suggests that cells with genetic profiles similar to those of patients in group 0 may be more sensitive to antitumor drugs targeting the cell cycle.
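For a single compound, a comparison of this kind might look like the sketch below: a one-sided two-proportion z-test asking whether the fraction of responsive cell lines is higher in group 0 than in group 1. The group sizes (21 and 104) come from the text; the hit counts, the notion of a "responsive" cell line, and the function name are assumptions made for illustration. The exact form of the test used in the study is not described in the source.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """One-sided two-proportion z-test of H1: p_a > p_b.

    Uses the pooled standard error under the null hypothesis p_a == p_b.
    Returns (z, p_value).
    """
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # degenerate: no variation, so no evidence of a difference
    z = (p_a - p_b) / se
    # Upper tail of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical counts of cell lines whose growth a compound inhibited.
hits_group_0, n_group_0 = 15, 21    # group size from the text, hits invented
hits_group_1, n_group_1 = 30, 104   # group size from the text, hits invented

z, p_value = two_proportion_z_test(hits_group_0, n_group_0, hits_group_1, n_group_1)
print(f"z = {z:.2f}, one-sided p = {p_value:.5f}")
```

With only 21 cell lines in group 0 the normal approximation is rough, so an exact alternative such as Fisher's exact test may be a better fit for counts this small.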
Given the limitations of the data (e.g., only a single compound concentration was tested per cell line), these results and their conclusions must be handled with extreme caution and are not definitive. However, this exercise highlights the potential of GFPrint™ to leverage in vitro data for use in personalized medicine by exploiting genetic similarities between patients and the cell lines commonly used in experimental research.
Conclusions
The machine learning tools developed by Topazium, GFPrint™ and PredLung™, have proven effective in leveraging biological data from cancer patients to extract valuable insights. These include the discovery of potential new therapeutic targets and biomarkers, the advance identification of therapy responders from hematological parameters, and the development of molecular signatures that enable personalized therapies.
These tools follow a similar strategy: first, functional features are extracted from the available data to create a virtual representation of patients. Next, this virtual representation is analyzed using various unsupervised models to identify nonlinear relationships in the data, ultimately classifying patients into distinct groups with unique clinical profiles. Finally, exploration of the original data helps pinpoint critical features influencing the observed clinical differences, such as mutational burden or hematological parameters.
The tools have been tested and validated using public datasets, although these datasets have notable limitations. TCGA data come from diverse sources, and the associated clinical information may have been processed in different ways, which affects data interoperability. The SQUIRE trial dataset is centrally managed and therefore free of these inconsistencies, but its sample size is limited; ideally, the analysis would have been extended to additional patient groups. Similarly, the in vitro data obtained from tumor cells reflect the effect of a single compound concentration, which is a significant limitation. Full concentration-response data would provide more reliable insights into compound effects and pharmacology.
Despite these constraints, both tools have performed well in identifying previously overlooked critical attributes. This has facilitated the generation of robust hypotheses, whose true value must be confirmed through experimental work. We firmly believe that these machine learning frameworks are valuable tools that will drive cancer research forward, with the ultimate goal of improving patient health and well-being.