Data Mining for Biomedical Informatics Group

Our Research


The size, complexity, and variety of data sets resulting from biomedical research continue to increase at an exponential rate. Recently, various high-throughput technologies have been developed that capture a wide variety of genomic and metabolic information, such as single nucleotide polymorphisms (SNPs), gene expression, copy number variation, and metabolite and protein abundance. A major goal of analyzing such data sets is the discovery of patterns (biomarkers) in the data that are associated with disease development and prognosis. Another key goal is to better understand the underlying biology.

Whether for biomedical understanding or discovery of biomarkers, analysis of high-throughput data sets faces a multitude of challenges, including high dimensionality, heterogeneity of the attributes, missing data, imbalanced classes, and auto-correlation. Most common diseases are complex: they involve interactions between multiple biomarkers and tend to have different biomarkers in different populations. In addition, different types of high-throughput data are quite diverse in their characteristics, creating significant challenges in data integration when several types of data sets are employed. Additional challenges arise when integrating clinical data, which is increasingly available from electronic health records. With the increasing complexity of multimodal data sets, the traditional approach of testing a handful of simple hypotheses does not take full advantage of the richness of the data. New approaches are needed that can find higher-order interactions and more complex patterns. The limitations of many traditional approaches have prompted the use of techniques from data mining and machine learning for the analysis of these new data sets.

Our research group focuses on developing novel data mining and machine learning techniques to analyze a wide variety of biomedical data sets. Our most recent areas of focus include the analysis of electronic health record (EHR) data, brain scan data, and next-generation sequencing data. We collaborate with an interdisciplinary set of researchers in fields such as nursing, psychology, cell biology, and cancer genetics.