Mardochée Réveil, PhD

Interpretable Multi-Source Data Fusion Through Latent Variable Gaussian Process

Sandipp Krishnan Ravi, Yigitcan Comlek, Arjun Pathak, Vipul Gupta, Rajnikant Umretiya, Andrew Hoffman, Ghanshyam Pilania, Piyush Pandita, Sayan Ghosh, Nathaniel Mckeever, Wei Chen, Liping Wang
2/6/2024

Abstract

With the advent of artificial intelligence and machine learning, various science and engineering communities have leveraged data-driven surrogates to model complex systems through fusing numerous sources of information (data) from published papers, patents, open repositories, or other resources. However, not much attention has been paid to the differences in quality and comprehensiveness of the known and unknown underlying physical parameters of the information sources, which could have downstream implications during system optimization. Additionally, existing methods cannot fuse multi-source data into a single predictive model. Towards resolving this issue, a multi-source data fusion framework based on Latent Variable Gaussian Process (LVGP) is proposed. The individual data sources are tagged as levels of a characteristic categorical variable that is mapped into a physically interpretable latent space, allowing the development of source-aware data fusion modeling. Additionally, a dissimilarity metric based on the latent variables of LVGP is introduced to study and understand the differences in the sources of data. The proposed approach is demonstrated on and analyzed through two mathematical and two materials science case studies. From the case studies, it is observed that, compared to using single-source and source-unaware machine learning models, the proposed multi-source data fusion framework can provide better predictions for sparse-data problems.

AI-Generated Overview

  • Research Focus: The paper proposes a multi-source data fusion framework based on the Latent Variable Gaussian Process (LVGP) that improves predictive modeling by accounting for differences in data quality and comprehensiveness across sources in materials informatics.

  • Methodology: The framework incorporates both a categorical variable (the source identifier) and quantitative variables into a single predictive model. Each data source is a level of that categorical variable and is mapped to a point in a physically interpretable latent space, which makes the model source-aware.

  • Results: Across two mathematical and two materials science case studies, the LVGP framework yields better predictions than single-source and source-unaware models, particularly in sparse-data scenarios, and provides insight into the relationships between data sources.

  • Key Contribution(s): The work presents a novel dissimilarity metric based on latent variables to quantify the differences between information sources and introduces source-aware modeling capabilities that enhance predictive accuracy and interpretability.

  • Significance: This research addresses the challenges of integrating diverse data from multiple sources in materials science, aiding materials optimization through better predictions and insight into source-specific behaviors.

  • Broader Applications: While focused on materials science, the LVGP framework extends to other engineering domains with multiple data sources, where it can improve model accuracy and guide adaptive sampling in experimental design.
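
To make the Methodology and Key Contribution(s) points above more concrete, here is a minimal sketch of how a source identifier can enter a Gaussian process correlation through latent coordinates, and how distances between those coordinates can serve as a source dissimilarity measure. It is an illustration under stated assumptions, not the paper's implementation: the 2-D latent space, the squared-exponential form, the Euclidean distance used as the dissimilarity, and the names `lvgp_kernel`, `source_dissimilarity`, and `latent` are all hypothetical choices. In a real LVGP the latent positions are estimated from data; here they are hard-coded.

```python
import numpy as np

def lvgp_kernel(x1, s1, x2, s2, latent, lengthscale=1.0):
    """Squared-exponential correlation over quantitative inputs and
    the latent positions of the two data sources (assumed form)."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    d_x = np.sum(((x1 - x2) / lengthscale) ** 2)   # quantitative part
    d_z = np.sum((latent[s1] - latent[s2]) ** 2)   # source (categorical) part
    return np.exp(-d_x - d_z)

def source_dissimilarity(latent):
    """Pairwise Euclidean distance between source latent positions
    (an assumed stand-in for the paper's dissimilarity metric)."""
    diff = latent[:, None, :] - latent[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Hypothetical latent positions for three sources (not fitted values).
latent = np.array([[0.0, 0.0],
                   [0.1, 0.0],
                   [1.5, 0.8]])

D = source_dissimilarity(latent)
k_close = lvgp_kernel([0.5], 0, [0.5], 1, latent)  # sources 0 and 1
k_far = lvgp_kernel([0.5], 0, [0.5], 2, latent)    # sources 0 and 2
print(D)
print(k_close, k_far)
```

In this sketch, sources that sit close together in the latent space are treated as strongly correlated, so data from one informs predictions for the other, while a distant source contributes less. That is the sense in which a fused model can remain source-aware.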
