Data Integration

Scientific applications face unique problems that are not readily addressed by existing data management tools. Specifically, information sources often do not share a common terminology, have a variety of data representation formats and management architectures, and exhibit complex relationships between data and tools used to analyze the data.

SDSC excels at integrating data, especially big or “messy” data (think unmonitored crowdsourced data with little or no thought given to metadata needs). We understand that data arrives in many states and formats, and from many sources with different standards. Our experts specialize in all aspects of data integration, including ontology building, file formatting, setting metadata standards, coordinating space conversions, and visualization.

Software Tools: Open Earth Framework

SDSC’s Open Earth Framework visualization tools integrate world terrain maps with earthquake epicenter logs and subsurface models for the western US to reveal correlations between observed events and subsurface structures.

Use Case: Data integration in the Geosciences

Many scientific discoveries today are a result of collaborations between researchers sharing data and resources. By allowing scientists to share data and tools (services) via the web, we can enable interactions between a larger group of researchers working on a common problem.

Researchers from the Advanced Cyberinfrastructure Development group at SDSC have developed a data integration framework for the geosciences. The goal is to respond to the pressing need in the geosciences to interlink and share multi-disciplinary datasets to understand the complex dynamics of Earth systems. Creating an infrastructure to integrate, analyze, and model geoscientific data poses many challenges due to the extreme heterogeneity of geoscience data formats, storage systems, and computing systems and, most importantly, the ubiquity of differing conventions, terminologies, and ontological frameworks across disciplines. The data integration framework is distinct from other efforts in scientific data management in two respects:

  1. Resource registration strategy
    Our solution requires data and service providers to register their resources with the framework. Instead of explicitly mapping resources to each other, as is done in mediation systems, we implicitly map the sources to a common metadata framework by describing each resource with a 4-tuple: (metadata descriptions, ontology mappings, spatial extent, temporal extent). The 4-tuple converts each resource into a point in a 4D space and thereby enables efficient discovery of resources using queries formulated over that space. The ontology mappings help overcome heterogeneity in local schemas.
  2. Novel architecture
    The framework we have developed combines the capabilities of a data warehouse (data providers can store their datasets) and a data mediation system (users can design views spanning multiple distributed databases). Furthermore, by supporting integration of both data and services, our framework provides the unique capability to perform both data-driven and application-driven integration.
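
The resource registration strategy described above can be illustrated with a minimal Python sketch. It is not the framework's actual implementation: the `Resource` and `Registry` classes, the `discover` method, the bounding-box and year-range encodings of spatial and temporal extent, and the sample resources and ontology terms are all hypothetical names chosen for illustration. The sketch only shows the general idea that providers register a 4-tuple per resource, and that discovery queries are posed over those four dimensions rather than over pairwise mappings between sources.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """A registered dataset or service, described by a 4-tuple (hypothetical layout)."""
    name: str
    metadata: dict            # metadata descriptions, e.g. format or provider
    ontology_mappings: dict   # local schema field -> shared ontology term
    bbox: tuple               # spatial extent: (west, south, east, north) in degrees
    years: tuple              # temporal extent: (start_year, end_year), inclusive


def intervals_overlap(a_lo, a_hi, b_lo, b_hi):
    """True if the closed intervals [a_lo, a_hi] and [b_lo, b_hi] intersect."""
    return a_lo <= b_hi and b_lo <= a_hi


class Registry:
    """Holds registered resources and answers discovery queries."""

    def __init__(self):
        self._resources = []

    def register(self, resource):
        self._resources.append(resource)

    def discover(self, term=None, bbox=None, years=None):
        """Return names of resources matching every constraint that is given."""
        hits = []
        for r in self._resources:
            # Match on the shared ontology term, not on the local field name.
            if term is not None and term not in r.ontology_mappings.values():
                continue
            if bbox is not None and not (
                intervals_overlap(r.bbox[0], r.bbox[2], bbox[0], bbox[2])
                and intervals_overlap(r.bbox[1], r.bbox[3], bbox[1], bbox[3])
            ):
                continue
            if years is not None and not intervals_overlap(*r.years, *years):
                continue
            hits.append(r.name)
        return hits


# Two made-up resources whose local schemas name the same concept differently.
registry = Registry()
registry.register(Resource(
    name="quake_catalog",
    metadata={"format": "CSV"},
    ontology_mappings={"eq_mag": "seismology:Magnitude"},
    bbox=(-125.0, 32.0, -114.0, 42.0),
    years=(1980, 2010),
))
registry.register(Resource(
    name="global_events",
    metadata={"format": "NetCDF"},
    ontology_mappings={"magnitude_ml": "seismology:Magnitude"},
    bbox=(-180.0, -90.0, 180.0, 90.0),
    years=(2000, 2020),
))

# Both resources are found through the shared term, despite different field names.
matches = registry.discover(
    term="seismology:Magnitude",
    bbox=(-120.0, 33.0, -117.0, 35.0),
    years=(2005, 2006),
)
print(matches)
```

A linear scan is used here only for brevity; a real registry over many resources would more plausibly rely on spatial and temporal indexes to answer such queries efficiently.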

To explore a service engagement or request further information, please visit our Industry Partners Program webpage or contact us at