Business Data and Analytics Articles

Integrating Multiple Data Sources | Data Integration Techniques

Written by Brian Cook | Jan 5, 2023 10:50:00 AM

In our blog “How to Wrangle Your Data Into a Powerful Single Source of Truth,” we discussed the methodologies for consolidating multiple data sources. Our focus was on creating the so-called “golden record” that everyone in the organization can rely on for the best data-driven decisions.

Answering how to do that leads to a deeper, more conceptual approach: How do you bring disparate data sources together? Where do they come from, and how can you leverage that approach to support your data analytics?

The short answer is that there are many methods and technologies to bring data sources together. The best starting point is an assessment to understand your business and users' needs. That assessment can provide alternatives and, ultimately, drive consensus on one or more approaches that best meet the needs and goals of IT and the rest of the organization.

About this Blog

To provide an appropriate context here, we will focus on the output of various data integration approaches that can be used to bring your data sources together (ETL/batch or real-time). Those approaches will be employed in decisions on the input in support of your analytics (tables, fields in the data warehouse/lake/marts).

There are both physical and logical solutions that can be considered. They are the industry-recognized data access/store approaches for analytics consumption. Accordingly, this blog will focus on the following key points regarding data integration approaches: 

  1.  Physical data stores—Moving data into data warehouses, data lakes, and data marts

  2.  Logical solutions for data integration (i.e., the data does not move)

  3. The pros and cons of the logical (or virtual) vs. a physical data integration approach

Physical (Migrating) Data Sources

Physical data integration is data moving from one location to another: replication, relocation, etc. It is physical in the sense that once it is removed from its original source, it takes on a new form, especially when logic/transformation is applied to the data along the way. That logic serves to manipulate the data and satisfy unique analytics requirements. 

There are three types of physical data stores:

1. Data Warehouse

A data warehouse can store both transactional (e.g., purchase data) and aggregated data (e.g., total purchase amount for a given day). So, the aggregation has a dimensional aspect.

Two different data aggregation approaches can be used to build a data warehouse, time and spatial. Also, data warehouses are typically organized by subject, such as financial, supplier information, marketing, etc.

The Case for Using a Data Warehouse

A data warehouse stores both transactional and aggregated data. The data can be either batch (using ETL), which is typically point-in-time, or (near)  real-time.  Businesses can start looking at their analytics at the top level using aggregated data. They can then drill down into another level (transactional data).

So, the data warehouse provides a top-down view of the business. It can be an accumulation of data from multiple sources and can serve as an accumulation point for data from multiple sources for use as a single source of truth.

2. The Data Lake “Data Sandbox”

This physical source typically stores the raw data. Unlike a data warehouse, where data quality can be enhanced, a data lake only houses the raw data.

The Value of Data Lakes

Data scientists typically utilize data lakes to help confirm hypotheses and develop new approaches to view the data. Data scientists don’t know the answers ahead of time. Instead, they find answers through trial and analysis with formulas of how the data goes together and ways to use it moving forward. Data lakes can typically be used to support automated intelligence and machine learning

Data lakes, however, can still employ batch (extract methodologies) to move their data: 

- Batch: extracted and loaded more so than transformed 

- Real-time

3. Data Marts

Data marts are often an extension of a data warehouse or data lake. The purpose of the data mart can be application or subject-specific and is built for performance. 

Data marts are based on facts and dimensions. Their fact content could be the price of something, whereas the dimension could be the brand, the year it was sold, etc. Dimensions are like attributes of facts.

Data marts also allow you to further aggregate already-aggregated data. The source of the mart could be from either the data warehouse or data lake—or both.

Logical Data Sources

A logical data solution, also known as data virtualization, is a logical data warehouse, logical data lake, or logical data mart. Data virtualization is a virtual representation of physical approaches. But in addition to accessing source data–similar to the physical approaches–it may also use existing data warehouses, data lakes, and data marts as input.

The difference is that the data isn’t moving. Again, it is a virtual representation of the data. It creates the manifestation of the logic in the output. The logical model you’re building could be any of the previously mentioned physical data models. The goal is the same–getting a result set from an analytics point of view.

Virtual vs. a Physical Data Aggregation Approach

A physical solution is a traditional approach that’s been used for 30+ years. The virtual approach is conceptually newer and lags behind the physical approach in industry use and acceptance.

Advantages of the virtual approach:

  • The data arrives faster because you don’t move the data; you only access it.

  • There is less replication and the need to store converted data—resulting in time and cost savings.

  • Reduced software license costs--IT doesn’t need all of the tools they would use to replicate and store the data.

  • The virtual approach requires fewer resources to manage, as there aren’t as many moving parts.

Disadvantages of the virtual approach:

With the virtual approach, you should use caution when considering accessing source data, as it could have performance implications on the system that manages that data. Many organizations impose limitations on what applications or software may be used to access source data. With this in mind, it still may be required to physically move data to another location, which could be a data warehouse, data lake, or data mart, before applying virtualization.

Summary

The formulation of the preceding data integration solutions is highly conceptual in nature. The best integration solution requires someone who has practiced these approaches time and again to choose the appropriate solution for your business needs. It is, therefore, critical to work with an experienced group who can guide you through your options and how each option works together--or doesn’t.

Discover how you can begin your journey of successfully integrating your business data by downloading the eBook The Comprehensive Handbook to Breaking Down Data Silos & Transforming Business Intelligence.