What is an enterprise data warehouse?
“An enterprise data warehouse (EDW) is a system used for reporting and data analysis, and is considered a core component of business intelligence. (E)DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise.” - Wikipedia
One of the biggest problems with enterprise data warehouses (EDWs) is that they are used for workloads they were never designed for, which drives up their cost for many organisations.
EDWs generally run on specialised hardware, so their storage costs far more than that of other storage solutions. If a company has no separate infrastructure for data integration tasks, such as processing large volumes of data through extract, transform, and load (ETL) jobs, that work tends to be lumped into the EDW. This compromises the EDW’s performance and the speed at which customers can retrieve the analytical data they’re looking for.
Why an EDW alone can be problematic
EDWs & Data Quality
Data has to be properly structured before it is loaded into the EDW. Because data is often loaded from multiple sources, such as IoT devices or social media, much of it is semi-structured or unstructured. Loading it without proper structuring can result in inaccurate reporting and analysis. Maintaining accuracy means spending time hand-coding transformations for that data. Otherwise, once staff learn that the analysis is inaccurate, they can lose confidence in the EDW, and the investment in it is wasted.
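As a rough illustration of what that structuring step involves, here is a minimal Python sketch that coerces semi-structured JSON events into a fixed set of columns and sets aside anything that does not fit. The schema, the column names and the nested "reading" field are all hypothetical, chosen only for the example; a real pipeline would use your own schema and tooling.

```python
import json
from typing import Any

# Hypothetical target schema for the example: column name -> required type.
SCHEMA = {"device_id": str, "temperature_c": float, "recorded_at": str}


def normalise(raw: str) -> dict[str, Any]:
    """Parse one JSON event and coerce it to SCHEMA, or raise ValueError."""
    event = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(event, dict):
        raise ValueError("event is not a JSON object")

    # Assumption: some sources nest measurements under "reading".
    # Lift those keys to the top level so every row has the same shape.
    flat = dict(event)
    nested = event.get("reading")
    if isinstance(nested, dict):
        flat.update(nested)

    row = {}
    for column, expected in SCHEMA.items():
        value = flat.get(column)
        if value is None:
            raise ValueError(f"missing column: {column}")
        try:
            row[column] = expected(value)
        except (TypeError, ValueError) as exc:
            raise ValueError(f"bad value for {column}: {value!r}") from exc
    return row


def split_batch(lines: list[str]) -> tuple[list[dict[str, Any]], list[tuple[str, str]]]:
    """Return (clean rows, rejected lines with the reason for rejection)."""
    rows, rejects = [], []
    for line in lines:
        try:
            rows.append(normalise(line))
        except ValueError as exc:
            rejects.append((line, str(exc)))
    return rows, rejects
```

Keeping the rejected lines together with a reason, rather than silently dropping them, is one way to make quality problems visible before they reach a report.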
EDWs & Data Governance
EDWs lack the ability to provide data lineage. Knowing who owns which data, and where it came from, is essential under many compliance regulations. It’s also a way to ensure no one has tampered with the data.
Of course, this information can be added by attaching tools to the workflow, but the step is sometimes skipped on the assumption that it can be done later, which might never happen. If there’s no knowledge of data ownership, how do you know who is accountable when there’s an issue?
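To make the idea concrete, the sketch below shows one simple way to record ownership and origin at load time, using only the Python standard library. The LineageRecord fields and the function names are assumptions made for this example, not the API of any particular governance tool. The checksum only helps detect tampering if the lineage record itself is stored somewhere the data's editors cannot change.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageRecord:
    """Hypothetical lineage metadata captured once per loaded batch."""
    source_system: str
    owner: str
    loaded_at: str
    row_count: int
    checksum: str


def fingerprint(rows: list[dict]) -> str:
    """SHA-256 of a canonical JSON encoding of the batch."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def describe_batch(rows: list[dict], source_system: str, owner: str) -> LineageRecord:
    """Capture who owns the batch, where it came from and what it contained."""
    return LineageRecord(
        source_system=source_system,
        owner=owner,
        loaded_at=datetime.now(timezone.utc).isoformat(),
        row_count=len(rows),
        checksum=fingerprint(rows),
    )


def is_untampered(rows: list[dict], record: LineageRecord) -> bool:
    """True if the batch still matches the checksum recorded at load time."""
    return fingerprint(rows) == record.checksum
```

Capturing the record as part of the load itself, rather than as a later add-on, avoids the "we'll add it later" gap described above.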
So what’s the solution?
Why You Should Offload Data Integration & ETL Workloads from EDWs
Many organisations are now looking into cost-effective ways to optimise and offload the integration workloads that were never meant to run on an EDW. One of those solutions is Hadoop.
What is Hadoop?
“Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.” - SAS
However, Hadoop and similar platforms need to be seen as more than just a cost-effective place to offload ETL workloads and unused data from expensive EDWs. It’s vital to optimise your ‘data lake’ by following best practices so that the results are useful and usable.
Best Practices for Optimising Your Enterprise Data Warehouse
When choosing a platform, invest in one that provides the data quality and data governance your company needs to build trust throughout your workforce. It should also integrate various data sources and formats, accommodate the future requirements of your business, increase productivity, maintain high-quality data and provide actionable insights.
By following best practices you could see plenty of benefits. If you offload ETL workloads from EDWs, you:
- Have a more cost-effective platform
- Free up expensive EDW resources
- Should see warehouse performance improve for the normal workload and for the reporting and analytics it was designed for
If you successfully offload your data integration workloads and unused data from expensive EDWs to platforms like Hadoop, you can do the following:
Store data for longer periods of time.
Hadoop can store new data alongside rarely used or historic data, so you can keep running analysis and integration tasks on it rather than paying more to scale up EDW storage.
Incorporate more data and additional data sources.
Offloading your data integration workload to Hadoop means you can add new data regardless of whether it’s structured or unstructured data or a combination of both.
Ensure correct levels of data governance by directly incorporating data quality controls.
To ensure trust in the data you should incorporate the correct data governance and quality controls. Knowing the lineage and ownership of your data means when any issue arises IT can find the origin.
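To illustrate the first of these points, deciding which data to move out of the EDW can start from simple usage statistics. The sketch below assumes you can obtain, for each table or partition, the date it was last queried; the partition names, the function name and the one-year threshold are hypothetical and only serve the example.

```python
from datetime import date, timedelta


def partitions_to_offload(
    last_queried: dict[str, date],
    today: date,
    max_idle_days: int = 365,
) -> list[str]:
    """Return the partitions not queried within the last max_idle_days days."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(
        name for name, last in last_queried.items() if last < cutoff
    )


# Hypothetical usage statistics for three partitions.
usage = {
    "sales_2019": date(2021, 3, 1),
    "sales_2022": date(2023, 12, 1),
    "sales_2024": date(2024, 5, 30),
}
candidates = partitions_to_offload(usage, today=date(2024, 6, 1))
```

A stricter threshold, such as max_idle_days=90, would mark more partitions as candidates; the right value depends on how often historic data is actually needed in reports.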
Make sure the platform you choose meets your current needs whilst also being able to support your future requirements. For example, you might want to introduce artificial intelligence (AI) and machine learning (ML) into your organisation, or you might require support for open-source software projects or multicloud environments with connectors and APIs.
Optimising your enterprise data warehouse means offloading the data integration and ETL workloads that are better managed on other platforms. Doing so is essential if you want a cost-effective, future-proofed, reliable and trusted data lake for analysis, one that can provide the actionable insights to move your company ahead of your competition.