It happened again.
One of your executives or analysts needed some data.
They didn’t have the data on-hand so they put in a request with the data team.
The data team wasn’t able to quickly and easily access the requested data either.
Instead, they had to spin up a brand new project just to retrieve it.
The project was slow, complex, and expensive in both human and material costs.
Worst of all, by the time the data team completed the project and retrieved the requested data, that data was stale and unusable for its intended purpose.
Your business had to write off all the wasted effort that went into the project.
And then, a new request for data came in, and the cycle repeated.
How to Break the Cycle and Make Data Accessible Again
This cycle is becoming all too common.
Data has become a primary driver for nearly every business.
Businesses are collecting more data than ever and using that data to power more and more of their customer-facing products and their internal decision-making processes.
But as businesses collect and attempt to leverage more and more data, they are struggling to access that data in a timely manner to derive competitive advantage from it.
And it’s not their data teams’ fault. Businesses are hiring the best of the best, and their teams are working overtime to make the data they collect usable. Data is simply growing too big, too fast, and too diverse for any data team to wrangle with ad-hoc projects.
The problem is not a lack of talent. It’s far more fundamental — and so is the solution.
In this article, we will point to a new solution approach that can make data more accessible.
To do so, we will explore:
- Why organizations are experiencing this accelerating data access crisis.
- Why current market approaches to solving this crisis won’t work.
- What a more effective and sustainable solution approach looks like.
Why Data Access Has Become a Challenge Again
Before we begin, let’s make one point clear — data access is not a new challenge.
Data has always been distributed and businesses have always been challenged to find a way to access their data in a fast and easy manner, no matter where it lived.
But for a long time the data access challenge was effectively solved.
In the 1970s a single standard for data access emerged: Structured Query Language, or SQL. It was later joined by standards such as ODBC and JDBC for invoking SQL queries from applications.
SQL offered a “good enough” solution to the data access challenge for 40 years, as long as the data remained in a single database or data warehouse. But over the last five to ten years, four major trends have transformed the data landscape. They have turned a monolithic SQL interface into an insufficient solution to the data access challenge, and they have turned that once-isolated challenge into a business-wide crisis.
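To make the “single database, single interface” era concrete, here is a minimal illustration using Python’s standard-library sqlite3 module as a stand-in for the relational databases of that period. The table and query are invented for exposition; the point is that one declarative SQL statement was the entire access path.

```python
# When all data lives in one SQL database, a single standard query
# interface answers any request. sqlite3 (Python stdlib) stands in
# for the relational databases of that era; the schema is illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EMEA", 120.0), (2, "APAC", 80.0), (3, "EMEA", 50.0)],
)

# One declarative query, one access path: the "good enough" solution.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('APAC', 80.0), ('EMEA', 170.0)]
```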
First, data has become far more important to the business.
Today, almost every business uses data to power almost all of its products and decisions. Businesses need more data than they used to, and they suffer bigger top-line consequences when that data is inaccessible to them.
As a result, there’s a lot more executive-level visibility into the challenge of data access and a lot more pressure put on data teams to solve that challenge. The data access challenge used to be a backend issue. Now it’s become front-and-center.
Second, businesses are using new types of data.
Over the last 10 years, new data models and types have emerged that can’t be accessed through the single SQL-based approach that served as the standard for over forty years.
For example, streaming data has become increasingly important, but it does not fit the relational model, and it is not generally amenable to SQL.
Third, businesses are running new types of data-driven workloads.
Over the same period, the big data phenomenon has taken hold. Businesses now leverage more data than their standard systems can effectively manage and maintain.
That data is also being accessed far faster. In the past, businesses used their data to run reports once per day, or once per week. Now, businesses run real-time workloads, and they need real-time data access to power them.
Finally, businesses are adopting new data deployment platforms.
While business data was never centrally located, it used to be stored in a far simpler data architecture. A handful of relational databases dominated (Oracle, IBM DB2, and Microsoft SQL Server), and businesses used some combination of them.
All of that has changed. Businesses have adopted a wide variety of new platforms to store their growing volume and variety of data. Today, businesses are building data storage architectures from dozens of platforms. And they are rapidly migrating the majority of their data and workloads to the Cloud.
In sum: The nature of data has changed and so has the challenge of accessing it.
Old solutions to solving the data access challenge no longer work in today’s world.
And, unfortunately, the most common new solution we’re seeing won’t work either.
The Problem with Current Solutions to the Data Access Challenge
As a VC firm, we speak with a lot of businesses about the data access challenge. Many of them have converged on the same solution approach.
They typically see two things:
1. They have been moving more of their data to Cloud storage platforms.
2. Those platforms have been expanding to include data access features.
Combining these observations, often with nudging from their vendors, businesses are toying with the idea of centralizing all of their data into one of their Cloud storage platforms, and then using that platform for all of their data access too.
At first, this looks like a viable solution. If a business could finally centralize their data storage, then their data access would naturally centralize too.
But once we dug deeper into the approach, a few issues emerged.
- Migrating everything to one platform is a non-trivial effort. Businesses have accumulated years and years of data on their existing platforms, and their applications speak directly to those platforms. Many companies have also inherited additional data stores and platforms through acquisitions. Moving all of that to the Cloud will be cost- and time-prohibitive for many businesses.
- Not all data and workloads are well supported by Cloud platforms. Most of these Cloud platforms are good for storing analytical data, but they were not specifically designed for real-time analytical workloads, so they can experience performance loss when certain workloads are migrated to them at scale.
- Vendor lock-in remains a very valid concern. These Cloud platforms are closed systems. Once a business loads their data into the platform, it can become the primary, and in some cases only, data access point. Businesses have struggled with vendor lock-in before, and they don’t want to repeat the mistake.
These are big issues, and they will be insurmountable for many businesses.
While we understand why businesses might want to finally centralize their data storage, we also must acknowledge that this approach has never been viable, and modern Cloud platforms have not changed this hardwired reality of the data world.
Instead, we must develop an updated approach to perform standardized data access — one that is aligned to the ongoing transformation of our data landscape.
Here’s what the most effective updated approach we have found looks like.
Solving Today’s Data Access Challenge: A New Architecture
The most promising solution approach we have found is the Distributed Data Mesh.
The concept is simple. Instead of gathering all of their data into a single storage platform, a business can leave their data where it currently lives, and create a “mesh” that sits on top of these sources and makes their data accessible through an API.
This mesh establishes a foundation where all of the business’ data is now accessible in one place without moving that data. To fill out the architecture of this mesh, the business can then build multiple individual data access capabilities within it.
Essentially, this mesh will then form a new horizontal layer that sits above the business’ data platforms, integrates all of their data, and performs access functions on them as if they were all a single data platform — without moving the data at all.
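The horizontal-layer idea above can be sketched in a few lines of Python. This is a toy model, not a real product API: the `DataMesh` class, its methods, and the mock sources are all invented for illustration. The point it demonstrates is that heterogeneous sources can sit behind one uniform access interface while the data itself never moves.

```python
# A minimal sketch of the mesh concept: a thin layer that registers
# heterogeneous sources behind one access API without moving any data.
# All class and method names here are illustrative assumptions.
from typing import Any, Callable, Dict, Iterable


class DataMesh:
    """Routes queries to registered sources; the data stays in place."""

    def __init__(self) -> None:
        self._sources: Dict[str, Callable[[str], Iterable[Any]]] = {}

    def register(self, name: str, query_fn: Callable[[str], Iterable[Any]]) -> None:
        # Each source contributes an adapter that answers queries in place.
        self._sources[name] = query_fn

    def query(self, source: str, expr: str) -> list:
        # One uniform entry point, regardless of where the data lives.
        return list(self._sources[source](expr))


# Two mock "platforms": a warehouse table and a stream buffer.
warehouse = {"sales": [("EMEA", 120.0), ("APAC", 80.0)]}
stream_buffer = [{"event": "click", "ts": 1}, {"event": "view", "ts": 2}]

mesh = DataMesh()
mesh.register("warehouse", lambda table: warehouse[table])
mesh.register("stream", lambda kind: (e for e in stream_buffer if e["event"] == kind))

print(mesh.query("warehouse", "sales"))  # [('EMEA', 120.0), ('APAC', 80.0)]
print(mesh.query("stream", "click"))     # [{'event': 'click', 'ts': 1}]
```

Real implementations push the query down to each platform’s native engine rather than pulling records through Python, but the shape of the abstraction is the same.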
This mesh gives businesses the ability to perform data access functions across their diverse, dynamic, and distributed data landscape. Most useful of all, those functions can automatically scale and fold in new data sources as the business spins them up or acquires them through acquisitions and third-party integrations.
The mesh itself is flexible and extensible, and can host any number of data access functions. However, we believe there are three core data access functions that any Distributed Data Mesh must incorporate. They are:
- Data Cataloguing and Discovery: Businesses must be able to see all of the data sources, types, and units they have available to them across all of their different data sources. To do so, their mesh must be able to create a catalogue of all of the data that it encompasses, and that catalogue must be searchable.
- Data Governance: Businesses must be able to know which of their datasets they can access and for which purposes based on whatever regulations and compliance requirements they are subject to. To do so, their mesh must be able to perform access policy management on all of the data within it.
- Data Pipeline Building: Businesses will still need to move data from one platform beneath their mesh to another. To do so, their mesh must provide tools to build pipelines that extract data from one storage platform, load it into another, and then transform it for use there.
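The three functions above can be sketched compactly. Everything here is a deliberately simplified assumption for exposition (the catalogue entries, role names, and pipeline steps are invented); real mesh tooling would back each piece with a production-grade catalogue, policy engine, and orchestrator.

```python
# Illustrative sketches of the three core mesh functions. Names and
# structures are assumptions for exposition, not a standard interface.

# 1. Cataloguing and discovery: a searchable index of what each source holds.
catalogue = {
    "warehouse.sales": {"type": "table", "fields": ["region", "amount"]},
    "stream.clicks": {"type": "stream", "fields": ["event", "ts"]},
}

def discover(term: str) -> list:
    """Return catalogue entries whose name or fields mention the term."""
    return [
        name for name, meta in catalogue.items()
        if term in name or term in meta["fields"]
    ]

# 2. Governance: access policy management over the catalogued datasets.
policies = {"warehouse.sales": {"analyst"}, "stream.clicks": {"analyst", "engineer"}}

def can_access(role: str, dataset: str) -> bool:
    return role in policies.get(dataset, set())

# 3. Pipeline building: extract from one platform, load into another,
# then transform the data where it now lives (an ELT-style flow).
def run_pipeline(extract, load, transform):
    load(extract())   # move raw records between platforms
    transform()       # reshape them on the target

source = [("EMEA", 120.0), ("APAC", 80.0)]
target: list = []
run_pipeline(
    extract=lambda: list(source),
    load=target.extend,
    transform=target.sort,
)

print(discover("region"))                         # ['warehouse.sales']
print(can_access("engineer", "warehouse.sales"))  # False
print(target)                                     # [('APAC', 80.0), ('EMEA', 120.0)]
```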
By creating a Distributed Data Mesh, loading it with solutions that perform these three functions in a distributed manner, and ensuring that it is high-performing, businesses will be able to quickly and easily access the data in their environment without the substantial effort of changing that data’s location or structure.
Bringing the Distributed Data Mesh Architecture to Life
Now, the Distributed Data Mesh is not a perfect concept, and it’s not fully mature yet.
There is no single solution on the market that creates these meshes and outfits them with all of the core capabilities required to access modern data landscapes. Businesses must assemble these capabilities from a variety of existing solutions, many of which are open source projects.
And even when businesses do assemble a Distributed Data Mesh, the mesh itself carries some downsides. Additional solutions on the market address these downsides by making the mesh simpler, easier to operate, and more performant, and we consider these solutions essential as well.
One of our portfolio companies — Ampool — focuses on these challenges, and has developed a platform that manages the mesh and ensures data engineers and analysts can quickly and securely access the data it makes available. We suspect many more data mesh solutions will appear soon.
However, even in its current immature state we believe in the concept of the mesh. We believe it holds the solution to the problem of data access in modern data landscapes, and we believe the tools that create and populate this mesh represent the next wave of meaningful technology growth.
All of the trends that created the need for the Distributed Data Mesh — including the growing volume of data stores, types, and demands — are only going to increase and accelerate. By establishing a mesh-based architecture now, businesses will both solve their existing problems and future-proof their infrastructure against this ongoing wave of transformation.
If you agree, or if you would simply like to learn more, reach out today. If you are a startup developing mesh-based solutions, reach out and we can discuss whether you might fit into our portfolio of investments in this emerging area.