27 June 2023

LLMs and the Data Stack

The development of Large Language Models (LLMs) has created exciting opportunities to revolutionize modern data stacks.



LLMs allow enterprises to be more nimble in leveraging data and to make better business decisions, giving them a competitive advantage. They have the potential to change the architecture of data stacks, with implications for data lineage, data quality, data security, and observability. While this poses a threat to companies that address these issues using “pre-LLM” technology, it creates an opportunity for founders to rethink solutions around the power of LLMs.


A modern data stack is a set of tools and technologies used by organizations to store, process, and analyze data. It typically includes cloud-based platforms and databases optimized for specific data types, such as NoSQL databases, graph databases, and, more recently, vector databases. These databases are complemented by tools for cataloging, lineage, quality, governance, and observability. Analytical and business intelligence tools build on these databases and tools to deliver insights.


LLMs, on the other hand, are pre-trained artificial intelligence models capable of interpreting and generating human language at scale. They can automate tasks, draw inferences, and generate documents, and they can be used to inform business decisions. Combining the power of LLMs with modern data platforms therefore holds great promise.


The integration of LLMs into modern data stacks has led to more advanced natural language processing, better customer experiences, and improved business outcomes. LLMs have made it easier to process and analyze large amounts of data quickly, freeing up analysts and data scientists to focus on higher-level analysis and decision-making. They have also improved the accuracy of text analytics, enabling organizations to extract meaning from unstructured data sources such as social media and customer feedback. Querying databases in natural language through an LLM lets business users who may not know SQL draw on data when making business decisions. Additionally, LLMs can improve the quality of predictive models by surfacing correlations and patterns that may not be apparent to human analysts.
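As a rough sketch of the natural-language querying idea, the snippet below routes a question through a translation step and runs the resulting SQL against an in-memory SQLite database. Everything named here is an assumption made for illustration: translate_to_sql is a hypothetical stand-in for a real LLM call and only knows one canned question, and the orders table, is_read_only, ask, and demo_connection are invented for the example.

```python
import sqlite3


def translate_to_sql(question: str) -> str:
    """Hypothetical stand-in for an LLM call that turns a question into SQL.

    A real system would send the question and the table schema to a model;
    here a single canned translation keeps the sketch runnable offline.
    """
    canned = {
        "how many orders did each customer place?": (
            "SELECT customer, COUNT(*) AS n FROM orders "
            "GROUP BY customer ORDER BY customer"
        ),
    }
    return canned[question.strip().lower()]


def is_read_only(sql: str) -> bool:
    """Guardrail: accept only a single SELECT statement."""
    stripped = sql.strip().rstrip(";").strip()
    return stripped.lower().startswith("select") and ";" not in stripped


def ask(conn: sqlite3.Connection, question: str) -> list:
    """Translate the question, check the SQL, and return the result rows."""
    sql = translate_to_sql(question)
    if not is_read_only(sql):
        raise ValueError("refusing to run non-SELECT SQL: " + sql)
    return conn.execute(sql).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory database with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("acme", 120.0), ("acme", 80.0), ("globex", 45.5)],
    )
    return conn


if __name__ == "__main__":
    print(ask(demo_connection(), "How many orders did each customer place?"))
```

The read-only check is there because generated SQL should be treated as untrusted input; a production setup would likely also restrict database permissions rather than rely on string inspection alone.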


The integration of LLMs is bringing significant changes to the architecture of modern data stacks. Enterprises are looking for ways to use LLMs to improve data quality, cleaning, and pre-processing. Using the full power of LLMs requires adjustments to the underlying databases that support them. One emerging category is the vector database. Vector databases can be used as repositories of inferences from LLMs. They can also support AI-driven detection and prevention of security attacks. It is still unclear whether vector databases will become a full-blown, separate branch of the database market in the long term, given parallel and somewhat contradictory shifts in the broader database segment.
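To make the idea of a vector database more concrete, here is a minimal sketch of its core operation: storing vectors and returning the nearest ones by cosine similarity. It is a toy in-memory illustration, not how any particular product works; the class InMemoryVectorStore, the cosine helper, and the two-dimensional sample vectors are all made up for the example, and real systems use approximate indexes to scale this search.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Hypothetical toy store: brute-force nearest-neighbor search."""

    def __init__(self):
        self._items = []  # list of (item_id, vector) pairs

    def add(self, item_id, vector):
        self._items.append((item_id, list(vector)))

    def search(self, query, k=1):
        """Return the ids of the k stored vectors most similar to the query."""
        scored = [(cosine(query, vec), item_id) for item_id, vec in self._items]
        scored.sort(reverse=True)
        return [item_id for _, item_id in scored[:k]]


if __name__ == "__main__":
    store = InMemoryVectorStore()
    store.add("a", [1.0, 0.0])
    store.add("b", [0.0, 1.0])
    store.add("c", [1.0, 1.0])
    print(store.search([0.9, 0.1], k=2))
```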


As explored by unstructured.io and illustrated in the diagram below, a new technology stack has emerged that is specifically designed to leverage the full capabilities of LLMs. This new stack comprises four essential components: a data pre-processing pipeline, an embeddings endpoint with a vector store, LLM endpoints, and an LLM programming framework. The data pre-processing pipeline is adapted to the requirements of language models, including language-specific pre-processing steps, data augmentation techniques, and the generation of embeddings. The embeddings endpoint and vector store supply LLMs with the embeddings they need for text generation and semantic search tasks. The LLM endpoint processes text using pre-trained language models, enabling various NLP tasks. Finally, the LLM programming framework facilitates the development, deployment, and management of LLM-based applications, providing an interface for developers to interact with the models and customize them for specific use cases.
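The four components described above can be sketched end to end in a few lines. This is a deliberately simplified illustration under stated assumptions, not the stack as unstructured.io implements it: the word-count "embedding" stands in for a real embeddings endpoint, llm_endpoint is a stub that returns a placeholder string instead of calling a model, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter


def preprocess(document, max_words=12):
    """Component 1: split raw text into small chunks of at most max_words."""
    words = document.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Component 2a: toy embedding (word counts); a real one would call a model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_store(chunks):
    """Component 2b: the 'vector store' is just a list of (chunk, embedding)."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(store, question, k=1):
    """Return the k chunks whose embeddings are closest to the question."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: similarity(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def llm_endpoint(prompt):
    """Component 3: stub for a hosted model; returns a placeholder string."""
    return "(stub completion for a prompt of %d characters)" % len(prompt)


def build_prompt(context, question):
    """Component 4, part 1: the framework assembles context and question."""
    return "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question


def answer(store, question):
    """Component 4, part 2: retrieve, build the prompt, call the endpoint."""
    return llm_endpoint(build_prompt(retrieve(store, question), question))


if __name__ == "__main__":
    store = build_store([
        "invoices live in the billing warehouse",
        "support tickets live in the helpdesk system",
    ])
    print(retrieve(store, "where are invoices stored?"))
    print(answer(store, "where are invoices stored?"))
```

The point of the sketch is the division of labor: pre-processing and embedding happen once at ingestion time, while retrieval and prompt assembly happen per question, which is the part an LLM programming framework typically orchestrates.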

As LLMs continue to advance, some organizations, especially those in highly regulated industries, have taken steps to prevent the exposure of their proprietary company data and the private information of their employees, partners, and customers to LLMs. The concern applies equally to proprietary providers such as OpenAI and to open-source LLMs such as the recently released Dolly model by Databricks. There is potential for significant security breaches and exposure unless tools exist that obfuscate proprietary data in a way that still lets LLMs generate meaningful answers while meeting confidentiality requirements. Standard encryption techniques will not work, because LLMs cannot make sense of encrypted input. As such, new methods and techniques are necessary.

Enterprises may also consider using private LLMs developed behind their firewall and trained on public and proprietary data to generate more accurate and more secure answers to business questions. Another reason to strongly consider private LLMs is that it may prove impractical or impossible to move large amounts of data to where the LLMs are hosted. However, developing and using private LLMs requires a level of expertise that not all companies possess. This is an opportunity for founders to create technologies that reduce the complexity of developing private LLMs.
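One simple form of obfuscation that keeps text usable by an LLM is pseudonymization: swapping sensitive terms for neutral placeholders before the text leaves the organization and restoring them in the response. The sketch below is an illustration under assumptions, not a method the article prescribes: the function names, the placeholder format, and the sample sentence are invented, the list of sensitive terms is assumed to be known in advance, and the approach carries no formal privacy guarantee.

```python
def pseudonymize(text, sensitive_terms):
    """Replace each sensitive term with a placeholder; return text and mapping."""
    mapping = {}
    masked = text
    # Longest terms first, so a short term never splits a longer one.
    ordered = sorted(sensitive_terms, key=len, reverse=True)
    for i, term in enumerate(ordered):
        token = "<ENTITY_%d>" % i
        if term in masked:
            masked = masked.replace(term, token)
            mapping[token] = term
    return masked, mapping


def restore(text, mapping):
    """Put the original terms back into text returned by the model."""
    for token, term in mapping.items():
        text = text.replace(token, term)
    return text


if __name__ == "__main__":
    original = "Acme Corp owes Jane Doe 5000 USD."
    masked, mapping = pseudonymize(original, ["Acme Corp", "Jane Doe"])
    print(masked)                    # what would be sent to the LLM
    print(restore(masked, mapping))  # what the user sees afterwards
```

Unlike encryption, the masked sentence keeps its grammatical structure, so a model can still reason about who owes what to whom without seeing the real names.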

In conclusion, the integration of LLMs into modern data stacks can unlock new frontiers and have a transformative impact across various domains. Enterprises are adapting their approaches and leveraging the power of LLMs to achieve improved business outcomes. However, it is crucial to approach the integration of LLMs responsibly and to ensure robust security practices are in place. The future of data stacks lies in the integration of LLMs, and organizations that embrace this integration are likely to stay ahead in the rapidly evolving data landscape.