How Unstructured Is Powering the LLM Data Stack

Brian Raymond’s always been the person people call when they need access to elusive and esoteric data. 

After an intelligence career in the CIA, Brian went to work at the White House, where he helped President Barack Obama and Vice President Joe Biden make sense of both classified and unclassified intelligence related to terrorism and the Middle East. In that job, he’d tell us over a dinner in San Francisco a few years later, he was responsible for digging up and synthesizing information from mission documents, intelligence briefings and quantitative reports in a way that POTUS could quickly digest and act on. It’s not so far off from his day job today, as the CEO of Unstructured. 

Unstructured unlocks enterprise data that can be used to train, fine-tune, augment and ground large language models (LLMs) that benefit employees and customers. “I want people to think about Unstructured as the ‘easy button,’ or the gateway to using data that’s important to you with LLMs,” Brian says of the company he founded in 2022. It’s common for engineers to build one-off pre-processing pipelines for each new LLM use case, a brittle and time-consuming task. Brian describes Unstructured as automating the least exciting, but most time-consuming, area of the LLM stack. It’s absolutely vital to developing enterprise AI, he says, but is also “nightmarishly difficult.”

As it turns out, data pre-processing for LLMs is a universal problem for companies. Unstructured saw over 800,000 downloads in the spring of 2023, and is now used in over 2,500 projects on GitHub. Brian credits the company’s early traction to its willingness to do the “dirty job” of data transformation that many are hesitant to touch. “It played to our benefit that we’re unsexy,” he says.

Brian pairs commercial instinct with deep technical perspective, a rare mix. The clarity of thought and action that he and the Unstructured team bring to the space has excited us since the very beginning, when we partnered at the seed. Now we’re excited to invest in the Series A and welcome friends from Madrona, Shield, M12, Langchain, Weaviate and more.

Meet the Founder: ‘From the CIA to AI’


It takes grit to solve a deep, technical problem that many see as perfunctory data munging, and Brian and the Unstructured team have it.  

Brian attributes his persistence to his time as a P&L leader at Primer, one of the leading providers of natural language processing (NLP) platforms to both the public and private sectors. “I was able to look over the horizon and see around the corner a little bit,” Brian says. “Every organization generates enormous volumes of natural language data each day. But it’s been cost- and time-prohibitive to take these things like slides, PDFs, emails and Slack messages and make them available to machine learning models.”

The idea for Unstructured was born from a simple conviction: AI is only as good as the data it can access. Brian recruited Crag Wolfe, then director of architecture and infrastructure at Primer, and Matt Robinson, a PhD data scientist and ultramarathon runner who also came to Primer from the CIA, as the company’s first engineers.

Just a few months later, OpenAI released ChatGPT to the world, jump-starting the current wave of interest in LLMs. Demand for what Unstructured was building spiked dramatically as teams looked for faster, more scalable ways to apply AI to their own data.

How Unstructured Works: Bringing order to chaos

Unstructured allows companies to build better LLMs. (Unstructured)

Modern foundation models, from open-source examples like Llama 2 and Vicuna to hosted ones like Claude 2 and GPT-4, are trained on billions of tokens of data gathered from the public internet. The result is a set of models that are amazingly general-purpose, seemingly able to adapt to any domain with a small number of well-curated examples, whether passed into the context window or used for fine-tuning (the LIMA paper showed strong results with roughly 1,000 curated training examples). But this malleability brings a tenuous relationship with groundedness.

Enterprises have turned to vector databases and retrieval-augmented foundation models to make LLMs more attributable and less prone to hallucination. But what do you put in a vector database? If you try to embed raw unstructured files like PowerPoints, PDFs, images and webpages directly, you end up with embeddings that aren’t very useful for the nearest-neighbor search that vector databases specialize in.
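The retrieval-augmented pattern is sketched below in plain Python: clean text chunks are embedded, indexed, and fetched by nearest-neighbor search to ground the model’s answer. The embed_chunk function is a toy stand-in for a real embedding model and the chunks are invented; the point is only that what you index needs to be clean text, not raw files.

```python
import numpy as np

def embed_chunk(text: str, dim: int = 256) -> np.ndarray:
    """Toy stand-in embedding: a hashing-trick bag of words.
    A real pipeline would use a learned sentence-embedding model."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Clean text chunks -- in practice, the output of a pre-processing step,
# not raw PDF bytes or PowerPoint XML.
chunks = [
    "Q3 revenue grew 12 percent quarter over quarter.",
    "The incident postmortem traced the outage to a misconfigured load balancer.",
    "Employees may expense up to 75 dollars per day for travel meals.",
]
index = np.stack([embed_chunk(c) for c in chunks])

def retrieve(query: str, k: int = 2) -> list[str]:
    """Cosine-similarity nearest-neighbor search over the chunk vectors."""
    scores = index @ embed_chunk(query)
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

# Retrieved chunks are stuffed into the prompt to ground the model's answer.
question = "What caused the outage?"
context = "\n".join(retrieve(question))
print(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```

The quality of that retrieval step depends entirely on how well the source documents were turned into text chunks in the first place, which is where Unstructured comes in.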

Brian started Unstructured with the goal of helping engineering and data teams quickly and easily clean, strip, segment and format a variety of ubiquitous file types into a standard JSON response that can be used to train, fine-tune or prompt a language model. To do that, the Unstructured team has built an in-house vision transformer model that improves with each new file, and recently launched a hosted API to make usage easier.
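A minimal sketch of that workflow with the open-source unstructured Python library looks something like the following; the input file is hypothetical, and the exact fields on each element can vary by library version and file type.

```python
import json

# pip install "unstructured[pdf]"
from unstructured.partition.auto import partition

# Hypothetical input file. partition() detects the file type, routes it to the
# appropriate parser (PDF, PPTX, HTML, email, ...) and returns document elements.
elements = partition(filename="quarterly_report.pdf")

# Each element carries its text, a type (Title, NarrativeText, Table, ...) and
# metadata such as page number -- ready to chunk, embed, fine-tune on, or prompt with.
records = [element.to_dict() for element in elements]
print(json.dumps(records[:3], indent=2))
```

From there, the same JSON can feed a fine-tuning dataset or the vector-database retrieval pattern sketched above.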

The result is a tool that allows companies and organizations of all sizes to maximize the potential of their data and build specialized chatbots and LLMs tuned to their operations.

What’s Next: Building the new foundation

Unstructured integrates with products like AWS, Azure, Dropbox, Office and OneDrive, core systems of record that are updated constantly as organizations work and evolve. Unstructured sees a large opportunity ahead in making this data available in real time across more systems.

Brian says that he and the team aspire for Unstructured to become the standard extract, transform and load (ETL) system for unstructured data. The product today is well-suited to transforming batch data, but the team is focused on making it part of a larger pipeline for connecting LLMs to the latest data in real time.

As Unstructured evolves into the ETL platform of choice for LLM developers, Brian says, the company will continue to focus on solving the unique data problems that affect AI systems.

“We’ll be flexible as we grow, but there’s a reason no one else does this,” he says, laughing. We’re honored to join the Unstructured team on this mission.