It’s been almost one year since a new breed of artificial intelligence took the world by storm. The capabilities of these new generative AI tools, most of which are powered by large language models (LLMs), forced every company and employee to rethink how they work. Was this new technology a threat to their job or a tool that would amplify their productivity? If you don’t figure out how to make the most of GenAI, are you going to get outclassed by your peers?
This paradigm shift placed a dual burden on engineering and technical leaders. First there’s the internal demand to understand how your organization is going to adopt these new tools and what you need to do to avoid falling behind your competitors. Second, if you're selling software and services to other companies, you're going to find that many have paused spending on new tools while they sort out exactly what their approach should be to the GenAI era.
It's not easy to know where to turn for advice. Every major tech company is pushing out products and platforms at warp speed and startups are proliferating like lantern moths. The only thing growing faster than parameters in neural networks is the number of newsletters promising to cut through the clutter to deliver the “essential” AI news of the day. And let’s not get started on the social media gurus who specialized in advice on NFTs a year ago and are suddenly AI experts today.
There is a ton of hype, and it can be exhausting trying to figure out where to direct your resources. Before you can dive into the details of what to do with the answers or art your GenAI is creating, you need a robust foundation to ensure it’s operating well. To help, we’ve come up with four key areas you’ll need to understand to make the most of the time and resources you invest.
- Vector Databases
- Embedding Models
- Retrieval Augmented Generation
- Knowledge Bases
These are almost certain to be fundamental pieces of your AI stack, so read on below to learn more about the four pillars needed for effectively adding GenAI to your organization.
Vector Databases

To make use of a large language model, you’re going to need to vectorize your data. That means the text you feed into the model is going to be reduced to arrays of numbers, and each array is going to be plotted as a vector on a map, albeit one with thousands of dimensions. Finding similar text is reduced to finding the distance between two vectors. This allows you to move from the old-fashioned approach of lexical keyword search—typing a few terms and getting back results that share those keywords—to semantic search: typing a query in natural language and getting back a response that understands a coding question about Python is probably referring to the programming language and not the large snake.
“Traditional data structures, typically organized in structured tables, often fall short of capturing the complexity of the real world,” says Weaviate’s Philip Vollet. “Enter vector embeddings. These embeddings capture features and representations of data, enabling machines to understand, abstract, and compute on that data in sophisticated ways.”

How do you choose the right vector database? In some cases it may depend on the tech stack your team is already using. Stack Overflow went with Weaviate in part because it allowed us to continue using PySpark, which was the initial choice for our OverflowAI efforts. On the other hand, you may have a database provider, like MongoDB, which has been serving you well. Mongo now includes vectors as part of their OLTP DB, making it easy to integrate with your existing deployments. Expect this to be standard for database providers in the future. As Louis Brady, VP of Engineering at Rockset, explained, most companies will find that a hybrid approach combining a vector database with your existing system offers you the most flexibility and the best results.

Vollet, the Head of Developer Growth at Weaviate, says there is a lot to weigh when starting out. “There are numerous factors to consider when evaluating which vector database to select. For those implementing machine learning pipelines or AI applications, transitioning from prototyping to production brings critical considerations such as horizontal scaling, multi-tenancy, and more. Factors such as data compression or isolation, essential for compliance and data security, must be validated upfront when choosing the right product.”
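To make the vector idea concrete, here is a minimal, self-contained sketch of semantic search over toy vectors. The three-dimensional “embeddings” and document titles are invented for illustration; a real system would get vectors with hundreds or thousands of dimensions from an embedding model and store them in a vector database.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two vectors: close to 1.0 means same direction (similar meaning)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (real ones have hundreds or thousands of dimensions).
documents = {
    "python list comprehension": [0.9, 0.1, 0.2],
    "ball python care guide":    [0.1, 0.9, 0.3],
}

# Pretend this is the embedding of "how do I loop over a list in Python?"
query = [0.8, 0.2, 0.1]

# Nearest neighbor by cosine similarity: the programming post wins, not the snake.
best = max(documents, key=lambda doc: cosine_similarity(query, documents[doc]))
print(best)  # python list comprehension
```

A vector database performs essentially this comparison, but with index structures that make it fast across millions of vectors instead of a brute-force loop.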
Want to learn more about vector databases? Check out our deep dive about taking this technology from prototype to production.
Embedding Models

How do you get your data into the vector database in a way that accurately organizes it by content? For that, you’ll need an embedding model. This is the software system that takes your text and converts it into the arrays of numbers you store in the vector database. There are many models to choose from, and they vary greatly in cost and complexity. For this article, we’ll focus on embedding models that work with text, although embedding models can also be used to organize information about other types of media, like images or songs.
As Dale Markowitz wrote on the Google Cloud blog, “If you’d like to embed text–i.e. to do text search or similarity search on text–you’re in luck. There are tons and tons of pre-trained text embeddings free and easily available.” One example is the Universal Sentence Encoder, which “encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.” With just a few lines of Python code, you can prepare your data for a GenAI chatbot style interface. If you want to take things a step further, Dale also has a great tutorial on how to prototype a language powered app using nothing more than Google Sheets and a plugin called Semantic Reactor.
You’ll need to evaluate the tradeoff between the time and cost of putting huge amounts of text into your embedding model and how thinly you slice the text, which is usually chunked into sections like chapters, pages, paragraphs, sentences, or even individual words. The other tradeoff is the precision of the embeddings: how many decimal places to keep in each vector, since every extra decimal place increases storage size. Over thousands of vectors covering millions of tokens, this adds up. You can use techniques like quantization to shrink the stored vectors down, but it’s best to consider the amount of data and the degree of detail you’re looking for before you choose which embedding method is right for you.
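As a rough illustration of those two tradeoffs, the sketch below chunks text by word count and quantizes a float vector to 8-bit integers. The chunk size, overlap, and scale factor are arbitrary values chosen for the example, not recommendations.

```python
def chunk_text(text, max_words=50, overlap=10):
    """Split text into overlapping word-count chunks before embedding.
    Overlap helps preserve context that would otherwise be cut at a boundary."""
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words])
            for i in range(0, max(len(words) - overlap, 1), step)]

def quantize(vector, scale=127.0):
    """Map floats in [-1, 1] to 8-bit ints: ~4x less storage than 32-bit floats,
    at the cost of some precision."""
    return [round(x * scale) for x in vector]

def dequantize(vector, scale=127.0):
    """Recover approximate floats from the quantized ints."""
    return [x / scale for x in vector]

chunks = chunk_text("word " * 120)      # 120 words -> 3 overlapping chunks
small = quantize([0.5, -0.25, 1.0])     # [64, -32, 127]
```

Note that `dequantize(quantize(v))` is only approximately `v`; whether that lost precision matters depends on how fine-grained your similarity comparisons need to be.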
If you want to dive deeper, check out this podcast with our head of data science and one of our lead engineers, both of whom worked on the embedding approach for Overflow AI.
Retrieval Augmented Generation (RAG)
Big AI models read the internet to gain knowledge. That means they know the earth is round…and they also know that it’s flat.
One of the main problems with large language models like ChatGPT is that they were trained on a massive set of text from across the internet. That means they’ve read a lot about how the earth is round, and also a lot about how the earth is flat. The model isn’t trained to understand which of these assertions is correct, only the probability that a certain response will be a good match for the query the user enters. It also blends those inputs into a statistically probable new output, which is where hallucinations can occur: the response may match neither source, which is why checking sources is so important.
With RAG, you can limit the dataset the model searches, meaning the model hopefully won’t be drawing on inaccurate data. Secondly, you can ask the model to cite its sources, allowing you to verify its answer against the ground truth. At Stack Overflow, that might mean containing queries to just the questions on our site with an accepted answer. When a user asks a question, the system first searches for Q&A posts that are a good match. That’s the retrieval part of this equation. A hidden prompt then instructs the model to do the following: synthesize a short answer for the user based on the answers you found that were validated by our community, then provide the short summary along with links to the three posts that were the best match for the user’s search.
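A bare-bones sketch of that retrieve-then-prompt flow might look like the following. The word-overlap scoring and the prompt wording are stand-ins invented for this example; a production system would use vector search for retrieval and send the assembled prompt to an LLM.

```python
def retrieve(query, posts, top_k=3):
    """Toy retrieval: rank posts by word overlap with the query.
    A real system would embed the query and run a vector search instead."""
    query_words = set(query.lower().split())
    scored = sorted(
        posts,
        key=lambda p: len(query_words & set(p["title"].lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(query, posts):
    """Hidden prompt instructing the model to answer only from retrieved posts."""
    context = "\n".join(f"- {p['title']}: {p['answer']}" for p in posts)
    return (
        "Answer the user's question using ONLY the community-validated posts below.\n"
        "Cite the matching posts, and say so if none of them apply.\n\n"
        f"Posts:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

# Hypothetical mini-corpus of accepted answers.
posts = [
    {"title": "How to reverse a list in Python", "answer": "Use my_list[::-1]."},
    {"title": "Reverse a string in Java", "answer": "Use StringBuilder.reverse()."},
]
top = retrieve("reverse a list in python", posts, top_k=1)
prompt = build_prompt("reverse a list in python", top)
```

The key point is that the model never sees the whole corpus, only the retrieved context plus instructions, which is what constrains its answer and makes citation possible.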
A third benefit of RAG is that it allows you to keep the data the model is using fresh. Training a large model is costly. Many of the popular models available today are based on training data that ended months, or even years ago. Ask it a question about something after that, and it will happily hallucinate a convincing response, but it doesn’t have actual information to work with. RAG allows you to point the model at a specific dataset, one that you can keep up to date without having to retrain the entire model.
RAG means the user still gets the benefit of working with an LLM. They can ask questions using natural language and get back a summary that synthesizes the most relevant information from a vast data store. At the same time, drawing on a predefined data set helps to reduce hallucinations and gives the user links to the ground truth, so they can easily check the model’s output against something generated by humans.
This approach works for whatever domain your organization is focused on. Lawyers can use RAG to ensure their search is confined to verified legal documents, and doesn’t draw on John Grisham fanfiction the model might have found while training on text from around the web. A company with a medical focus can constrain their search to scientific research, avoiding answers the model may have learned while reading scripts for old episodes of General Hospital. And of course, it makes it easier for companies to query proprietary data without worrying their private information could become part of another company’s training data.
If you want to learn more about how we use techniques like RAG, check out our blog post on making the transition from lexical to semantic search on Stack Overflow. If you’re looking to benchmark your RAG pipelines, check out this interesting project.
Knowledge Bases

As mentioned in the previous section, RAG can constrain the text your model is drawing on when generating its response. Ideally that means you’re giving it accurate data, not just a random sampling of things it’s read on the internet. One of the most important laws of training an AI model is that data quality matters. Garbage in, garbage out, as the old saying goes, holds very true for your LLM. Feed it low-quality or poorly organized text, and the results will be equally uninspiring.
Here at Stack Overflow, we kind of lucked out on the data quality issue. Question-and-answer is the format most organizations are adopting for the LLMs they use internally, and our dataset was already built that way. By analyzing vote counts and accepted answers, our Q&A pairs show us which information is accurate and which still lacks a sufficient confidence level. Votes can also be used to determine which of three similar answers might be the most widely utilized and thus the most valuable. Last but not least, tags allow the system to better understand how different information in your dataset is related.
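Here is one minimal way signals like these could be combined when choosing which answer to feed into a RAG pipeline. The dictionary fields and the tie-breaking rule are hypothetical, for illustration only, and are not how Stack Overflow’s production ranking actually works.

```python
def best_answer(answers):
    """Prefer the accepted answer; fall back to the highest vote count.
    Tuples compare element by element, and True sorts above False."""
    return max(answers, key=lambda a: (a["accepted"], a["votes"]))

answers = [
    {"id": 1, "accepted": False, "votes": 120},
    {"id": 2, "accepted": True,  "votes": 45},
    {"id": 3, "accepted": False, "votes": 45},
]

top = best_answer(answers)  # the accepted answer wins even with fewer votes
```

Whatever ranking you choose, the point is the same: community signals like votes, acceptance, and tags let you filter a large corpus down to the content you actually trust before the model ever sees it.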
If you want to dive deeper on this topic, we’ve got a great article explaining why Knowledge Management is foundational to AI success. And if you want to explore how other organizations are tackling this issue, check out our conversation with Sorcero, a company focused on using AI to help make the latest medical research more widely available and understandable.
Ok, you’ve made it through our primer on the four pillars you’ll need to understand to successfully implement GenAI in your organization. If you want to learn about our own experience exploring GenAI, check out this recorded session, Stack Overflow’s AI Journey: Lessons Learned on the Road to GenAI. And of course, if you are interested in adding GenAI to your tech stack but don’t feel you have a great knowledge base set up, check out Stack Overflow for Teams. Our platform is used by organizations like Microsoft and Bloomberg to help their engineers document internal knowledge, improve collaboration, and boost productivity.
Learn more about Stack Overflow for Teams
Trusted by top Fortune 100 companies, Stack Overflow for Teams empowers organizations for an AI future.