
What We've Learned From a Year of Co-Building with LLMs (Part 2)

This article is the second part of a year-end summary of learnings from the practical application of Large Language Models (LLMs), covering operational insights at the intersection of data, models, products, and teams.

Many leaders have repeated a possibly apocryphal quote: "Amateurs talk strategy and tactics; professionals talk operations." Where the tactical view sees a thicket of unique problems, the operational view sees patterns of organizational dysfunction to repair. Where the strategic view sees opportunities, the operational view sees challenges worth rising to.

In the first part of this article, we covered the tactical aspects of working with LLMs. In the next installment, we'll zoom out to cover long-term strategic considerations. In this part, we discuss the operational aspects of building LLM applications, which sit between strategy and tactics and bring rubber to the road.

Operating an LLM application raises some questions that are familiar from operating traditional software systems, though they often play out in novel ways that keep things interesting. LLM applications also raise entirely new questions. We split these questions, and our answers, into four parts: data, models, products, and people.

Data:

  • How and how often should you check LLM inputs and outputs?
  • How do you measure and reduce development-prod skew?

Models:

  • How do you integrate language models into the rest of the stack?
  • How should you consider model versioning and migration between models and versions?

Products:

  • When should design be involved in the application development process, and why is it "the earlier, the better"?
  • How do you design user experiences rich in human feedback?
  • How do you prioritize a multitude of conflicting demands?
  • How do you calibrate product risk?

People:

  • Who should you hire to build successful LLM applications, and when should you hire them?
  • How do you foster the right culture, namely a culture of experimentation?
  • How should you use emerging LLM applications to build your own LLM applications?
  • Which is more important: process or tools?


Operations: Developing and Managing LLM Applications and the Teams Behind Them

Data

Just as the quality of ingredients determines the taste of a dish, the quality of input data constrains the performance of machine learning systems. Moreover, output data is the only way to tell whether a product is working. All of the authors pay close attention to data, spending several hours a week reviewing inputs and outputs to better understand the data distribution: its patterns, its edge cases, and the limitations of the models that consume it.

Checking for Development-Prod Skew

A common source of error in traditional machine learning pipelines is train-serve skew, which occurs when the data used in training differs from the data the model encounters in production. Although we can use LLMs without training or fine-tuning, and thus have no training set, development-prod skew in the data can arise in much the same way. Essentially, the data we test our system against during development should mirror what the system will face in production; otherwise, production accuracy may suffer.

LLM development-prod skew can be divided into two types: structural and content-based. Structural skew includes issues such as formatting differences (for example, a JSON dictionary with list-type values versus a JSON list), inconsistent casing, and errors such as typos or sentence fragments. These errors can lead to unpredictable model performance because different LLMs are trained on specific data formats, and prompts can be extremely sensitive to minor changes. Content-based or "semantic" skew refers to differences in the meaning or context of the data.

Just as in traditional machine learning, it is useful to periodically measure the skew between LLM input/output pairs. Simple metrics, such as input and output lengths or specific formatting requirements (e.g., JSON or XML), are a straightforward way to track changes. For more "advanced" drift detection, consider clustering embeddings of input/output pairs to detect semantic drift, such as shifts in the topics users are discussing, which may indicate they are exploring areas the model has not encountered before.
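As an illustration, here is a minimal sketch of tracking both kinds of skew. The helper names and data-access details (the sample lists and the embedding matrices) are assumptions standing in for your own logging and embedding setup, not a prescribed implementation.

```python
# Minimal skew-tracking sketch. Assumes you can pull lists of dev/prod texts
# and precomputed embedding matrices; those pieces are placeholders here.
import json
import numpy as np

def simple_metrics(text: str) -> dict:
    """Cheap structural signals worth tracking over time."""
    try:
        json.loads(text)
        is_valid_json = True
    except (json.JSONDecodeError, TypeError):
        is_valid_json = False
    return {"length": len(text), "is_valid_json": is_valid_json}

def metric_summary(samples: list[str]) -> dict:
    """Average structural metrics over a batch of inputs or outputs."""
    rows = [simple_metrics(s) for s in samples]
    return {
        "avg_length": float(np.mean([r["length"] for r in rows])),
        "pct_valid_json": float(np.mean([r["is_valid_json"] for r in rows])),
    }

def centroid_shift(dev_embeddings: np.ndarray, prod_embeddings: np.ndarray) -> float:
    """Crude semantic-drift signal: cosine distance between the centroids of
    dev and prod embeddings. Per-topic clustering is a natural next step."""
    a, b = dev_embeddings.mean(axis=0), prod_embeddings.mean(axis=0)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Comparing metric_summary() on development versus production samples, and watching centroid_shift() over time, gives a cheap early warning long before a full evaluation run.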

When testing changes, such as prompt engineering, make sure the holdout datasets are up to date and reflect the most recent types of user interactions. For example, if spelling errors are common in production inputs, they should also appear in the holdout data. Beyond numerical skew measurements, qualitative assessment of outputs is helpful too: regular "vibe checks" of model outputs ensure that results meet expectations and remain relevant to user needs. Finally, adding nondeterminism to skew checks is also useful. By running the pipeline multiple times on each input in the test dataset and analyzing all outputs, we increase the likelihood of catching anomalies that occur only occasionally.
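The nondeterminism point lends itself to a short sketch. Here, pipeline() is a hypothetical callable wrapping your prompt and model call, and outputs are assumed to be strings so they can be compared directly.

```python
# Sketch: run each holdout input through the pipeline several times and flag
# inputs whose outputs disagree across runs (a cheap proxy for instability).
from collections import Counter

def check_with_uncertainty(pipeline, holdout_inputs, n_runs: int = 5):
    flagged = []
    for item in holdout_inputs:
        outputs = [pipeline(item) for _ in range(n_runs)]  # assumed to return strings
        if len(set(outputs)) > 1:
            flagged.append({"input": item, "outputs": Counter(outputs)})
    return flagged
```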

Reviewing Daily Samples of LLM Inputs and Outputs

LLMs are dynamic and continually evolving. Despite their impressive zero-shot capabilities and generally satisfactory outputs, their failure modes are very hard to predict. For custom tasks, regularly reviewing data samples is crucial for an intuitive understanding of LLM performance.

Input-output pairs from production are the "genchi genbutsu" of LLM applications, and they cannot be substituted. Recent research indicates that developers' perceptions of "good" and "bad" outputs shift as they interact with more data (i.e., criteria drift). Although developers can define some criteria up front for evaluating LLM outputs, these predefined criteria are often incomplete. For instance, during development we might update the prompt to increase the probability of good responses and decrease the probability of bad ones. This iterative process of evaluation, re-evaluation, and criteria updating is necessary because it is difficult to predict either LLM behavior or human preferences without observing the outputs directly.

To manage this effectively, we should log LLM inputs and outputs. By reviewing samples of these logs daily, we can quickly identify and adapt to new patterns or failure modes. When we spot a new issue, we can immediately write an assertion or evaluation around it. Likewise, any updates to failure-mode definitions should be reflected in the evaluation criteria. These "vibe checks" are signals of bad outputs; code and assertions make them actionable. Finally, this attitude must be socialized, for example by adding review or annotation of inputs and outputs to the on-call rotation.
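As one possible shape for this, here is a minimal sketch of logging interactions to a JSONL file, sampling them for daily review, and codifying a failure mode spotted during review as an assertion. The file path and the example assertion are illustrative assumptions, not a prescribed setup.

```python
# Sketch: log every LLM interaction, sample a few each day for human review,
# and turn a reviewed failure mode into an executable assertion.
import datetime
import json
import random

LOG_PATH = "llm_io_log.jsonl"  # placeholder; use your own log store

def log_interaction(prompt: str, output: str) -> None:
    record = {
        "ts": datetime.datetime.utcnow().isoformat(),
        "prompt": prompt,
        "output": output,
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

def daily_sample(k: int = 20) -> list[dict]:
    """Pull a random sample of logged interactions for human review."""
    with open(LOG_PATH) as f:
        records = [json.loads(line) for line in f]
    return random.sample(records, min(k, len(records)))

def assert_no_apology_prefix(output: str) -> None:
    """Example assertion written after review surfaced a failure mode:
    responses that open with an unnecessary apology."""
    assert not output.lower().startswith("i'm sorry"), "Unwanted apology prefix"
```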

Using Models

With LLM APIs, we can rely on intelligence provided by a handful of vendors. While this is a boon, the dependency also involves trade-offs in performance, latency, throughput, and cost. Moreover, with newer and better models appearing almost every month over the past year, we should be prepared to update our products as we deprecate old models and migrate to new ones. In this section, we share lessons from working with technologies we do not fully control, where the models cannot be self-hosted and managed.

Generating Structured Outputs to Simplify Downstream Integration

For most practical use cases, the output of LLMs will be used by downstream applications in some machine-readable format. For example, the real estate CRM Rechat requires structured responses so that the frontend can present widgets. Similarly, the tool Boba for generating product strategy ideas needs structured outputs that include fields such as title, summary, plausibility score, and time frame. Finally, LinkedIn shared information about constraining LLM-generated YAML, which is then used to decide which skills to use and provide parameters for calling those skills.

This application pattern is an extreme version of Postel's Law: be liberal in what you accept (arbitrary natural language) and conservative in what you send (typed, machine-readable objects). As such, we expect it to be very durable.

Currently, Instructor and Outlines are the de facto standards for eliciting structured outputs from LLMs. If you are using an LLM API (e.g., Anthropic, OpenAI), use Instructor; if you are using a self-hosted model (e.g., Hugging Face), use Outlines.
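To make this concrete, here is a minimal sketch using Instructor against the OpenAI API. The schema mirrors the Boba-style fields mentioned above; the model name is only an example, and depending on your Instructor version you may need instructor.patch(OpenAI()) instead of instructor.from_openai(...).

```python
# Sketch: elicit a typed, machine-readable object from an LLM via Instructor.
import instructor
from openai import OpenAI
from pydantic import BaseModel

class ProductIdea(BaseModel):
    title: str
    summary: str
    plausibility_score: float  # e.g., 0 to 1
    time_frame: str

# Wrap the OpenAI client so responses are validated against a Pydantic schema.
client = instructor.from_openai(OpenAI())

idea = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    response_model=ProductIdea,  # Instructor validates (and retries) against this schema
    messages=[{"role": "user", "content": "Propose one product strategy idea for a real estate CRM."}],
)
print(idea.model_dump())  # a plain dict, ready for downstream code
```

The benefit of this pattern is that downstream code consumes a validated object rather than parsing free text, which keeps the "conservative in what you send" half of Postel's Law enforceable.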

Cross-Model Migration of Prompts is Very Tricky

Sometimes, prompts that we carefully design work well on one model but not on another. This can happen when we switch between different model providers and when we upgrade between different versions of the same model.

For example, Voiceflow found that migrating from gpt-3.5-turbo-0301 to gpt-3.5-turbo-1106 caused a 10% drop on its intent classification task. (Thankfully, they had evals!) Similarly, GoDaddy observed a trend in the positive direction, where upgrading to version 1106 narrowed the performance gap between gpt-3.5-turbo and gpt-4. (Or, if you are a glass-half-full person, you might be disappointed that gpt-4's lead shrank with the new upgrade.)

Therefore, if we have to migrate prompts across models, expect it to take more effort than simply swapping API endpoints. Do not assume that plugging in the same prompt will yield similar or better results. In addition, having reliable automated evaluations helps measure task performance before and after migration and reduces the amount of manual verification required.
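As an illustration of the kind of before-and-after check that helps here, below is a minimal sketch that runs a labeled holdout set through two model versions and compares accuracy. run_intent_classifier() is a hypothetical wrapper around your prompt and API call; the model names echo the Voiceflow example above.

```python
# Sketch: compare task accuracy on the same labeled holdout set before and
# after a model migration. run_intent_classifier(model_name, text) is assumed
# to return a predicted intent string.
def evaluate(run_intent_classifier, model_name: str, holdout: list[tuple[str, str]]) -> float:
    correct = 0
    for text, expected_intent in holdout:
        correct += run_intent_classifier(model_name, text) == expected_intent
    return correct / len(holdout)

# Illustrative usage:
# old_acc = evaluate(run_intent_classifier, "gpt-3.5-turbo-0301", holdout)
# new_acc = evaluate(run_intent_classifier, "gpt-3.5-turbo-1106", holdout)
# print(f"accuracy before: {old_acc:.2%}, after: {new_acc:.2%}")
```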

Versioning and Pinning Models

In any machine learning process, "changing anything changes everything." This is especially important because we rely on components like Large Language Models (LLMs) that we do not train ourselves and that can change without our knowledge.

Fortunately, many model providers offer the option to "pin" a specific model version (e.g.