So You Think You Can Prompt?
Pragmatic methods for getting AI systems to work in the real world.
We recently hosted a “So You Think You Can Prompt” mini-conference, which served as a semi-official “last session” of the excellent Mastering LLMs course created by Dan Becker and Hamel Husain. Our goal for the mini-conference was to learn about pragmatic methods for getting AI systems to work in the real world.
We started the day with three fireside chats featuring Daniel Svonava from Superlinked, Shishir Patil of Gorilla fame, and Raza Habib from Humanloop. In this post, we’ve pulled out a few highlights, along with low-quality video of their high-quality talks. (We ended the day with a prompt hack contest, and you can see highlights in Sasha Sheng’s post.)
We endeavored to have frank and tactical conversations, and each talk has a few surprises. Enjoy and listen to the full highlight reel here!
Daniel Svonava on retrieval
The “bitter lesson” for retrieval
Daniel: By expressing all the signals of both the data and the expectations of the goals of the query in the vectors on both sides, then you can just do cosine similarity to navigate those tradeoffs…
Bryan: Is what you’re pushing on here—the bitter lesson—but applied to the embedding space too?
If you’re retrieving documents for a user, you might start with a simple chunked text embedding over which you run a distance-based similarity search. But, straight from the start, you know that there’s a bunch of metadata that you want to use too. You might care about when the document was created, how often it is retrieved, who created it, etc. The temptation is to hand-engineer some filters (e.g. only run search over docs created in the last two weeks) or to do some kind of last-mile re-ranking (e.g. tell a LLM to prefer recently created documents, especially if the doc author is the user running the search).
Daniel argues that this is probably overdone. When you can, it’s better to embed the metadata along with the data and run vector search over this new (bigger / more complex) embedding. You can’t always do it (as we discuss next), but on average, people are probably trigger-happy about adding ad-hoc filters, re-ranking systems, and the like.
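Here is a minimal sketch of the idea in numpy: instead of a hard “created in the last two weeks” filter, recency becomes one more (weighted) dimension of the embedding, and plain cosine similarity handles the tradeoff. The embedding function, the half-life, and the weight are all made up for illustration.

```python
import numpy as np

def embed_text(text: str) -> np.ndarray:
    """Stand-in for any text-embedding API call (hypothetical); returns a unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

def recency_feature(age_days: float, half_life_days: float = 14.0) -> float:
    """Soft version of 'created in the last two weeks': decays instead of cutting off."""
    return float(np.exp(-age_days / half_life_days))

def embed_doc(text: str, age_days: float, recency_weight: float = 0.3) -> np.ndarray:
    """Append the metadata signal to the content embedding, then renormalize."""
    v = np.concatenate([embed_text(text), [recency_weight * recency_feature(age_days)]])
    return v / np.linalg.norm(v)

# The query carries the same recency slot set to "fresh", so cosine similarity
# now trades off topical match against age: no filter, no reranker.
query = embed_doc("quarterly revenue report", age_days=0.0)
docs = {
    "recent": embed_doc("quarterly revenue report", age_days=3.0),
    "stale": embed_doc("quarterly revenue report", age_days=200.0),
}
print({name: round(float(query @ d), 3) for name, d in docs.items()})
# the recent copy of identical content scores higher
```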
Step 1: Personalization, Step 2: ???, Step 3: Profit
Bryan: What about those of us who care about personalization?
Daniel: Personalization is really important… we should try to understand what [users’] goals are… This is one area where reranking is helpful. There are ways to do it that are complementary to reranking. You can try to figure out how to make a user vector, how to aggregate observations of what the user is doing, clicking on, etc. and then aggregate those observations into a vector and use that to bias the search vector.
User A and User B write the same search string. Should you retrieve the same documents for each of them? Probably not. These users probably have different goals, and you ought to personalize the retrieval. How? Daniel gets into this a bit more in the conversation, but the short answer is to think of ways that you can embed the users’ preferences into the search vector space. Reranking on its own may not be enough.
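A minimal sketch of one way to do that, assuming you already have document embeddings and a click log per user (the averaging and the blend weight alpha are illustrative choices, not Daniel’s recipe):

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

def user_vector(clicked_doc_embeddings: list) -> np.ndarray:
    """Aggregate observed behavior (here, embeddings of clicked docs) into one profile vector."""
    return normalize(np.mean(clicked_doc_embeddings, axis=0))

def personalized_query(query_emb: np.ndarray, user_emb: np.ndarray, alpha: float = 0.8) -> np.ndarray:
    """Bias the search vector toward the user's profile; alpha controls how much."""
    return normalize(alpha * normalize(query_emb) + (1 - alpha) * user_emb)

# Two users issue the same query string, but their click histories pull the
# search vector in different directions, so retrieval can differ per user.
rng = np.random.default_rng(0)
query = rng.normal(size=384)
user_a = user_vector([rng.normal(size=384) for _ in range(5)])
user_b = user_vector([rng.normal(size=384) for _ in range(5)])
q_a, q_b = personalized_query(query, user_a), personalized_query(query, user_b)
print(round(float(q_a @ q_b), 3))  # < 1.0: same query, different search vectors
```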
Just concatenate the vectors
Bryan: We are the benefactors of very powerful embeddings that are only an API call away — truly incredible, deep models. Our collective understanding of those latent spaces is comically low… And what you’re now saying is, “We’re going to make these latent spaces way more complicated. These encoders way more bespoke and non-intuitive. And frankly we’re going to cram way more intuition, or vibes, into these vectors.”
…
Daniel: Just concatenate the [embedding] vectors together!
If you take Daniel’s advice, you’ll soon find yourself with vectors coming out of your ears. You need to embed the original document. You need to embed the user’s personalization data. You need to embed the document metadata, etc. How can you combine all these vectors into something that you can do cosine-similarity search over? What if you… just… concatenate them all together? (I… worked on these vectors for a year… and… he just… he just concatenated them together.) It’s not as crazy as it sounds, as Daniel explains.
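Concretely, the sketch below (with made-up sub-vector sizes and weights) builds the document vector and the query vector the same way: unit-normalize each source, scale it by an importance weight, and concatenate. The dot product of the two concatenated vectors then decomposes into a weighted sum of per-part similarities, so the weights act as relevance knobs rather than hard filters.

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

def concat_embedding(parts: dict, weights: dict) -> np.ndarray:
    """Concatenate unit-normalized sub-vectors, each scaled by its importance weight."""
    return normalize(np.concatenate([weights[k] * normalize(parts[k]) for k in sorted(parts)]))

rng = np.random.default_rng(0)
weights = {"content": 1.0, "metadata": 0.3, "user_affinity": 0.5}  # illustrative knobs

def random_parts() -> dict:
    """Placeholder sub-vectors; in practice these come from your encoders."""
    return {"content": rng.normal(size=384),
            "metadata": rng.normal(size=8),
            "user_affinity": rng.normal(size=64)}

doc = concat_embedding(random_parts(), weights)
query = concat_embedding(random_parts(), weights)

# Because both sides are assembled identically, this score is a weighted sum
# of per-part cosine similarities (content, metadata, user affinity).
print(round(float(query @ doc), 3))
```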
Shishir Patil on LLM tool use
Sweating the details
Shishir: We had some users who wanted to call an API to determine the [monthly payment] a person would have if you were to key in their principal and interest rate… We know which API to call, but then once you call the API, should your interest rate as a field be 0.07 or 7? Is it 7 percent or is it 0.07 floating point value?
Shishir points out that function calling in the real world is a pretty complex task — it’s not just hitting a calculator app and calling it a day. You have to decide which API to use, construct a valid request for that API, and (among valid requests) actually use the API correctly, as in the mortgage example above. And all of this comes before the question of user intent (e.g., does the API request actually fulfill the user’s high-level ask?). It sounds hard. If only there were a group at Berkeley working on this problem…
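As an illustration of how many of those details live in the API spec itself, here is a hypothetical tool schema and implementation for the mortgage example. Everything here (names, ranges, the validation step) is our own sketch, not Gorilla’s format; the point is that the parameter description plus a range check is what catches 7 versus 0.07.

```python
# Hypothetical JSON-Schema-style tool definition; the unit convention lives in
# the parameter description so the model (and a validator) can get 0.07 vs 7 right.
MONTHLY_PAYMENT_TOOL = {
    "name": "monthly_payment",
    "description": "Compute the fixed monthly payment for a fully amortized loan.",
    "parameters": {
        "type": "object",
        "properties": {
            "principal": {"type": "number", "description": "Loan amount in dollars, e.g. 300000."},
            "annual_rate": {
                "type": "number",
                "description": "Annual interest rate as a decimal fraction (7% -> 0.07), not a percentage.",
                "minimum": 0,
                "maximum": 1,
            },
            "term_months": {"type": "integer", "description": "Loan term in months, e.g. 360."},
        },
        "required": ["principal", "annual_rate", "term_months"],
    },
}

def monthly_payment(principal: float, annual_rate: float, term_months: int) -> float:
    """Standard amortization formula; assumes annual_rate is a decimal fraction."""
    r = annual_rate / 12
    if r == 0:
        return principal / term_months
    return principal * r / (1 - (1 + r) ** -term_months)

# Validating the model-proposed arguments against the schema's range catches
# the percentage-vs-fraction mistake before the call ever executes.
args = {"principal": 300_000, "annual_rate": 0.07, "term_months": 360}
assert 0 <= args["annual_rate"] <= 1, "annual_rate must be a decimal fraction, e.g. 0.07"
print(round(monthly_payment(**args), 2))  # about 1995.91
```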
Training on the test data
Sasha: I’m curious if you think that the AI labs are training on your function calling data set.
Shishir: Oh, absolutely… This might be a modern take, but I feel it’s OK to train on [that] data. Why did we not want to do this before? It’s because our training data was sparse; you’re trying to train with ImageNet, CIFAR, these small data sets. So you had a fear of whether [the model] would generalize, whether it would memorize, etc. But today, if you’re trying to train on 15T tokens…
People post-train their models on Berkeley’s public function calling dataset. The private test dataset is very similar, so one might worry whether model trainers are over-fitting to Berkeley’s particular flavor of function calling.
But it’s fine. In fact, over-training on the kinds of function calls that people need in practice is probably a good thing. Function calling is a specific enough problem that if you make the right API calls on a large, diverse corpus of test cases, your model will probably generalize.
Raza Habib on evals and systems
I am active learning and so can you
Bryan: What are your thoughts on active learning for evaluation?
Raza: […] Not every data point is created equal. If your model already understands something perfectly well, why would you label that again? You want to go label stuff that the model will actually learn from… You can use a similar concept when it comes to evaluation.
Your model will be uneven: it’s going to perform better on some problems and worse on others. What you’d like to do is concentrate training on the examples where your model underperforms. A similar idea holds for evals: you’d like to stress-test the model exactly where it performs worst.
This raises the question of how to find or create these specific examples, which might be a very small fraction of your overall corpus. One thing you can do is pretend that the model’s predicted labels are the true labels. You will be excommunicated from the Church of Avoiding Sampling Bias, but it works — as Raza explains.
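One standard way to act on this is uncertainty sampling: treat the model’s own predicted probabilities as provisional labels and spend your labeling budget on the examples it is least sure about. A minimal sketch, assuming a classifier that exposes per-class probabilities (this is a generic active-learning recipe, not Humanloop’s specific method):

```python
import numpy as np

def predictive_entropy(probs: np.ndarray) -> np.ndarray:
    """Entropy of the predicted class distribution per example; higher means less sure."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def select_for_labeling(probs: np.ndarray, k: int) -> np.ndarray:
    """Pick the k examples the model is most uncertain about.

    Using the model's own predictions to decide what to label is exactly the
    sampling-bias sin described above: it skews the sample toward hard cases,
    which is the point for both training and stress-test evals.
    """
    return np.argsort(-predictive_entropy(probs))[:k]

# probs: (n_examples, n_classes) predicted probabilities from the current model
probs = np.array([[0.98, 0.02], [0.55, 0.45], [0.70, 0.30]])
print(select_for_labeling(probs, k=1))  # [1], the most ambiguous example
```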
It’s the system, man
Raza: The LLM doesn’t need to be good at everything. Thinking of the LLM as the whole system is a mistake.
LLMs are not good at everything. And that’s fine. That’s why we build things like tree search + neural networks (AlphaZero). That’s why we care about retrieval, function calling, planning, etc. Raza argues that we aren’t going to get away from systems anytime soon.
Affording affordances
Raza: One of the huge breakthroughs of InstructGPT was taking something that had within it lots of latent capabilities and making it seem more human-like — and therefore easier to interact with. Humans can reason about how it’s going to behave, which is very helpful for interacting with the system. I’ve often wondered for the neural networks that we had before that we were training on image generation or whatever it is that came before LLMs — how much capability was in there that we weren’t able to leverage because the only way we had to interact with these things was to sample the logits from the final layer.
A lot of the experienced usefulness of LLMs doesn’t come from capability gains; it comes from giving humans knobs and levers they understand how to use. InstructGPT didn’t really add capabilities to GPT-3, but it made it easy for a human to understand how to use it.
Want to come to an event like this in the future? Follow Bryan, Slater, and BCV on Twitter.