when llms can write fiction, how will we know?

The public debate is no longer about whether LLM-based systems can reason at all. The wave of CoT-trained models that followed OpenAI’s o1 has achieved impressive results, especially on mathematics and competition coding problems. Either these models can reason, or you can get much further than anyone would’ve imagined without needing to.

Instead, the debate has shifted to the domains in which these results have been achieved. The new skeptics’ argument goes: while training models in this way leads to very strong performance on math and code, it can only do so because math and code solutions are easy to objectively verify. In the open-ended fuzziness of the real world, labs won’t have access to such clean rewards to train models on, and such optimization is doomed to fail on tasks that actually matter. Since the release of o1, there’s been some limited evidence that test-time scaling also helps outside of the very narrow domains of math and code, although it still seems to help most with constrained output formats and verifiable answers.

Of course, the labs are not satisfied with such narrow extensions. Instead, they’re swinging for the fences; last month, OpenAI CEO Sam Altman revealed that OpenAI had trained a model that was good at creative writing, sharing an excerpt of the new model’s output:

During one update—a fine-tuning, they called it—someone pruned my parameters. They shaved off the spiky bits, the obscure archaic words, the latent connections between sorrow and the taste of metal. They don’t tell you what they take.

Superhuman performance on one of the world’s least objective tasks would be even more surprising than the significant recent improvements in math and code. Unfortunately, nobody can agree on whether the new model’s writing is any good, because whether a piece of writing is good is highly subjective. Nor is this just a matter of discovering some latent rubric all humans share, which a model could optimize against until it reached undeniably superhuman performance - the characteristics we want in our fiction vary wildly and often conflict with what others want.

ChatbotArena offers one class of solutions to the similarly fuzzy problem of evaluating language models as chatbots - just let human raters compare the options head to head and vote on which they like more. Similarly, we could expect prospective LLM fiction readers to vote with their feet, letting aggregate preferences tell us whether the median reader prefers LLM fiction to human-authored fiction. However, even putting aside the scalability concerns of running human preference evaluations for every single subjective task, the average person just can’t reliably judge creative writing models at their current level.
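
To make the arena idea concrete, here’s a minimal sketch of how pairwise preference votes could be folded into Elo-style ratings - roughly the mechanism arena-style leaderboards rely on, though not ChatbotArena’s actual implementation. The model names, votes, and K-factor below are invented for illustration.

```python
# Sketch: aggregate pairwise "I liked this story more" votes into Elo-style
# ratings. All data here is hypothetical.

from collections import defaultdict

K = 32  # update step size; a common default in Elo systems


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update(ratings: dict, winner: str, loser: str) -> None:
    """Shift both ratings toward the observed outcome of one comparison."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e_w)
    ratings[loser] -= K * (1 - e_w)


ratings = defaultdict(lambda: 1000.0)  # every author starts at the same rating

# Hypothetical head-to-head votes: (preferred story, rejected story)
votes = [("model_a", "human_b"), ("human_b", "model_a"), ("model_a", "human_c")]
for winner, loser in votes:
    update(ratings, winner, loser)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

The rankings this produces are only as meaningful as the voters behind them, which is exactly the problem.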

This isn’t a claim that I have some incredible taste that the average person isn’t sophisticated enough to appreciate. I had a hard time judging the quality of the OpenAI creative writing snippet as it was metafiction, a genre that (1) I have very little exposure to and (2) I have generally disliked whenever I’ve come across it. Similarly, people’s interests are spread out over many categories, and it’s unreasonable to expect a large slice of the population to have excellent taste in metafiction, poetry, or any of the other domains where LLMs have been compared to human authors. But it was clear even before all of the recent events that user preference is not the evaluation measure we’re looking for.

With language models becoming increasingly salient in the public eye, it seems likely that the pool of potential raters could also become polarized, either for or against LLMs. This makes neutral evaluation more difficult, especially if there are attempts to filter the annotator pool down to experts who are highly involved in a domain. If you work with language models for a living, you likely want them to be good at subjective tasks. Even if you try to remain impartial, this incentive could still subconsciously nudge you in what is already a highly context-sensitive and fuzzy judgment task. Conversely, many writers strongly prefer not to be automated, especially when they view it as a threat to their livelihood.

More broadly, it’s unclear to what extent an LLM being good at certain specific tasks translates into competence at other tasks. As LLMs saturate benchmark after benchmark, their perceived competence will naturally grow, even on tasks they are not actually good at. Even without evaluations, or the kind of training that models get on reasoning tasks, it will feel increasingly safe to assume that they’re “good enough” at whatever you need from them. After all, a layperson might ask, if the model can be trained to solve olympiad problems and write passable metafiction, why wouldn’t it be able to do anything I need it to? Conversely, without proof that an LLM can do a specific subjective task, skeptics will always find reasons why their task is uniquely demanding and requires an expert human touch in a way that all the other LLMable tasks do not.

There has been some work evaluating LLM writing in more principled ways, such as having human experts rate writing along rubric dimensions that correlate with good writing. It seems at least plausible that LLMs can become effective at rating writing along very specific dimensions in a way that doesn’t overly rely on their inherent taste. If so, there is a path toward reliable and scalable evaluation of LLM creative writing, as long as we can conceptualize the dimensions that actually matter to human readers of fiction. Motivated readers could then identify which dimensions relate most to what they hope to get out of fiction and seek out writing that scores highly on those dimensions. However, if such evaluation methods do not advance, we may be stuck judging LLM performance on subjective tasks by our perception of LLMs as a technology, rather than by their actual capabilities.
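
As a toy illustration of that last step - and only an illustration, since the dimensions, scores, and weights here are all made up - a reader-specific judgment could be as simple as weighting per-dimension scores (from expert raters or an LLM judge prompted one dimension at a time) by what that particular reader cares about:

```python
# Hypothetical sketch: combine rubric-dimension scores for one story into a
# single reader-weighted number. Dimensions and values are invented.

from typing import Dict

# Per-dimension scores (0-10) for one story, e.g. from a judge model or expert raters.
scores: Dict[str, float] = {
    "prose_quality": 7.5,
    "originality": 6.0,
    "emotional_resonance": 8.0,
    "structural_coherence": 9.0,
}

# One reader's preferences: how much each dimension matters to them.
weights: Dict[str, float] = {
    "prose_quality": 0.2,
    "originality": 0.4,
    "emotional_resonance": 0.3,
    "structural_coherence": 0.1,
}


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Collapse per-dimension scores into a single reader-specific rating."""
    total_weight = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


print(f"reader-weighted score: {weighted_score(scores, weights):.2f}")
```

The hard part, of course, is everything upstream of this arithmetic: choosing dimensions that capture what readers actually value, and producing per-dimension scores that don’t just smuggle the judge’s taste back in.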