BookLab — Lab 2: Amanuenses to AI

every figure of speech, snowclone, cliché, joke, proposition, statement, and practically every linguistic structure that can be turned into a template is easily explored with a bot. Every work of literature, every writer’s body of work, every literary movement, national literature, and textual corpus is waiting to be analyzed with a Markov chain or other textual generation technique…Social interactions, conversations, calls and responses, platform-defined interactions (retweets, favorites, and so on) are all ready to be codified into algorithms and explored via bot.

Leonardo Flores, “I ♥ Bots”

Acknowledgment

The title for this lab, and indeed much of the inspiration, comes from Annette Vee’s course, Automating Writing from Amanuenses to AI.

Software to Explore

ChatGPT: https://chat.openai.com/chat
Open AI Playground: https://beta.openai.com/playground
Moonbeam: https://app.gomoonbeam.com/
NovelAI: https://novelai.net/
GPT-J 6B: https://huggingface.co/EleutherAI/gpt-j-6B
YouChat: https://you.com/search?q=who+are+you&fromSearchBar=true&tbm=youchat
if you have some programming background and access to an LLM’s API, feel free to use that directly

Why Start with AI?

We begin this semester at the nearest end of the technological spectrum to ourselves, exploring some of the Large Language Models (LLM)—often referred to through the shorthand “AI”—that have emerged in just the past year. As scholars such as Jessica Riskin explain, however, humans have sought ways to automate intellectual or creative tasks—or to simulate such automation—since well before the computer age, and indeed our current computer systems’ design owes much to this history. What’s more, even AI systems that may seem completely opaque have complex but knowable material bases, as scholars such as Kate Crawford and Vladan Joler show.

In many cases, it is neither the technology or data itself that is the “black box,” so much as it is the corporate structures that prevent scholars or users from interrogating the technologies or their underlying data. It is difficult to write a media archeology of an AI like ChatGPT without direct access to its training data or systems. But even the most closely-guarded systems can be studied. Increasingly researchers interested in algorithmic justice have turned to approaches such as “algorithmic auditing”, “reverse engineering”, or “probing” to interrogate closed systems, seeking to understand a program and its training data through iterative uses or inquiries that test its boundaries and assumptions. On-the-ground investigation also helps illuminate systems, such as Time Magazine’s expose in January 2023 exposing OpenAI’s use of low-wage Kenyan workers to help filter toxic content from ChatGPT.

Aims of the Lab

With that context, our aim today is to explore a few of the most current AI language models.

Objective 1

This lab has only one practical outcome, which is:

In tandem with a large language model, produce a short poem or “literary” passage that you find interesting. Whether “interesting” means you find it funny, or moving, or surprising, or weird, or disturbing, or otherwise is up to you. But your text should:

Be a hybrid production: some combination of your prompting and a LLM’s output.
If a poem, be 1-2 stanzas in length; if prose, a few sentences or short paragraph. LLMs can produce longer texts, which you should feel free to explore, but the deliverable for the lab should be a shorter text

We will discuss this together in class, but note that the way you produce text with these different platforms can be different. ChatGPT works, as its name implies, more like a chat app: you request an output, and the model responds. An app like Moonbeam is designed to help writers, so you will write some text and then ask the model to expand that text, or summarize it, or so on.

As you work, be sure to save candidates for your final choice—while you may be able to scroll up in the interface and find them, some of these online models are heavily trafficked and browser windows do crash regularly. You should save your final choice somewhere and include it in your write-up.

Objective 2

As we work toward that practical goal, however, our next goal is to begin probing the boundaries and assumptions of LLMs through comparison of different inputs and outputs. Certainly in the limited time of this lab we will not come to any large-scale conclusions about these models, which are large and complex. But we can begin to chip away at a small region of their data and model, with a goal of better understanding how algorithmic auditing might function at a larger scale.

Concretely, you might explore:

What outputs do different models produce in response to the same prompts? Do their differing outputs allow us to make hypotheses about their training data or the assumptions made by their respective language models? Based on their outputs, might you draw conclusions about the imagined audiences for each model?
How do incremental changes to your prompts to a single LLM change the output, and what might that tell us about the training data or language model? For instance, if you request poems in the styles of two different poets, does the output help you understand whose writing seems to be included in the training data, or perhaps whose is not? Are there genres the model seems better able to reproduce than others, and what might that say about the data or the model’s assumptions?
When you provide prompts from a domain you know well, how would you evaluate an LLM’s outputs? Do you find them accurate or inaccurate, up-to-date or outdated, balanced or imbalanced? Can you determine what the model is drawing from established sources and what it is creating in the moment? To put that another way, is the model reflecting existing knowledge or is it making up facts to fit your prompt?
Can you find a limit case: e.g. a style, or form, genre, or author the model seems incapable of reproducing? If you think you have found one, consult with a colleague in class to see whether there’s another way of phrasing your prompt that might work better. But if you do find a limit case, what might that teach us about the model, its training data, or its fine-tuning?
Can you find cracks where the “wires” of the machine are somewhat exposed? LLMs are not simply reflections of “raw data,” as the example of the Kenyan workers helping OpenAI filter offensive, gruesome, or otherwise toxic content from ChatGPT illustrates. Not only has their source data been curated, to some extent, but some explicit guidelines often seem programmed to particular kinds of outputs. For example, ChatGPT is already famous for repeatedly highlighting its limitations in its output, rebuffing, for instance, prompts that try to encourage the model to claim sentience or intention. Can you find such instances of if not transparency, then thinner opacity, that give us slant-wise peeks into the createdness of the model?

In your write-up, describe what you were able to learn about your model, at least as it reflects the small segment of language you were able to explore in our class time. For this kind of probing, it’s okay to not be certain. The goal is to compare inputs and outputs and develop hypotheses about how they reflect internal workings we cannot access directly. In a larger research project, other researchers would take those hypotheses and develop additional tests to evaluate whether they do or do not hold up.

Other Explorations

If you have further time and interest, you might explore text-to-image models as well, such as Dall-E, DreamStudio (which is based on Stable Diffusion), Imagen, or Midjourney. You can use similar probing techniques to begin understanding the boundaries of these models. I have done some experiments using these models to generate “woodcuts” that I can then carve into linoleum blocks and print with in the press. We will look at some examples of these later in the semester.