The next iteration was compressibility: if there are patterns in the message, then a compression algorithm can reduce the number of bits needed to store the full message, so the information content of the original text does not equal its length in bits. This could be simple Huffman encoding, or a more advanced algorithm like Lempel-Ziv-Welch, but one of the main points here is that the algorithm is completely content-agnostic. There are no databases of English words inside these compressors; they cannot substitute numerical IDs for the words themselves. That would be considered cheating in the generic compression game. (There are, of course, some very domain-specific compression algorithms which do build on knowledge of the data likely to be transmitted. HTTP/2 and SIP header compression are such examples.)
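To see this content-agnostic pattern-finding in action, here is a minimal Python sketch using zlib (a DEFLATE implementation, which combines LZ77 and Huffman coding): patterned text shrinks dramatically, while random bytes barely compress at all.

```python
import os
import zlib

# Repetitive text contains patterns a generic compressor can exploit.
patterned = b"the cat sat on the mat. " * 100
# Random bytes contain no exploitable patterns.
random_data = os.urandom(len(patterned))

for label, data in [("patterned", patterned), ("random", random_data)]:
    compressed = zlib.compress(data, 9)
    print(f"{label}: {len(data)} bytes -> {len(compressed)} bytes")

# Typical result: the patterned text shrinks to a small fraction of its
# original size, while the random data stays roughly the same length.
# zlib knows nothing about English; it only finds repetitions.
```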
Another interesting advance was the introduction of lossy compression. For certain applications (e.g., images, sounds, videos) it is not necessary to reproduce the original file bit by bit, only to generate something that looks or sounds very similar to the original media. This unlocked a huge potential for efficient compression. JPEG for images, MP3 for music and DivX for movies reached the broad population by shrinking these files to manageable sizes. They made digital mixtapes (i.e., self-burned audio CDs) possible, CD-ROMs with pirated movies were traded in schoolyards, and Napster started the online file-sharing revolution.
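The trade-off is easy to demonstrate with the Pillow imaging library; a small sketch (photo.png is a placeholder for any local image file) that saves the same picture at decreasing JPEG quality settings:

```python
import os
from PIL import Image  # Pillow imaging library

# Placeholder input file; any reasonably sized photo will do.
img = Image.open("photo.png").convert("RGB")

# Save the image at different JPEG quality levels and compare file sizes.
# Lower quality discards detail the eye is unlikely to notice.
for quality in (95, 75, 50, 25):
    out = f"photo_q{quality}.jpg"
    img.save(out, "JPEG", quality=quality)
    print(f"quality={quality}: {os.path.getsize(out)} bytes")
```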
Now we have the LLMs, the large language models, which are an implementation of generative AI: algorithms, combined with a large memory derived from processing huge amounts of content, can now transform texts, images, sounds and even videos into each other. They can act as compressors: you can feed text into an LLM and ask for a summary, but you can also ask it to expand an argument from a few bullet points into a short essay. The inner state of the LLM while it performs these actions represents, in a sense, the essence of the content it is processing. The output format is independent of this state: in the simplest case, you can specify whether the output should be in German or in English; additionally, you can ask for different styles: write for children, write dry legal prose, be witty, or even render the content as a poem. Translating from one medium to another is also possible: the AI can look at a picture and generate a textual description of it, or, vice versa, create a picture out of a written content summary.
I’m pretty sure the following scenario has already happened: an employee is asked to write a report on a certain subject. He thinks about the topic and comes up with a few ideas, which he writes down as a list of bullet points. These are handed to an LLM with an appropriate prompt to generate a nice five-page report detailing these points. The AI obliges, and the resulting five-pager is handed to the boss. Being short on time, he doesn’t want to read five pages, so he asks an LLM to summarize the paper and give him the core message in a list of short statements. Ideally, the second LLM reproduces the same bullet points the employee originally came up with, making the whole exercise a complete waste of computational resources.
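As a sketch of that round trip (the complete() helper below is hypothetical and stands in for whatever LLM API is at hand):

```python
# Expand-then-summarize round trip. complete() is a hypothetical
# placeholder for a call to an actual LLM service.

def complete(prompt: str) -> str:
    """Placeholder: wire this up to your LLM API of choice."""
    raise NotImplementedError

bullets = [
    "Q3 sales up 12%",
    "churn driven by onboarding friction",
    "propose self-service setup wizard",
]

# Employee: inflate the bullet points into a five-page report.
report = complete(
    "Write a detailed five-page report covering these points:\n- "
    + "\n- ".join(bullets)
)

# Boss: deflate the report back into a handful of short statements.
summary = complete(
    "Summarize the core message of this report as short bullet points:\n"
    + report
)

# Ideally summary ≈ bullets, and the round trip added no information.
```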
There are two points in this story which are important to note:
First, if we are liberal with the concept of “lossy compression”, then the specific formulation of an idea in a language doesn’t really matter in terms of information content. If you give an LLM the same prompt time and time again, you will get different results each time. If, for example, you ask for a limerick about a horse in a bar, you will get a different one almost every time. But on a more abstract level, they are all embodiments of the same concept: a limerick about a horse in a bar. The same applies to a switch of languages: if you ask the LLM to change the output from German to English, the result will change substantially. But again: if you just look at the abstract ideas embodied in the text, the language it is written in simply does not matter.
The Bible in Greek, English, or German might have very few words in common, but the content is the same. This is just like converting a picture from GIF to JPEG: the bits in the file have completely changed, but given the right parsers they produce the same information content, with only some fuzziness in the details caused by the JPEG compression.
Secondly, when processing a prompt or analysing a text/image/sound, the LLM produces an activation pattern across the high-dimensional set of parameters that form the scaffolding of its memory, transforming the input into something one might call its “state of mind”. This is the LLM-internal representation of the input, abstracting away the unimportant bits of the incoming information and retaining the meaning. This internal state is opaque to us: we have little insight into which parameter corresponds to which concept. I also don’t know how many bytes this representation needs.
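Sentence embeddings offer a rough, observable stand-in for this idea. They are not the model’s full internal state, but they illustrate the principle that different surface forms of the same idea land close together in a high-dimensional space. A sketch assuming the sentence-transformers library is installed:

```python
from sentence_transformers import SentenceTransformer, util

# A small open embedding model; any sentence embedding model would do.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "A horse walks into a bar and orders a pint.",
    "In a pub, a horse asks the bartender for a beer.",  # same idea, reworded
    "Quarterly sales figures rose by twelve percent.",   # unrelated idea
]

vectors = model.encode(sentences)

# Sentences expressing the same concept should score far higher
# than the unrelated one, regardless of the exact wording.
print(util.cos_sim(vectors[0], vectors[1]))  # high similarity expected
print(util.cos_sim(vectors[0], vectors[2]))  # low similarity expected
```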
Now comes the “generative” part of the AI: the combination of the state, the learned connections between concepts, and the prompt enables the LLM to transform this opaque state of mind into an output that humans can understand. The output can be short, e.g., if the prompt asks for a brief written summary, or longer, if the target format is an essay. Coming back to the example from above: the LLM does not iteratively compress a longer text into a summary by analysing individual sentences; instead, it speed-reads everything into something like short-term memory and then dumps out the highlights it found.
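One way to picture the difference is to contrast a chunk-by-chunk pipeline with the single-pass ingestion just described, reusing the hypothetical complete() helper from the earlier sketch:

```python
# Two summarization strategies, sketched with the hypothetical
# complete() helper from the round-trip example above.

def summarize_iteratively(text: str, chunk_size: int = 2000) -> str:
    """Not what happens inside the model: compress piece by piece,
    then merge the partial summaries."""
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    partial = [complete("Summarize this passage:\n" + c) for c in chunks]
    return complete("Merge these partial summaries:\n" + "\n".join(partial))

def summarize_in_one_pass(text: str) -> str:
    """Closer to the picture above: the whole text enters the context
    at once, and the highlights are generated from that single state."""
    return complete("Summarize the key points of this text:\n" + text)
```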
If a short prompt can produce the same activation pattern as a long input text, then the information content is the same. This only works because the LLM has this huge store of knowledge it can reference, something we said at the beginning that classic compression algorithms cannot utilize. So, as an example, the input “lyrics of the Beatles’ song Yesterday” and the actual lyrics as two dozen lines of text convey the same information to the LLM. This enables truly enormous compression rates.
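A back-of-the-envelope calculation, with an assumed size for the full lyrics, shows the order of magnitude:

```python
# Rough compression ratio for the Yesterday example.
# The lyrics size is an assumption, not a measured value.

prompt = "lyrics of the Beatles' song Yesterday"
prompt_bytes = len(prompt.encode("utf-8"))  # a few dozen bytes

full_lyrics_bytes = 1000  # assumed: ~two dozen lines of text, roughly 1 KB

ratio = full_lyrics_bytes / prompt_bytes
print(f"{prompt_bytes} bytes stand in for ~{full_lyrics_bytes} bytes "
      f"(ratio ~{ratio:.0f}:1)")
```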
To summarize, it might be a helpful abstraction to view LLMs as lossy compression/decompression machines that can utilize an enormous pool of knowledge to make the process much more efficient, as long as you accept that this is a very lossy compression which only preserves the core concepts contained in the input but is free to change the representation of this information content. And, of course, it is prone to making wrong associations and hallucinating content.