
LLMs as compression algorithms

Back when I was studying computer science, one of the interesting bits was the discussion of the information content of a message, which is distinct from the actual number of bits used to transmit that message. The definition I remember is Shannon entropy: a sum over the possible symbols of their long-term frequencies multiplied by the logarithm of those frequencies. The upshot was that a transmitted bit only carries a full bit's worth of information if 0s and 1s are equally likely.
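To make that concrete, here is a minimal Python sketch of the entropy calculation; the bit strings are made-up examples, not anything from a real channel:

```python
from collections import Counter
from math import log2

def shannon_entropy(message: str) -> float:
    """Average information content per symbol, in bits."""
    counts = Counter(message)
    total = len(message)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# A perfectly balanced bit string carries one full bit per transmitted bit...
print(shannon_entropy("01" * 32))           # 1.0
# ...while a heavily skewed one carries far less.
print(shannon_entropy("0" * 60 + "1" * 4))  # ~0.34
```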

The next iteration was compressibility: if there are patterns in the message, then a compression algorithm can reduce the number of bits needed to store the full message, so the information content of the original text does not equal its number of bits. This could be a simple Huffman encoding or a more advanced algorithm like Lempel-Ziv-Welch, but one of the main points here is that the algorithm is completely content-agnostic. There are no databases of English words inside these compressors; they cannot substitute numerical IDs of words for the words themselves. That would be considered cheating in the generic compression game. (There are, of course, some instances of very domain-specific compression algorithms which do build on knowledge of the data likely to be transmitted. HTTP/2 and SIP header compression are such examples.)
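A quick way to see content-agnostic compression at work is to feed the same-sized structured and structureless inputs to a general-purpose compressor. A small sketch using Python's zlib (DEFLATE, i.e. LZ77-style matching plus Huffman coding); the exact output sizes will vary, but the contrast is the point:

```python
import os
import zlib

# The compressor knows nothing about English or foxes; it only finds patterns.
repetitive = b"the quick brown fox " * 500  # 10,000 bytes full of repetition
random_ish = os.urandom(10_000)             # 10,000 bytes with no structure

print(len(zlib.compress(repetitive)))  # a few dozen bytes
print(len(zlib.compress(random_ish)))  # slightly more than 10,000 bytes
```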

Another interesting development was the introduction of lossy compression. For certain applications (e.g., images, sound, video) it is not necessary to reproduce the original file bit by bit, only to generate something that looks or sounds very similar to the original media. This unlocked a huge potential for efficient compression. JPEG for images, MP3 for music and DivX for movies reached a broad audience by shrinking these files to manageable sizes. They made digital mixtapes (i.e., self-burned audio CDs) possible, CD-ROMs with pirated movies were traded in schoolyards, and Napster started the online file-sharing revolution.
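To make the "close enough" idea concrete, here is a toy sketch (not any real codec) that simply throws away the low bits of some hypothetical 8-bit samples. Real codecs like JPEG or MP3 use far more sophisticated perceptual models, but the trade-off is the same: half the storage, and a reconstruction that is only an approximation of the original.

```python
# Quantize 8-bit samples down to 4 bits, then reconstruct them.
samples   = [12, 200, 37, 255, 90, 173]           # made-up 8-bit values
quantized = [s >> 4 for s in samples]             # keep only the top 4 bits
restored  = [(q << 4) | 0x08 for q in quantized]  # rebuild mid-bucket estimates

print(quantized)  # [0, 12, 2, 15, 5, 10]
print(restored)   # [8, 200, 40, 248, 88, 168] -- close, but not bit-identical
```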