Concepts from Operating Systems That Found Their Way in LLMs – Koding Harness: notes on agent-first engineering

This IBM punched card, an early form of data encoding, reminds us that at the core, computers don’t see strings or words but numerical values. A nod to the origins, it’s the OG tokenizer that set the stage for today’s intricate LLMs. Image by Author.

Introduction

Diving into the intricacies of technology often uncovers unexpected parallels. Recently, I’ve been struck by how foundational concepts from operating systems and computer architecture are making waves in the domain of Large Language Models (LLMs), especially the transformer variants. I will spotlight two that I have observed so far: Branch Prediction and Virtual Memory Paging.

Branch Prediction in CPUs

Traditional Use in CPUs

Branch prediction is a technique CPUs use to keep their pipelines busy. When the processor reaches a conditional branch, it guesses which path will be taken instead of waiting for the condition to be resolved. Here’s a breakdown:

Prediction: The processor guesses which set of instructions comes next.
Speculative Execution: Based on that guess, the processor starts executing those instructions before it knows whether the guess is right.
Correction: If the guess is right, the processor keeps the results and carries on. If it’s wrong, it discards the speculative work and executes the other path.
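To make the predict, speculate, correct loop above concrete, here is a minimal sketch of a 2-bit saturating counter, a classic textbook predictor. The choice of predictor, the function name `simulate`, and the example branch history are my assumptions for illustration; real CPUs use far more elaborate schemes.

```python
# Toy 2-bit saturating-counter branch predictor (illustrative assumption,
# not a description of any specific CPU).

def simulate(outcomes, state=2):
    """Return how many branch outcomes were predicted correctly.

    state is a counter in 0..3: 0-1 predict "not taken", 2-3 predict "taken".
    """
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2      # Prediction
        if predicted_taken == taken:      # speculative work would be kept
            correct += 1
        # Correction: nudge the counter toward what actually happened.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct

# A loop branch that is taken 9 times and then falls through once.
loop_branch = [True] * 9 + [False]
hits = simulate(loop_branch)
print(f"{hits}/{len(loop_branch)} predictions correct")
```

Only the final loop exit is mispredicted here, which is why predictable branches cost so little in practice.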

How LLMs Use This Idea

Google DeepMind used a similar idea to make LLMs faster in Accelerating Large Language Model Decoding with Speculative Sampling. Their algorithm uses a smaller draft model to propose several tokens ahead and a larger primary model to validate them. The more often the draft guesses right, the more tokens are produced per call to the large model, which reduces latency.
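A rough sketch of one draft-then-verify round, as I understand the accept/reject rule from the paper: a drafted token is accepted with probability min(1, p/q), and on the first rejection a replacement is drawn from the normalized residual max(0, p - q). The two "models" below are hypothetical fixed distributions over a three-token vocabulary, and all function names are mine, so treat this as an illustration rather than the authors' implementation.

```python
import random

def sample(dist, rng):
    """Draw an index from a discrete distribution (weights need not sum to 1)."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

def speculative_step(prefix, draft_model, target_model, k, rng):
    """Extend prefix by 1..k+1 tokens using one draft-then-verify round."""
    # 1. The draft model proposes k tokens autoregressively.
    drafted, draft_dists = [], []
    ctx = list(prefix)
    for _ in range(k):
        q = draft_model(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        draft_dists.append(q)
        ctx.append(tok)

    # 2. The target model checks each drafted position (a real system
    #    would score all positions in one batched forward pass).
    out = list(prefix)
    for tok, q in zip(drafted, draft_dists):
        p = target_model(out)
        # 3. Accept the drafted token with probability min(1, p/q).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # 4. On rejection, resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out
    # 5. Every draft was accepted: take one bonus token from the target.
    out.append(sample(target_model(out), rng))
    return out

# Hypothetical stand-ins for a cheap draft model and an expensive target model.
def draft_model(ctx):
    return [0.5, 0.3, 0.2]

def target_model(ctx):
    return [0.6, 0.3, 0.1]

rng = random.Random(0)
print(speculative_step([0], draft_model, target_model, k=4, rng=rng))
```

The appeal of this rule is that the accepted tokens follow the target model's distribution, so the speed-up does not change what the large model would have produced on its own.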

Some people speculate that OpenAI might be using speculative decoding for GPT-4 inference. If you’re interested in a visual explanation, an insightful visualization of speculative decoding/assisted generation is worth checking out.

Moreover, there’s an intriguing recent paper, Online Speculative Decoding (OSD), that takes this concept to another level. The essence of OSD is to continually refine (multiple) draft model(s) based on observed user query data, leveraging the spare computational resources in LLM serving clusters. This approach is especially useful when the draft model is not as accurate as the target model, as it can be continually refined to improve its predictive accuracy.

Potential Security Vulnerabilities in LLMs

As neural network models, especially Large Language Models (LLMs), become pivotal in various applications, understanding and addressing their security vulnerabilities is crucial.

Stealing the Decoding Algorithms of Language Models

A key component of generating text from modern LLMs is the selection and tuning of decoding algorithms. These algorithms determine how to generate text from the internal probability distribution generated by the LM. The process of choosing a decoding algorithm and tuning its hyperparameters takes significant time, manual effort, and computation, and it also requires extensive human evaluation. Therefore, the identity and hyperparameters of such decoding algorithms are considered to be extremely valuable to their owners.

In Stealing the Decoding Algorithms of Language Models, the authors show that an adversary with typical API access to an LLM can steal the type and hyperparameters of its decoding algorithms at very low monetary cost.

Spectre and Its Implications for LLMs

Spectre is one of the key transient execution CPU vulnerabilities, which involves timing side-channel attacks affecting modern microprocessors. Speculative execution in these processors can expose private data through observable side effects. Speculative decoding in LLMs is similar in concept, but the mechanism is fundamentally different, so the comparison should not be taken literally. However, neural networks can be susceptible to timing side-channel attacks, especially when LLMs interface with runtime environments capable of executing code, so there is potential for such attacks to be exploited.

The Risk of LLMs Interacting with Code Interpreters

A significant concern arises when LLMs have access to environments capable of running code, such as the Code Interpreter. In such scenarios, vulnerabilities could be exploited to make the LLMs run malicious code, posing even more significant security threats. It’s important to exercise caution and ensure secure barriers when deploying LLMs in such settings.

Safeguarding Model Weights

Weights of machine learning models are often stored in formats such as pickle, which can execute arbitrary code when a file is loaded. An alternative to consider is “Safetensors”, a format designed to store tensor data safely. Besides being the more secure choice, Safetensors is also fast to load.
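To see why pickle is risky, here is a small, harmless demonstration using only the standard library. The class name is hypothetical, and the payload merely evaluates `6 * 7`; the point is that loading the bytes runs a callable chosen by whoever produced the file.

```python
import pickle

class NotReallyWeights:
    """Stand-in for a tampered checkpoint object (hypothetical)."""

    def __reduce__(self):
        # pickle will call eval("6 * 7") when the bytes are loaded.
        # A real attacker would substitute something far less benign.
        return (eval, ("6 * 7",))

payload = pickle.dumps(NotReallyWeights())
loaded = pickle.loads(payload)  # eval() runs during deserialization
print(loaded)                   # whatever eval returned, not a weights object
```

Safetensors avoids this class of problem because, as far as I understand the format, a file holds only a JSON header and raw tensor bytes, so there is nothing to execute on load.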

LLMs: Deep Neural Networks Behind the Scenes

It’s important to remember that LLMs are deep neural networks (DNNs) at their core. This association brings inherent security concerns, as highlighted by these research papers:

Stealing Neural Networks via Timing Side Channels: This paper underscores the susceptibility of Neural Networks to timing side channel attacks and proposes a black-box Neural Network extraction technique.
Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks: This research emphasizes the opaque nature of DNNs, making them prone to backdoor attacks. It introduces methods to detect and counter such hidden threats.

Learning from the Past: Avoiding the Mistakes of Operating Systems

As we explore more about Large Language Models (LLMs) and how they can be used, it’s clear that they share some similarities with operating systems. Just as operating systems and the hardware beneath them had certain issues and vulnerabilities, LLMs could have them too. It’s important for us to learn from those past mistakes to make sure we don’t run into similar problems, like the security issues seen with Spectre, a vulnerability found in many modern microprocessors.

Virtual Memory Paging

Traditional Use in Computers

Virtual memory is an abstraction provided by the operating system that makes it seem to an application as if it has access to more RAM than is physically available. When the actual RAM gets filled up, the operating system uses a portion of the computer’s storage space (typically the hard drive) as an extension of RAM. This process enables the computer to handle more tasks concurrently by mapping the application’s memory addresses to actual physical locations, which could be in RAM, on the hard disk, or even other storage mediums.

In essence, virtual memory gives applications the illusion they’re utilizing a large, contiguous chunk of RAM, even though the reality behind the scenes might be quite different.
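The mapping from an application's addresses to physical locations can be sketched in a few lines. The page size, the page table contents, and the `translate` helper are assumptions for illustration; a real OS and MMU do this in hardware with multi-level tables.

```python
PAGE_SIZE = 4096  # bytes per page (a common size, assumed here)

def translate(virtual_addr, page_table):
    """Map a virtual address to (location, physical address)."""
    page_number, offset = divmod(virtual_addr, PAGE_SIZE)
    location, frame = page_table[page_number]
    return location, frame * PAGE_SIZE + offset

# Hypothetical page table: virtual page -> (where it lives, frame or slot number).
page_table = {0: ("ram", 7), 1: ("disk", 3), 2: ("ram", 2)}

print(translate(10, page_table))    # lives in RAM
print(translate(4100, page_table))  # lives on disk: would trigger a page fault
```

The application only ever sees the contiguous virtual addresses on the left; where each page actually lives is the operating system's business.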

How LLMs Use This Idea

Transformers, especially LLMs, feature a mechanism called the “KV cache.” Much like RAM, it holds the attention keys and values of tokens that have already been processed, so they can be reused quickly instead of being recomputed for every new token. To efficiently handle longer sequences that don’t fit in memory, LLMs could adopt techniques inspired by virtual memory paging.

vLLM: virtual paging for KV cache

Researchers from UC Berkeley introduced this idea in a paper called Efficient Memory Management for Large Language Model Serving with PagedAttention. The serving system built on it is called vLLM.

The heart of vLLM is PagedAttention. It changes how the KV cache used by attention is laid out in memory, borrowing the paging idea from operating systems: the cache is split into fixed-size blocks that do not have to be contiguous. Remarkably, without changing the original model, PagedAttention allows batching up to 5x more sequences. This means better use of GPU resources and faster operations.
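The paging idea can be illustrated with a toy block allocator for the KV cache. This is not vLLM's code or API: the class `PagedKVCache`, the block size, and the method names are hypothetical, and the sketch tracks only which block each token would occupy, not the tensors themselves.

```python
BLOCK_SIZE = 4  # tokens per KV block (hypothetical; kept small for the demo)

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}                      # seq id -> [block ids]
        self.lengths = {}                           # seq id -> tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token; return (block id, slot)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:                # no block yet, or last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1
        return table[-1], length % BLOCK_SIZE

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4)
for _ in range(5):
    cache.append_token("a")   # 5 tokens need 2 blocks
cache.append_token("b")       # another sequence takes the next free block
print(cache.block_tables)
cache.free("a")
print(cache.free_blocks)
```

Because blocks are handed out on demand, a sequence wastes at most one partially filled block, which is what leaves room for more sequences in the same batch.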

Here’s also a quick breakdown of some crucial state-of-the-art LLM serving techniques as of October 2023:

Continuous Batching: Increases throughput by allowing requests to immediately jump onto an ongoing GPU batch, minimizing wait time.
PagedAttention: Much like OS’s virtual paging but tailored for KV cache in LLMs, allowing 3x more simultaneous requests and thereby tripling throughput.
Speculative Decoding: Uses a smaller draft model to make initial guesses and a larger primary model to validate them. If the draft often guesses right, operations become faster, reducing latency, as described in the previous section.
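As a rough illustration of the first technique in the list above, the following toy simulation counts decoding iterations for static batching versus continuous batching. The request lengths and both function names are made up for the example, and one "step" stands for one decoding iteration of the whole batch.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths, batch_size):
    """A waiting request takes over a slot as soon as one frees up."""
    waiting = deque(lengths)
    running = []      # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1    # one decoding iteration for the whole batch
        running = [r - 1 for r in running if r > 1]
    return steps

# Hypothetical output lengths (in tokens) for four requests, two GPU slots.
lengths = [8, 2, 2, 2]
print("static:", static_batching_steps(lengths, batch_size=2))
print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

With static batching the short requests wait behind the long one; with continuous batching they slip into the freed slot and the whole workload finishes when the longest request does.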

MemGPT: “Virtually” Extending LLM’s Limited Context Windows

MemGPT: Towards LLMs as Operating Systems, also from UC Berkeley, is a new way to help LLMs like GPT-4 remember more information. Think of it as adding an extra brain to the LLM. This extra brain has two parts:

Internal Memory (LLM RAM): A small space where the LLM keeps important information.
External Memory (LLM HDD): A much larger space where the LLM can store and retrieve data when needed.

When the LLM needs data from the external memory, it breaks it into smaller pieces that fit into the internal memory. This lets LLMs handle big tasks that need lots of information.
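A toy version of this two-tier arrangement might look like the sketch below. It is not MemGPT's API: the `TieredMemory` class, its method names, and the keyword search are placeholders I made up, and here the caller decides when to recall, whereas in a real system that decision would come from the model.

```python
from collections import deque

class TieredMemory:
    """Toy two-tier memory: bounded main context plus unbounded archive."""

    def __init__(self, context_limit):
        self.context_limit = context_limit
        self.main_context = deque()  # what the LLM would see in its prompt
        self.archive = []            # external storage, searched on demand

    def add(self, message):
        self.main_context.append(message)
        # Evict the oldest messages to the archive when the context is full.
        while len(self.main_context) > self.context_limit:
            self.archive.append(self.main_context.popleft())

    def recall(self, keyword, max_results=1):
        """Page matching archived messages back into the main context."""
        hits = [m for m in self.archive if keyword.lower() in m.lower()]
        hits = hits[:max_results]
        for message in hits:
            self.archive.remove(message)
            self.add(message)
        return hits

mem = TieredMemory(context_limit=2)
for msg in ["User's name is Ada", "User likes tea", "User asked about paging"]:
    mem.add(msg)
print(mem.recall("name"))        # pulled back in from the archive
print(list(mem.main_context))    # the oldest remaining message was evicted
```

Pulling one item back in pushes another out, which is the same trade-off an operating system makes when it swaps pages between RAM and disk.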

MemGPT makes it easier to use LLMs for tasks that need a lot of memory. With this tool, we don’t have to worry about the LLM running out of space.

In the realm of LLMs, context and memory are kind of like the foundational “RAM” of our era. Andrej Karpathy has already made this comparison:

The analogy between GPTs of today to the CPUs of early days of computing are interesting. GPT is a funny kind of programmable text computer. Have to think through it more 🤔 but e.g.:

## Memory
GPT-4 RAM is ~log2(50K vocab size)*(32K context length)/(8 bits/byte) ~= 64kB,…
— Andrej Karpathy (@karpathy) April 7, 2023
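The estimate in the tweet is easy to reproduce. The only assumption in the snippet below is reading "32K" as 32,768 tokens; the variable names are mine.

```python
import math

vocab_size = 50_000          # "50K vocab size" from the tweet
context_length = 32 * 1024   # "32K context length", read here as 32,768 tokens

bits_per_token = math.log2(vocab_size)             # information per token
total_bytes = bits_per_token * context_length / 8  # 8 bits per byte
print(f"{bits_per_token:.1f} bits/token, {total_bytes / 1000:.1f} kB")
```

Each token carries a little under 16 bits, so a full context window works out to roughly 64 kB, in line with the figure quoted.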

Conclusion

Ideas and strategies often flow between different fields of tech, leading to innovations. In this case, traditional computer systems concepts are helping to improve transformer-based LLMs. This was a brief share of my learnings, and I genuinely invite and appreciate feedback, insights, and further discussions on this topic.

Top recommendation

If you want to read a more comprehensive write-up on this topic, I highly recommend [At the Intersection of LLMs and Kernels - Research Roundup](https://charlesfrye.github.io/programming/2023/11/10/llms-systems.html) by Charles Frye.