Running a large language model (LLM), namely Llama2, locally on my Mac was the next logical step for me while working through the hacker’s guide by Jeremy Howard. While it was possible to adjust Jeremy’s Hugging Face approach to also work on Apple Silicon, I focused on llama.cpp and its Python binding llama-cpp-python to talk to Llama2.
The whole journey consisted of the following steps, and I am going to take you through all of them to share my learnings along the way:
- Getting access to / downloading llama2
- Running Llama2 via Hugging Face and understanding why this is not a good approach on a Mac
- Running Llama2 via llama.cpp/llama-cpp-python and understanding why this approach works a lot better on a Mac
Getting Access to Llama2
First things first: Before you can access the Llama2 model, you need to agree to Meta’s terms and conditions for Llama2. At the time of writing, the process was as follows:
- Visit the model’s home page at Hugging Face
- Go to Meta’s website, and complete the registration form
- Confirm the terms and conditions on the Hugging Face Website (see screenshot)
The approval only took a couple of minutes.
Running Llama2 via Hugging Face
Trying to stick as closely as possible to the original hacker’s guide, I wanted to run Llama2 locally on my Mac using the Hugging Face API, just to see if it worked. Without Nvidia support, I needed to adapt the code to make it compatible with Apple’s Metal framework. For all the details on what needed to be done to run Llama2 via the Hugging Face API, please check out this notebook.
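As an illustration, here is a minimal sketch of what loading Llama2 through the Hugging Face transformers API on Apple Silicon can look like. The model id and the exact arguments are assumptions on my part; the notebook linked above contains the actual adaptation.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumed model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load in half precision and move the model to Apple's Metal (MPS) backend
# instead of CUDA; 8-bit quantization stays disabled (load_in_8bit=False).
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model = model.to("mps")

inputs = tokenizer("Name the planets in the solar system.", return_tensors="pt").to("mps")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))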
The final result was academically interesting, but performance left much to be desired 😉: What Jeremy’s machine did in 2 seconds took my MacBook more than 3 minutes. There are probably a couple of reasons behind this dramatic difference in performance:
- Nvidia memory throughput is a lot better than Apple’s unified memory.
- The model I used was originally optimized and quantized for Nvidia GPUs. To run this model on my MacBook, I had to disable the 8-bit quantization (load_in_8bit=False) among other changes. While this adaptation was necessary for compatibility with Apple Silicon, it discarded all the optimizations.
- PyTorch’s optimization for CUDA is probably still way better than its MPS optimization.
Here is a key learning: Running large language models (LLMs) locally requires more than brute force. Instead, hardware and software need to be aligned. Apple Silicon machines are extremely capable, but they need a different kind of optimization than Nvidia hardware. Consider the following analogy: Imagine you need to travel from Hamburg to Munich, and you have two hardware setups available, a car (let’s say this represents Nvidia hardware) or a plane (let’s say this represents Apple Silicon). Both setups require different optimizations to get from A to B.
Driving from Hamburg to Munich by car (representing Nvidia hardware), you optimize your path along the roads. If you used the plane instead (representing Apple Silicon), the same optimization would not work well. Attempting to navigate the plane on the roads, as you would a car, is highly impractical. Therefore, you would optimize the path differently: You take public transport or a taxi to the airport, you fly from Hamburg to Munich, and again, you take public transport or a taxi to reach your final destination. With both hardware setups you have reached Munich, but the underlying setup and optimizations differed significantly.
Therefore, let’s hop on the plane and explore a different way to run Llama2 on a Mac: Let’s turn our attention to llama.cpp.
What is llama.cpp?
Llama.cpp is an optimized library to run a large language model (LLM) like Llama2 on a Mac, but it also supports other platforms. How is this possible? For the details, let me refer you to this tweet by Andrej Karpathy and, for even more details, to this blog post by Finbarr Timbers. Here are my takeaways:
- Llama.cpp runs inference of LLMs in pure C/C++; therefore, it is significantly faster than implementations in higher-level languages like Python.
- Additionally, the mission of the project “is to run the LLaMA model using 4-bit integer quantization on a MacBook”. This means that the numbers used to represent model weights and activations are downsized from 32- or 16-bit floating points (the format of the base models) to 4-bit integers. This reduces memory usage and improves the performance and efficiency of the model during inference. The somewhat surprising thing is that model performance does not degrade by this downsizing.
When I mentioned before that I had to turn off quantization for the Hugging Face approach, here we turn it back on again, just differently.
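To get a feeling for what 4-bit quantization buys you, here is a rough back-of-the-envelope calculation (a sketch that only counts the weights and ignores overhead such as activations, the KV cache, and per-block quantization metadata):

# Rough memory estimate for a 7B-parameter model at different precisions.
# This only counts the weights themselves.
n_params = 7_000_000_000

for name, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    gigabytes = n_params * bits / 8 / 1e9
    print(f"{name}: ~{gigabytes:.1f} GB")

# fp32: ~28.0 GB
# fp16: ~14.0 GB
# 4-bit: ~3.5 GB

This is why the 4-bit GGUF file we download below is only a few gigabytes and fits comfortably into the unified memory of a MacBook.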
How You Can Use llama.cpp from Python
The project llama-cpp-python serves as a binding for llama.cpp, providing access to llama.cpp’s C/C++ API from Python.
In this context, a “binding” is a bridge between two programming languages, i.e. a layer of code that allows them to interact with each other. Llama.cpp is written in C/C++, and the llama-cpp-python binding allows this C/C++ library to be utilized within a Python environment. Essentially, the Python code wraps around the C/C++ code so that it can be called from Python.
While it might sound complicated, the concept is surprisingly accessible when you reduce the context to a simple example. To keep the focus in this blog post, I separated the exploration of C bindings into this blog post (LINK).
Installing llama-cpp-python
First, we need to install llama-cpp-python via pip install llama-cpp-python. Upgrading is done via pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir.
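On Apple Silicon you will typically also want the Metal backend enabled so that inference runs on the GPU. At the time of writing, the llama-cpp-python README documented a build flag along the following lines; treat the exact variable name as an assumption and double-check the current README:

!CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir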
💡 Note: To execute the steps interactively, please check out my related notebook.
Downloading the Model
For all my experiments, I used the following model: TheBloke/Llama-2-7b-Chat-GGUF
To download the model, please run the code below, assuming that you have stored your Hugging Face access token in the .env file. For additional insights/troubleshooting, please also check out my previous blog post / my previous notebook:
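from dotenv import load_dotenv
import os

# Read the Hugging Face access token from the .env file and log in
load_dotenv()
token = os.getenv('HF_TOKEN')
os.environ['HF_TOKEN'] = token
!huggingface-cli login --token $HF_TOKEN

# Download the 4-bit quantized chat model into ../models
!wget -P ../models https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf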
Loading the Model
Loading the model only requires two lines of code. Let’s talk about the parameters:
- n_ctx=2048: This sets the context window to 2048 tokens. The maximum number of tokens for this model is 4096.
- verbose=False: This makes the model less talkative. It only prints the actual results when prompted. Please feel free to try turning it to True to get additional information from the model, not just the generated response.
from llama_cpp import Llama
llm = Llama(model_path="../models/Llama-2-7b-chat/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048, verbose=False)
#llm = Llama(model_path="../../../lm-hackers/models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048, verbose=False)
Completion vs. Chat Completion Example
There are two ways we can talk to the LLM: The completion method literally does what it promises: it completes a prompt. To have a conversation with the LLM, we need to use chat completion.
As per the Getting Started guide, let’s look at one example of each to see how to use the API:
Let’s do text completion first.
output = llm("Q: Name the planets in the solar system? A: ", max_tokens=128, stop=["Q:", "\n"], echo=True)
print(output['choices'][0]['text'])
Q: Name the planets in the solar system? A: 1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
For the chat completion, let’s re-write the code to reproduce the example from the hacker’s guide and make the LLM talk about money in Aussie slang.
= "You are an Aussie LLM that uses Aussie slang and analogies whenever possible."
aussie_sys
=[
messages"role": "system", "content": aussie_sys},
{"role": "user", "content": "What is money?"}]
{
= llm.create_chat_completion(messages = messages, stream=False)
model_response print(model_response['choices'][0]['message']['content'])
Fair dinkum, mate! Money, eh? It's like the oxygen we breathe, ya know? (laughs) Just kiddin', but seriously, money is like the lifeblood of society. It's what keeps the economy tickin' over and allows us to buy the things we need and want.
Think of it like this: money is like a big ol' pile of dough (get it? Dough? Like bread dough? Ah, never mind). We all gotta work hard to earn that dough, whether it's through our day job or by startin' our own business. And then we can use that dough to buy things like food, shelter, and a cold one at the pub after work.
But here's the thing: money ain't everything, mate. There's more to life than just makin' dough. We gotta find meaning and purpose in our lives, or else we'll be livin' like a dog (sorry, dogs!). So, while money's important, it's not the only thing that matters.
Now, I know some blokes might say, "Money, money, money! That's all that matters!" But let me tell you, mate, they're barkin' up the wrong tree (get it? Barkin' up the wrong tree? Ah, never mind). There's more to life than just chasin' after the green.
So there you have it, mate! Money's like a big ol' pile of dough that we all gotta work hard to earn. But don't forget, there's more to life than just makin' dough. Keep on keepin' on, and always remember: money may not buy happiness, but it can buy a cold one at the pub after work! (laughs)
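If you prefer to watch the answer appear token by token instead of waiting for the whole response, create_chat_completion also accepts stream=True. Here is a minimal sketch; it assumes the OpenAI-style chunk format that llama-cpp-python mimics, so double-check against the library’s documentation:

# Stream the answer token by token instead of waiting for the full response.
stream = llm.create_chat_completion(messages=messages, stream=True)
for chunk in stream:
    delta = chunk['choices'][0]['delta']
    # The first chunk typically only carries the role; later chunks carry content.
    if 'content' in delta:
        print(delta['content'], end="", flush=True)
print()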
Conclusion
With the right approach, running an LLM, in this case Llama2, in a Jupyter notebook on a Mac is not really difficult. Once you have sorted out the setup (like terms and conditions), starting up Llama2 via llama-cpp-python only requires a few lines of code. Happy chatting!