Running a large language model (LLM), namely Llama2, locally on my Mac was the next logical step for me while working through the hacker’s guide by Jeremy Howard. While it was possible to adjust Jeremy’s Hugging Face approach to also work on Apple Silicon, I focused on llama.cpp and its Python binding llama-cpp-python to talk to Llama2.
The whole journey consisted of the following steps, and I am going to take you through all of them to share my learnings along the way:
- Getting access to / downloading llama2
- Running Llama2 via Hugging Face and understanding why this is not a good approach on a Mac
- Running Llama2 via llama.cpp/llama-cpp-python and understanding why this approach works a lot better on a Mac
Getting Access to Llama2
First things first: Before you can access the Llama2 model, you need to agree to Meta’s terms and conditions for Llama2. At the time of writing, the process was as follows:
- Visit the model’s home page at Hugging Face
- Go to Meta’s website, and complete the registration form
- Confirm the terms and conditions on the Hugging Face Website (see screenshot)
The approval only took a couple of minutes.
Running Llama2 via Hugging Face
Trying to stick as closely as possible to the original hacker’s guide, I wanted to run Llama2 locally on my Mac using the Hugging Face API, just to see if it worked. Without Nvidia support, I needed to adapt the code to make it compatible with Apple’s Metal framework. For all the details on what needed to be done to run Llama2 via the Hugging Face API, please check out this notebook.
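As an illustration, here is a minimal sketch of what loading Llama2 through the Hugging Face transformers API on Apple Silicon can look like. The model id and the exact arguments are assumptions on my part; the notebook linked above contains the actual adaptation.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumed model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load in half precision and move the model to Apple's Metal (MPS) backend
# instead of CUDA; 8-bit quantization stays disabled (load_in_8bit=False).
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model = model.to("mps")

inputs = tokenizer("Name the planets in the solar system.", return_tensors="pt").to("mps")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))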
The final result was academically interesting, but performance left much to be desired 😉: What Jeremy’s machine did in 2 seconds took my MacBook more than 3 minutes. There are probably a couple of reasons behind this dramatic difference in performance:
- Nvidia memory throughput is a lot better than Apple’s unified memory.
- The model I used was originally optimized and quantized for Nvidia GPUs. To run this model on my MacBook, I had to disable the 8-bit quantization (load_in_8bit=False) among other changes. While this adaptation was necessary for compatibility with Apple Silicon, it discarded all the optimizations.
- PyTorch’s optimization for CUDA is probably still way better than its MPS optimization.
Here is a key learning: Running large language models (LLMs) locally requires more than brute force. Instead, hardware and software need to be aligned. Apple Silicon machines are extremely capable, but they need a different kind of optimization than Nvidia hardware. Consider the following analogy: Imagine you need to travel from Hamburg to Munich, and you have two hardware setups available, a car (let’s say this represents Nvidia hardware) or a plane (let’s say this represents Apple Silicon). Both setups require different optimizations to get from A to B.
Driving from Hamburg to Munich by car (representing Nvidia hardware), you optimize your path along the roads. If you used the plane instead (representing Apple Silicon), the same optimization would not work well. Attempting to navigate the plane on the roads, as you would a car, is highly impractical. Therefore, you would optimize the path differently: You take public transport or a taxi to the airport, you fly from Hamburg to Munich, and again, you take public transport or a taxi to reach your final destination. With both hardware setups you have reached Munich, but the underlying setup and optimizations differed significantly.
Therefore, let’s hop on the plane and explore a different way to run Llama2 on a Mac: Let’s turn our attention to llama.cpp.
What is llama.cpp?
Llama.cpp is an optimized library to run a large language model (LLM) like Llama2 on a Mac, but it also supports other platforms. How is this possible? For the details, let me refer you to this tweet by Andrej Karpathy and, for even more details, to this blog post by Finbarr Timbers. Here are my takeaways:
- Llama.cpp runs inference of LLMs in pure C/C++; therefore, it is significantly faster than implementations in higher-level languages like Python.
- Additionally, the mission of the project “is to run the LLaMA model using 4-bit integer quantization on a MacBook”. This means that the numbers used to represent model weights and activations are downsized from 32- or 16-bit floating points (the format of the base models) to 4-bit integers. This reduces memory usage and improves the performance and efficiency of the model during inference. The somewhat surprising thing is that model performance does not degrade by this downsizing.
When I mentioned before that I had to turn off quantization for the Hugging Face approach, here we turn it back on again, just differently.
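To get a feeling for what 4-bit quantization buys you, here is a rough back-of-the-envelope calculation (a sketch that only counts the weights and ignores overhead such as activations, the KV cache, and per-block quantization metadata):

# Rough memory estimate for a 7B-parameter model at different precisions.
# This only counts the weights themselves.
n_params = 7_000_000_000

for name, bits in [("fp32", 32), ("fp16", 16), ("4-bit", 4)]:
    gigabytes = n_params * bits / 8 / 1e9
    print(f"{name}: ~{gigabytes:.1f} GB")

# fp32: ~28.0 GB
# fp16: ~14.0 GB
# 4-bit: ~3.5 GB

This is why the 4-bit GGUF file we download below is only a few gigabytes and fits comfortably into the unified memory of a MacBook.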
How You Can Use llama.cpp from Python
The project llama-cpp-python serves as a binding for llama.cpp, providing access to llama.cpp’s C/C++ API from Python.
In this context, a “binding” is a bridge between two programming languages, i.e. a layer of code that allows them to interact with each other. Llama.cpp is written in C/C++, and the llama-cpp-python binding allows this C/C++ library to be utilized within a Python environment. Essentially, the Python code wraps around the C/C++ code so that it can be called from Python.
While it might sound complicated, the concept is surprisingly accessible when you reduce the context to a simple example. To keep the focus in this blog post, I separated the exploration of C bindings into this blog post (LINK).
Installing llama-cpp-python
First, we need to install llama-cpp-python via pip install llama-cpp-python. Upgrading is done via pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir.
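On Apple Silicon you will typically also want the Metal backend enabled so that inference runs on the GPU. At the time of writing, the llama-cpp-python README documented a build flag along the following lines; treat the exact variable name as an assumption and double-check the current README:

!CMAKE_ARGS="-DLLAMA_METAL=on" pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir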
💡 Note: To execute the steps interactively, please check out my related notebook.
Downloading the Model
For all my experiments, I used the following model: TheBloke/Llama-2-7b-Chat-GGUF
To download the model, please run the code below, assuming that you have stored your Hugging Face access token in the .env file. For additional insights/troubleshooting, please also check out my previous blog post / my previous notebook:
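from dotenv import load_dotenv
import os

# Read the Hugging Face access token from the .env file and log in
load_dotenv()
token = os.getenv('HF_TOKEN')
os.environ['HF_TOKEN'] = token
!huggingface-cli login --token $HF_TOKEN

# Download the 4-bit quantized chat model into ../models
!wget -P ../models https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf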
Loading the Model
Loading the model only requires two lines of code. Let’s talk about the parameters:
- n_ctx=2048: This sets the context window to 2048 tokens. The maximum number of tokens for this model is 4096.
- verbose=False: This makes the model less talkative. It only prints the actual results when prompted. Please feel free to try turning it to True to get additional information from the model, not just the generated response.
from llama_cpp import Llama
llm = Llama(model_path="../models/Llama-2-7b-chat/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048, verbose=False)
#llm = Llama(model_path="../../../lm-hackers/models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048, verbose=False)
Completion vs. Chat Completion Example
There are two ways we can talk to the LLM: The completion method literally does what it promises: it completes a prompt. To have a conversation with the LLM, we need to use chat completion.
As per the Getting Started guide, let’s look at one example of each to see how to use the API:
Let’s do text completion first.
output = llm("Q: Name the planets in the solar system? A: ", max_tokens=128, stop=["Q:", "\n"], echo=True)
print(output['choices'][0]['text'])
Q: Name the planets in the solar system? A: 1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune
For the chat completion, let’s re-write the code to reproduce the example from the hacker’s guide and make the LLM talk about money in Aussie slang.
= "You are an Aussie LLM that uses Aussie slang and analogies whenever possible."
aussie_sys
=[
messages"role": "system", "content": aussie_sys},
{"role": "user", "content": "What is money?"}]
{
= llm.create_chat_completion(messages = messages, stream=False)
model_response print(model_response['choices'][0]['message']['content'])
Fair dinkum, mate! Money, eh? It's like the oxygen we breathe, ya know? (laughs) Just kiddin', but seriously, money is like the lifeblood of society. It's what keeps the economy tickin' over and allows us to buy the things we need and want.
Think of it like this: money is like a big ol' pile of dough (get it? Dough? Like bread dough? Ah, never mind). We all gotta work hard to earn that dough, whether it's through our day job or by startin' our own business. And then we can use that dough to buy things like food, shelter, and a cold one at the pub after work.
But here's the thing: money ain't everything, mate. There's more to life than just makin' dough. We gotta find meaning and purpose in our lives, or else we'll be livin' like a dog (sorry, dogs!). So, while money's important, it's not the only thing that matters.
Now, I know some blokes might say, "Money, money, money! That's all that matters!" But let me tell you, mate, they're barkin' up the wrong tree (get it? Barkin' up the wrong tree? Ah, never mind). There's more to life than just chasin' after the green.
So there you have it, mate! Money's like a big ol' pile of dough that we all gotta work hard to earn. But don't forget, there's more to life than just makin' dough. Keep on keepin' on, and always remember: money may not buy happiness, but it can buy a cold one at the pub after work! (laughs)
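If you prefer to watch the answer appear token by token instead of waiting for the whole response, create_chat_completion also accepts stream=True. Here is a minimal sketch; it assumes the OpenAI-style chunk format that llama-cpp-python mimics, so double-check against the library’s documentation:

# Stream the answer token by token instead of waiting for the full response.
stream = llm.create_chat_completion(messages=messages, stream=True)
for chunk in stream:
    delta = chunk['choices'][0]['delta']
    # The first chunk typically only carries the role; later chunks carry content.
    if 'content' in delta:
        print(delta['content'], end="", flush=True)
print()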
Conclusion
With the right approach, running an LLM, in this case Llama2, in a Jupyter notebook on a Mac is not really difficult. Once you have sorted out the setup (like terms and conditions), starting up Llama2 via llama-cpp-python only requires a few lines of code. Happy chatting!