Running with transformers
If you want an even simpler setup without any coding and only want an AI assistant on your computer, go to:
My Grandma can follow this tutorial to use LLama with a single hand
Model Download
The Hugging Face version, backed by a large community and more tutorials, is more polished and easier to use. If you want the original version, go to the Meta repository and run the download.sh script.
You can pull the Hugging Face version with either of these two commands:
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --exclude "original/*" --local-dir Meta-Llama-3-8B-Instruct
git clone --progress https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct
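The same download can also be scripted with the huggingface_hub Python API (a minimal sketch using the same repo id; huggingface_hub is installed in the next section):
# Minimal sketch: download the Hugging Face version from Python.
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
    local_dir="Meta-Llama-3-8B-Instruct",
    ignore_patterns=["original/*"],  # skip the original consolidated checkpoint
)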
If you are blocked from these download methods, get the model files directly from your browser instead. I use IDM and the download speed is acceptable for me.
Conda Install
You can also do this in a venv. Pay attention to the install channel.
Use the latest versions of the libraries; if yours are outdated, update conda and pip first.
huggingface_hub is tested on Python 3.8+.
conda install -c conda-forge huggingface_hub
conda install -c conda-forge transformers
conda install -c conda-forge accelerate
pip install bitsandbytes
Also, bitsandbytes may not be compatible with your CUDA version, so check it first. Mine is CUDA 11.7 on a Linux server with the latest version of bitsandbytes.
These libraries are updated frequently, so you will probably run into version compatibility errors at some point.
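Before loading the model, a quick sanity check (a sketch, nothing official) helps confirm which CUDA build PyTorch sees and that bitsandbytes imports cleanly:
# Sanity check: print the CUDA build PyTorch was compiled against
# and make sure bitsandbytes imports without errors.
import importlib.metadata
import torch
import bitsandbytes  # raises if the build is incompatible with your setup
print(torch.version.cuda)         # e.g. 11.7
print(torch.cuda.is_available())  # True if a GPU is visible
print(importlib.metadata.version("bitsandbytes"))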
You can also use the pipeline API to run the model, which is a higher-level approach than loading the model and tokenizer yourself. However, the lower-level transformers API provides more flexibility and lets you customize the steps when running the model.
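Here is a minimal pipeline sketch, assuming a recent transformers release that accepts chat-style messages and the same local model directory used in the script further below:
# Minimal pipeline sketch (assumes a recent transformers release that accepts
# chat-style messages and the local model directory used further below).
import torch
import transformers
pipe = transformers.pipeline(
    "text-generation",
    model="llama3-8b-instruct",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
messages = [
    {"role": "system", "content": "You are a knowledgeable AI assistant"},
    {"role": "user", "content": "Give me three things that people rarely know."},
]
outputs = pipe(messages, max_new_tokens=256, do_sample=True, temperature=0.6, top_p=0.9)
print(outputs[0]["generated_text"][-1])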
Note: the main script below pins the model to a single GPU, because 8B-Instruct is actually not that big a model and splitting it across multiple GPUs even slows inference down (from 2s to 19s in my case).
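Two common ways to pin a single GPU (a sketch; the environment variable must be set before any CUDA work happens):
# Option 1: hide every GPU except one from PyTorch (set before any CUDA call).
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
# Option 2: ask accelerate to place the whole model on GPU 0.
# model = AutoModelForCausalLM.from_pretrained(model_path, quantization_config=nf4_config, device_map={"": 0})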
Llama 3 keeps talking to itself until the max token limit is reached
Go to tokenizer_config.json.
Locate the entry around line 2055 and change it to "eos_token": "<|eot_id|>", and be sure to double-check your chat template. Mine was somehow missing from the file, so here is my modified template:
"bos_token": "<|begin_of_text|>",
"clean_up_tokenization_spaces": true,
"chat_template": "{% set loop_messages = messages %}
{% for message in loop_messages %}
{% set content = '<|start_header_id|>' + message['role'] + ': ' +'<|end_header_id|>\n\n'+ message['content'] | trim + '\n' + '<|eot_id|>' %}
{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}
content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>LLama3{% endif %}",
"eos_token": "<|eot_id|>",
"model_input_names": [
"input_ids",
"attention_mask"
],
"model_max_length": 1000000000000000019884624838656,
"tokenizer_class": "PreTrainedTokenizerFast"
To use bitsandbytes, you must pass a BitsAndBytesConfig object to configure the quantization settings.
Here is my code:
import transformers
import time
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
import ipdb
import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "1"
#NF4 (normalized float 4 (default))
# you can customize your quantization settings
nf4_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=torch.bfloat16
)
# local file path to your model
model_path = "llama3-8b-instruct"
messages = [
{"role": "system", "content": "You are a knowledgeable AI assistant"},
{"role": "user", "content": "Give me three things that people rarely know."}
]
#set up tokenizer for your model
tokenizer = AutoTokenizer.from_pretrained(model_path)
#load model
model = AutoModelForCausalLM.from_pretrained(model_path, quantization_config=nf4_config, device_map="auto")
# applying the chat template
input_ids = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
return_tensors="pt"
).to(model.device)
# terminator token ids used to stop generation
terminators = [
tokenizer.eos_token_id,
tokenizer.convert_tokens_to_ids("<|eot_id|>"),
tokenizer.convert_tokens_to_ids("<|end_of_text|>")
]
t1 = time.time()
# inference
outputs = model.generate(
input_ids,
max_new_tokens=256,
pad_token_id=tokenizer.eos_token_id,
eos_token_id=terminators,
do_sample=True,
temperature=0.6,
top_p=0.9,
)
print("-------------------------------------------")
print("infer time " , time.time() - t1)
# outputs[0] still contains the prompt tokens; slice with outputs[0][input_ids.shape[-1]:] to keep only the reply
response = outputs[0]
print(tokenizer.decode(response, skip_special_tokens=True))
With this setup, the response ends once the first assistant turn is finished.
If something unexpected happens, e.g. Llama keeps repeating a sentence, try commenting out pad_token_id=tokenizer.eos_token_id.
That argument is only there to suppress a warning, and unfortunately I do not know exactly what happens under the hood here.
Update: solving the repetition issue
The idea originally comes from this paper; one of my brilliant colleagues pointed it out to me.
Add a repetition penalty to the config.json file:
{
"architectures": [
"LlamaForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 128000,
"repetition_penalty": 1.2,
"eos_token_id": 128009,
#parameters continue...
This could, to some extent, alleviate the repetition problem.
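If you would rather not edit config.json, the same knob can be passed per call; repetition_penalty is a standard generate() argument:
# Alternative to editing config.json: set the penalty per generate() call.
outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    repetition_penalty=1.2,
)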
Another update: adding the penalty to the 8B model just sucks. Llama 3 generates garbled text, so use it prudently. The 70B model, however, handles it far better than the smaller one.
Chatbot
If you want to make it a chatbot, wrap the input message and the get_response call in a while loop. I will not implement it fully here; instead, here is a brief structure (not real code):
def chatbot(self, system_instructions=""):
    conversation = [{"role": "system", "content": system_instructions}]
    while True:
        user_input = input("User: ")
        if user_input.lower() in ["exit", "quit"]:
            print("Exiting the chatbot. Goodbye!")
            break
        response, conversation = self.get_response(user_input, conversation)
        print(f"Assistant: {response}")
70B versus 8B
70B is way better. It is stable and fast, despite a relatively long loading time and high computational demand. I have seen people suggest that 8B was nerfed by Meta. XD.
RAG enhanced
To be added in the future.
Reference
Llama 3 was a very new model when this blog was written. I surfed numerous threads and tutorials and found that everyone was just as confused as I was. This blog is mainly a record of my own debugging process and is not authoritative at all, so please take responsibility for finding the right solution for your own case.
Something weird: I couldn't get the original Llama 3 running on my 3070 Ti machine, since the example code does not work, and I could barely find tutorials for the original version. If you want to try more examples, refer to the Llama 3 recipes, which offer some useful explanations. Also, if you work in C/C++ you could try llama.cpp; for me, Python transformers is enough.
Links:
https://huggingface.co/docs/transformers/installation
https://huggingface.co/docs/bitsandbytes/main/en/installation
https://huggingface.co/docs/accelerate/basic_tutorials/install
https://github.com/meta-llama/llama3/issues/104
https://medium.com/@manuelescobar-dev/implementing-and-running-llama-3-with-hugging-faces-transformers-library-40e9754d8c80
https://huggingface.co/docs/transformers/chat_templating
https://huggingface.co/blog/4bit-transformers-bitsandbytes
https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/commit/1bd790873a4163298eee9920db72439fff5815b1
This work is licensed under CC BY-NC 4.0