Running with transformers

If you want an even simpler setup without any coding and only want an AI assistant on your computer, go to:
My Grandma can follow this tutorial to use LLama with a single hand


Model Download

The Hugging Face version, backed by a large community and more tutorials, is more polished and easier to use. If you want the original version, go to the Meta repository and run the download.sh script.
Two ways to fetch the Hugging Face weights from the command line:

huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --exclude "original/*" --local-dir Meta-Llama-3-8B-Instruct

git clone --progress  https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct

If both of these methods are blocked for you, download the model files directly from your browser. I use IDM and the speed is acceptable for me.
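
You can also stay in Python and use huggingface_hub directly. Below is a minimal sketch with snapshot_download; the repo id, exclude pattern, and local directory simply mirror the CLI command above, so adjust them to your setup.

# minimal sketch: same download as the CLI command, but from Python
# log in first (huggingface-cli login), since the Meta-Llama repos are gated
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
    local_dir="Meta-Llama-3-8B-Instruct",
    ignore_patterns=["original/*"],  # skip the original (non-HF) checkpoint files
)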

Conda Install

You can also do this in a venv. Pay attention to the install channel.
Use the latest versions; if yours are outdated, update conda and pip first.

huggingface_hub is tested on Python 3.8+.

conda install -c conda-forge huggingface_hub

conda install conda-forge::transformers

conda install -c conda-forge accelerate

pip install bitsandbytes

Also, bitsandbytes may not compile against your CUDA version, so check it first. Mine is CUDA 11.7 on a Linux server with the latest version of bitsandbytes.
These libraries are updated from time to time, so you may well run into version or compilation errors.
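
After installing, I find it worth running a quick sanity check before touching the model. This is just a sketch to confirm that torch sees the GPU and that transformers and bitsandbytes import cleanly against your CUDA:

# quick environment sanity check
import torch
import transformers
import bitsandbytes

print("transformers:", transformers.__version__)
print("bitsandbytes:", bitsandbytes.__version__)
print("CUDA available:", torch.cuda.is_available())
print("torch built against CUDA:", torch.version.cuda)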

You can also use the pipeline API to run the model, which is a higher-level approach than loading it with transformers yourself. However, transformers gives you more flexibility and lets you customize parts of the process; a pipeline sketch is shown below for comparison.
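
For reference, here is a minimal pipeline sketch. It assumes a recent transformers version that accepts chat-style messages directly, and it points at the same local model folder (llama3-8b-instruct) used in the full script later on:

# minimal pipeline sketch (high-level API), assumes a recent transformers version
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="llama3-8b-instruct",  # local path, same folder as in the full script below
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a knowledgeable AI assistant"},
    {"role": "user", "content": "Give me three things that people rarely know."},
]

out = pipe(messages, max_new_tokens=256)
# with chat-style input, generated_text holds the whole conversation,
# so the last message should be the assistant reply
print(out[0]["generated_text"][-1])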

Note: this script pins the model to a single GPU, because 8B-Instruct is not a particularly large model and splitting it across multiple GPUs actually slows inference down (2 s to 19 s in my case).

LLama3 keeps talking to itself until the max token limit is reached

Go to tokenizer_config.json and locate the entry around line 2055.
Change it to "eos_token": "<|eot_id|>" and be sure to double-check your chat template. Mine was somehow missing, so here is my modified template:

"bos_token": "<|begin_of_text|>",
  "clean_up_tokenization_spaces": true,
  "chat_template": "{% set loop_messages = messages %}
{% for message in loop_messages %}
{% set content = '<|start_header_id|>' + message['role'] + ': ' +'<|end_header_id|>\n\n'+ message['content'] | trim + '\n' + '<|eot_id|>' %}
{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}
 content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>LLama3(<|end_header_id|>\n\n' ){% endif %}",
  "eos_token": "<|eot_id|>",
  "model_input_names": [
    "input_ids",
    "attention_mask"
  ],
  "model_max_length": 1000000000000000019884624838656,
  "tokenizer_class": "PreTrainedTokenizerFast"

To use bitsandbytes, you must pass a BitsAndBytesConfig object to configure the quantization settings.
Here is my code:

import transformers
import time
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
import ipdb
import os 

# os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# NF4 (normalized float 4) is the default 4-bit quantization type
# you can customize your quantization settings here
nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)
# local file path to your model
model_path = "llama3-8b-instruct"

messages = [
  {"role": "system", "content": "You are a knowledgeable AI assistant"},
  {
      "role": "user", "content": "Give me three things that people rarely know."}
]

#set up tokenizer for your model
tokenizer = AutoTokenizer.from_pretrained(model_path)

#load model
model = AutoModelForCausalLM.from_pretrained(model_path, quantization_config=nf4_config, device_map="auto")

# apply the chat template
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# token ids that should terminate generation
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    tokenizer.convert_tokens_to_ids("<|end_of_text|>")
]

t1 = time.time()
# inference 
outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
print("-------------------------------------------")
print("infer time " , time.time() - t1)
response = outputs[0]  # full sequence; use outputs[0][input_ids.shape[-1]:] to drop the prompt
print(tokenizer.decode(response, skip_special_tokens=True))

The response now ends once the first turn is finished.
If something unexpected happens, e.g. LLama keeps repeating a sentence, try commenting out pad_token_id=tokenizer.eos_token_id.
That argument is only there to suppress a warning, and unfortunately I do not know exactly what happens under the hood.

Updated: solving the repetition issue

The trick originally comes from this paper; one of my intelligent colleagues pointed it out to me.
Add a repetition penalty to the config.json file:

{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,

  "repetition_penalty": 1.2,

  "eos_token_id": 128009,
  #parameters continue...

This could, to some extent, alleviate the repetition problem.
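
If you would rather not edit config.json, the same knob can also be passed directly to generate() at call time. A sketch, reusing the model, input_ids, and terminators from the script above:

# alternative: set the penalty per call instead of editing config.json
outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    repetition_penalty=1.2,  # same value as in config.json
)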

[example outputs omitted]

Update again: adding the penalty to the 8B model just makes things worse, and LLama generates garbled text, so please be prudent with it. The 70B model, however, handles it way better than the smaller one.

Chatbot

If you want to make it a chatbot, wrap the input prompt and the get_response call in a while loop. I will not fully implement it here; instead, here is a brief structure (not real code), with a rough sketch of get_response after it:

def chatbot(self, system_instructions=""):
    conversation = [{"role": "system", "content": system_instructions}]
    while True:
        user_input = input("User: ")
        if user_input.lower() in ["exit", "quit"]:
            print("Exiting the chatbot. Goodbye!")
            break
        response, conversation = self.get_response(user_input, conversation)
        print(f"Assistant: {response}")

70B versus 8B

70B is way better. It is stable and fast, despite a relatively long loading time and high computational demand. I have seen people suggest that 8B was nerfed by Meta. XD

RAG enhanced

To be added in the future.

Reference

LLama3 was a very new model when this blog was written. I surfed numerous threads and tutorials and discovered that everyone was just as confused as I was. This blog is mainly a record of my own debugging process, not an authoritative guide, so please take responsibility for finding the solution that fits your case.
Something weird: I could not get the original LLama3 running on my 3070 Ti, as the example code does not work there. Besides, I could barely find tutorials covering the original version. If you want to try more examples, refer to the LLama3 recipes, which offer some useful explanations. Also, if you work in C/C++ you could try llama.cpp; for me, the Python transformers stack is enough.

Links:

https://huggingface.co/docs/transformers/installation
https://huggingface.co/docs/bitsandbytes/main/en/installation
https://huggingface.co/docs/accelerate/basic_tutorials/install
https://github.com/meta-llama/llama3/issues/104
https://medium.com/@manuelescobar-dev/implementing-and-running-llama-3-with-hugging-faces-transformers-library-40e9754d8c80
https://huggingface.co/docs/transformers/chat_templating
https://huggingface.co/blog/4bit-transformers-bitsandbytes
https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/commit/1bd790873a4163298eee9920db72439fff5815b1

This work is licensed under CC BY-NC 4.0