Easily Run a Variety of LLMs Locally with Intel® Arc™ GPUs
Generative AI has changed the landscape of what’s possible in content creation. This technology has the potential to deliver previously unimagined images, videos and writing. Large language models (LLMs) have been making headlines in the age of AI, allowing anyone to prompt their way to generate song lyrics, get answers to complex physics questions, or draft an outline for a slide presentation. And these AI features no longer need to be connected to the cloud or subscription services. They can run locally on your own PC, where you have full control over the model to customize its output.
In this article, we'll show you how to set up and experiment with popular LLMs on a PC with the Intel® Arc™ A770 16GB graphics card. While this tutorial uses the Mistral-7B-Instruct model, the same steps work with a PyTorch LLM of your choosing, such as Phi-2 or Llama 2. And yes, the latest Llama 3 model works too!
IPEX-LLM
The reason we can run a variety of models using the same base installation is thanks to IPEX-LLM, an LLM library for PyTorch. It is built on top of Intel® Extension for PyTorch and contains state-of-the-art LLM optimizations and low-bit (INT4/FP4/INT8/FP8) weight compression, with all the latest performance optimizations for Intel hardware. IPEX-LLM takes advantage of the XMX AI acceleration in the Xe-cores of Intel discrete GPUs such as Arc A-series graphics cards for improved performance. It supports Intel Arc A-series graphics on Windows Subsystem for Linux version 2, native Windows environments, and native Linux.
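To make that concrete, here is a minimal sketch of the drop-in pattern, condensed from the full demo later in this article (the prompt text is just an illustration): the only changes to stock Hugging Face code are importing AutoModelForCausalLM from ipex_llm.transformers and moving the model and inputs to the "xpu" device.

from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM  # drop-in replacement for transformers.AutoModelForCausalLM
import intel_extension_for_pytorch  # registers the "xpu" device with PyTorch

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)  # INT4 weight compression
model = model.to("xpu")  # run on the Intel Arc GPU

inputs = tokenizer("Explain what a token is in one sentence.", return_tensors="pt").to("xpu")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))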
And because all of this is native PyTorch, you can easily swap PyTorch models and input data to run on an Intel Arc GPU with high performance acceleration. This experiment would not have been complete without a performance comparison. Using the instructions below for Intel Arc and commonly available instructions for the competition, we’ve looked at two discrete GPUs positioned in a similar price segment.
For example, when running the Mistral 7B model with the IPEX-LLM library, the Arc A770 16GB graphics card can process 70 tokens per second (TPS), or 70% more TPS than the GeForce RTX 4060 8GB using CUDA. What does that mean? A general rule of thumb is that 1 token is equivalent to 0.75 of a word and a good comparison is the average human reading speed of 4 words per second or 5.3 TPS. The Arc A770 16GB graphics card can generate words much faster than the average person can read them!
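The arithmetic behind that comparison is simple enough to sketch, using the figures quoted above:

tokens_per_second = 70            # Arc A770 16GB running Mistral 7B with IPEX-LLM, as noted above
words_per_token = 0.75            # rule of thumb: 1 token is roughly 0.75 of a word
reading_speed_wps = 4             # average human reading speed, words per second

generation_wps = tokens_per_second * words_per_token     # 52.5 words generated per second
reading_speed_tps = reading_speed_wps / words_per_token  # about 5.3 tokens read per second
print(generation_wps, reading_speed_tps)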
Our internal testing shows that the Arc A770 16GB graphics card can deliver this capability and competitive or leading performance across a wide range of models compared to the RTX 4060, making Intel Arc graphics a great choice for local LLM execution.
Now let's walk through the setup instructions to get you started with LLMs on your Arc A-series GPU.
Installation Instructions
You can also refer to this page for setting up the environment: Install IPEX-LLM on Windows with Intel GPU — IPEX-LLM latest documentation
1. Disable the integrated GPU in Device Manager.
2. Download and install Anaconda.
3. After installation is completed, open the Start menu, search for Anaconda Prompt, run it as administrator, and create a virtual environment using the following commands. Enter each command separately:
conda create -n llm python=3.10.6
conda activate llm
conda install libuv
pip install dpcpp-cpp-rt==2024.0.2 mkl-dpcpp==2024.0.0 onednn==2024.0.0 gradio
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
pip install transformers==4.38.0
4. Create a text document named demo.py and save it to C:\Users\Your_Username\Documents or to the directory of your choice.
5. Open demo.py with your favorite editor and copy the following code sample into it:
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM
import torch
import intel_extension_for_pytorch  # registers the "xpu" device with PyTorch
device = "xpu" # the device to load the model onto
model_id = "mistralai/Mistral-7B-Instruct-v0.2" # huggingface model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, torch_dtype=torch.float16)
model = model.to(device)
messages = [
{"role": "user", "content": "What is your favourite condiment?"},
{"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
{"role": "user", "content": "Do you have mayonnaise recipes?"}
]
encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")
model_inputs = encodeds.to(device)  # the model was already moved to the XPU above
generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])
This code is adapted from the sample code in this repository.
6. Save demo.py. In the Anaconda Prompt, use the cd command to navigate to the directory where demo.py is located, then run the following command:
python demo.py
Now you can get a nice recipe for making mayonnaise!
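If demo.py instead complains that no XPU device is found or falls back to the CPU, a quick sanity check (a minimal sketch; run it inside the activated llm environment) confirms that PyTorch can see the Arc GPU:

import torch
import intel_extension_for_pytorch  # adds the torch.xpu backend

print(torch.xpu.is_available())      # expect True
print(torch.xpu.get_device_name(0))  # expect your Arc A-series GPU to be listed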
Changing Models
Using the same environment that we set up above, you can experiment with other popular models on Hugging Face such as Llama-2-7b-chat-hf, Meta-Llama-3-8B-Instruct, phi-2, gemma-7b-it, and stablelm-2 by replacing the Hugging Face model ID in demo.py. For example, change
model_id = "mistralai/Mistral-7B-Instruct-v0.2" # huggingface model id
to
model_id = "stabilityai/stablelm-2-zephyr-1_6b" # huggingface model id
Different models may require different versions of the transformers package. If you run into errors when launching demo.py, follow the steps below to upgrade or downgrade transformers:
- Open Anaconda Prompt
- conda activate llm
- pip install transformers==4.37.0 (substitute the version listed for your model in the table below)
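To confirm which version is currently installed in the llm environment, you can check it directly:

python -c "import transformers; print(transformers.__version__)"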
Verified transformers versions:
| Model ID | transformers package version |
|---|---|
| meta-llama/Llama-2-7b-chat-hf | 4.37.0 |
| meta-llama/Meta-Llama-3-8B-Instruct | 4.37.0 |
| stabilityai/stablelm-2-zephyr-1_6b | 4.38.0 |
| mistralai/Mistral-7B-Instruct-v0.2 | 4.38.0 |
| microsoft/phi-2 | 4.38.0 |
| google/gemma-7b-it | 4.38.1 |
| THUDM/chatglm3-6b | 4.38.0 |
Memory requirements may vary per model and framework. For the Intel Arc A750 8GB running with IPEX-LLM, we recommend using Llama-2-7B-chat-hf, Mistral-7B-Instruct-v0.2, phi-2 or chatglm3-6B.
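As a rough, back-of-the-envelope estimate (not a measured number), the 4-bit weights alone for a 7B-parameter model occupy a little over 3 GB, which is why 7B-class models leave headroom on an 8GB card for activations and the KV cache:

params = 7e9                      # approximate parameter count of a 7B model
bytes_per_param = 0.5             # INT4 quantization stores each weight in 4 bits
weight_gib = params * bytes_per_param / 1024**3
print(f"{weight_gib:.1f} GiB")    # roughly 3.3 GiB for the quantized weights alone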
Implementing a ChatBot WebUI
Now let's move on to implementing a Gradio chatbot web UI for a better experience in your web browser. For more information on implementing an interactive chatbot with LLMs, visit https://www.gradio.app/guides/creating-a-chatbot-fast
1. Create a document named chatbot_gradio.py in your text editor of choice.
2. Copy and paste the following code snippet into chatbot_gradio.py:
import gradio as gr
import torch
import intel_extension_for_pytorch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer, StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer
from threading import Thread
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True, optimize_model=True, load_in_4bit=True, torch_dtype=torch.float16)
model = model.half()
model = model.to("xpu")
class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        stop_ids = [29, 0]  # stop token IDs; these are model-specific, so adjust them for the tokenizer you use
        for stop_id in stop_ids:
            if input_ids[0][-1] == stop_id:
                return True
        return False

def predict(message, history):
    stop = StopOnTokens()
    # Rebuild the conversation history in the chat-template format the tokenizer expects
    history_format = []
    for human, assistant in history:
        history_format.append({"role": "user", "content": human})
        history_format.append({"role": "assistant", "content": assistant})
    history_format.append({"role": "user", "content": message})
    prompt = tokenizer.apply_chat_template(history_format, tokenize=False, add_generation_prompt=True)
    model_inputs = tokenizer(prompt, return_tensors="pt").to("xpu")
    # Stream tokens back to the UI as they are generated
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    generate_kwargs = dict(
        model_inputs,
        streamer=streamer,
        max_new_tokens=300,
        do_sample=True,
        top_p=0.95,
        top_k=20,
        temperature=0.8,
        num_beams=1,
        pad_token_id=tokenizer.eos_token_id,
        stopping_criteria=StoppingCriteriaList([stop])
    )
    # Run generation on a background thread so tokens can be yielded as they arrive
    t = Thread(target=model.generate, kwargs=generate_kwargs)
    t.start()
    partial_message = ""
    for new_token in streamer:
        if new_token != '<':
            partial_message += new_token
            yield partial_message

gr.ChatInterface(predict).launch()
3. Open a new Anaconda Prompt and enter the following commands:
- conda activate llm
- pip install gradio (if it is not already installed from the earlier setup)
- cd to the directory where chatbot_gradio.py is located
- python chatbot_gradio.py
4. Open your web browser and navigate to http://127.0.0.1:7860. You should see a chatbot set up with the Mistral-7B-Instruct-v0.2 language model! You now have a polished web UI for your chatbot.
5. Ask a question to start a conversation with your chatbot.
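By default, Gradio serves the interface only on localhost. If you want to reach the chatbot from another device, the launch() call on the last line of chatbot_gradio.py accepts a few options (a sketch; check the Gradio documentation for the version you have installed):

gr.ChatInterface(predict).launch(server_name="0.0.0.0", server_port=7860)  # reachable from other devices on your LAN
# or
gr.ChatInterface(predict).launch(share=True)  # temporary public link via Gradio's tunnel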
Notices and Disclaimers
Performance varies by use, configuration, and other factors. Learn more on the Performance Index site.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
Results that are based on pre-production systems and components as well as results that have been estimated or simulated using an Intel Reference Platform (an internal example new system), internal Intel analysis or architecture simulation or modeling are provided to you for informational purposes only. Results may vary based on future changes to any systems, components, specifications or configurations.
Your costs and results may vary.
Intel technologies may require enabled hardware, software or service activation.
© Intel Corporation. Intel, the Intel logo, Arc, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.
*Other names and brands may be claimed as the property of others.