🤖 Transformers on Jetson¶
Author: Dr. Kaikai Liu, Ph.D.
Position: Associate Professor, Computer Engineering
Institution: San Jose State University
Contact: kaikai.liu@sjsu.edu
🧠 What Are Transformers?¶
Transformers are a type of deep learning model designed to handle sequential data, such as text, audio, or even images. Introduced in the 2017 paper "Attention Is All You Need," transformers replaced recurrent neural networks in many NLP tasks.
🔑 Key Components¶
- Self-Attention: Each token attends to all other tokens in a sequence (see the minimal sketch after this list).
- Positional Encoding: Adds order information to input tokens.
- Multi-head Attention: Parallel attention mechanisms capture different relationships.
- Feedforward Layers: Apply transformations independently to each position.
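To make the self-attention component concrete, here is a minimal, illustrative scaled dot-product attention sketch in PyTorch; it is a teaching example, not code taken from any particular model implementation:

```python
# Minimal, illustrative scaled dot-product self-attention (PyTorch).
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, d_model) tensors."""
    d_k = q.size(-1)
    # Each token's query is scored against every token's key
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5    # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)              # attention weights per token sum to 1
    return weights @ v                               # weighted sum of value vectors

x = torch.randn(1, 5, 64)   # toy sequence: 5 tokens, 64-dim embeddings
out = scaled_dot_product_attention(x, x, x)
print(out.shape)            # torch.Size([1, 5, 64])
```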
📚 Popular Transformer Architectures¶
| Model | Purpose | Examples |
| --- | --- | --- |
| BERT | Encoder (bi-directional) | Question answering, embeddings |
| GPT | Decoder (uni-directional) | Text generation |
| T5 | Encoder-Decoder | Translation, summarization |
| LLaMA/Qwen | Open-source LLMs | General language modeling |
🤗 HuggingFace Transformers on Jetson¶
While large LLMs require quantization, many HuggingFace models (BERT, DistilBERT, TinyGPT) can run on Jetson using PyTorch + Transformers with ONNX export or quantized alternatives.
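Before applying any optimization, a small model such as DistilBERT can be run directly with the `transformers` pipeline API. The following is a minimal baseline sketch; the model name and device handling are common defaults rather than a prescribed setup:

```python
# Baseline sketch: run DistilBERT sentiment analysis with plain PyTorch + Transformers.
# The model name and device index are illustrative defaults, not required values.
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1   # 0 = first CUDA device, -1 = CPU
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=device,
)
print(classifier("Jetson is amazing for edge AI!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```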
🚀 Basic vs Accelerated Inference¶
| Approach | Speed | Memory | Complexity | Best For |
| --- | --- | --- | --- | --- |
| Basic PyTorch | Baseline | High | Low | Development, prototyping |
| ONNX Runtime | 2-3x faster | Medium | Medium | Production inference |
| TensorRT | 3-5x faster | Low | High | Optimized deployment |
| Quantization | 2-4x faster | 50% less | Medium | Resource-constrained |
✨ What is NLP?¶
Natural Language Processing (NLP) is a subfield of AI that enables machines to read, understand, and generate human language.
💬 Common NLP Tasks¶
- Text Classification (e.g., sentiment analysis, spam detection)
- Named Entity Recognition (NER) (extracting entities like names, locations)
- Machine Translation (translating between languages)
- Question Answering (extracting answers from context)
- Text Summarization (generating concise summaries)
- Chatbots & Conversational AI (interactive dialogue systems)
- Text Generation (creating human-like text)
- Information Extraction (structured data from unstructured text)
🔧 Comprehensive HuggingFace Examples with transformers_llm_demo.py¶
Instead of implementing individual examples, we've created a comprehensive demonstration script called transformers_llm_demo.py that showcases various NLP applications using HuggingFace transformers with optimization techniques specifically designed for Jetson devices.
This script provides a modular, command-line driven interface for exploring different NLP tasks and acceleration methods. Let's explore the key features and optimization techniques implemented in this demo.
📋 Available Applications¶
The transformers_llm_demo.py script supports seven different NLP applications:
1. Text Classification (Sentiment Analysis)
   - Analyzes text sentiment using DistilBERT models
   - Provides both basic and ONNX-optimized implementations
2. Text Generation (GPT-2)
   - Generates text continuations from prompts using GPT-2
   - Implements both basic and quantized + GPU-accelerated versions
3. Question Answering (BERT)
   - Extracts answers from context passages using BERT models
   - Offers basic pipeline and optimized JIT-compiled implementations
4. Named Entity Recognition (NER)
   - Identifies entities (people, organizations, locations) in text
   - Provides both basic and batch-optimized implementations
5. Batch Processing Demo
   - Demonstrates efficient processing of multiple texts
   - Automatically determines optimal batch sizes for your hardware
6. Model Benchmarking
   - Measures performance metrics across multiple runs
   - Reports detailed statistics on inference time and resource usage
7. Performance Comparison
   - Directly compares basic vs. optimized implementations
   - Calculates speedup factors and memory-efficiency gains
⚡ Optimization Techniques¶
The demo implements several optimization techniques that are particularly valuable for edge devices like the Jetson:
1. ONNX Runtime and TensorRT Acceleration¶
What it does: Provides hardware-optimized inference using ONNX Runtime with GPU acceleration and TensorRT integration for maximum performance on Jetson devices.
Implementation details:
- Uses onnxruntime directly with the CUDA execution provider for GPU acceleration
- Integrates tensorrt for additional optimization on NVIDIA hardware
- Automatically selects appropriate execution provider (CUDA, TensorRT, or CPU)
- Handles fallback to basic implementation if acceleration libraries are unavailable
What you need to add:
- Install ONNX Runtime GPU: pip install onnxruntime-gpu
- Install TensorRT: Follow NVIDIA's installation guide for your Jetson device
- For optimal performance, ensure both libraries are properly configured for your hardware
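The execution-provider selection described above can be sketched as follows. The model file (model.onnx) and the input names and shapes are placeholders for whatever model you export, not values taken from the demo script:

```python
# Sketch: pick the best available ONNX Runtime execution provider on Jetson.
# "model.onnx" and the input names/shapes are placeholders for your exported model.
import numpy as np
import onnxruntime as ort

available = ort.get_available_providers()
# Prefer TensorRT, then CUDA, then fall back to CPU
preferred = [p for p in ("TensorrtExecutionProvider",
                         "CUDAExecutionProvider",
                         "CPUExecutionProvider") if p in available]

session = ort.InferenceSession("model.onnx", providers=preferred)
print("Using providers:", session.get_providers())

# Dummy inputs matching a typical exported text-classification model
inputs = {
    "input_ids": np.ones((1, 16), dtype=np.int64),
    "attention_mask": np.ones((1, 16), dtype=np.int64),
}
logits = session.run(None, inputs)[0]
print(logits.shape)
```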
2. 8-bit Quantization¶
What it does: Reduces model precision from 32-bit to 8-bit, decreasing memory usage and increasing inference speed.
Implementation details:
- Uses BitsAndBytesConfig for configuring quantization parameters
- Enables FP32 CPU offloading for handling operations not supported in INT8
- Combines with FP16 (half-precision) for operations that benefit from it
What you need to add:
- Install bitsandbytes: pip install bitsandbytes
- May require Jetson-specific compilation for optimal performance
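A minimal sketch of loading a model with 8-bit weights plus FP32 CPU offload is shown below. It assumes working bitsandbytes and accelerate installs on your Jetson; the model name and flags are illustrative defaults, not the demo's exact configuration:

```python
# Sketch: load GPT-2 with 8-bit weights plus FP32 CPU offload via bitsandbytes.
# Assumes bitsandbytes and accelerate are installed; model name and flags are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,   # offload INT8-unsupported ops to CPU in FP32
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    quantization_config=bnb_config,
    torch_dtype=torch.float16,               # FP16 for operations that benefit from it
    device_map="auto",                       # let accelerate place layers on GPU/CPU
)
```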
3. JIT Compilation¶
What it does: Compiles model operations into optimized machine code at runtime.
Implementation details:
- Uses torch.jit.script() to compile models
- Implements graceful fallback if compilation fails
- Applied to question answering models for faster inference
What you need to add:
- No additional packages required (built into PyTorch)
- Ensure you're using a recent PyTorch version with good JIT support
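The demo is described as using torch.jit.script() with a graceful fallback. The sketch below follows the same try/fallback pattern but uses torch.jit.trace(), which is generally more reliable for HuggingFace models; the model name and inputs are illustrative:

```python
# Sketch of the try/fallback pattern: attempt TorchScript compilation, fall back to eager mode.
# Uses torch.jit.trace(), which is generally more reliable than scripting for HuggingFace models.
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

name = "distilbert-base-cased-distilled-squad"   # illustrative QA model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name, torchscript=True).eval()

inputs = tokenizer("How many CUDA cores?",
                   "The Jetson has 1024 CUDA cores", return_tensors="pt")
try:
    compiled = torch.jit.trace(model, (inputs["input_ids"], inputs["attention_mask"]))
except Exception as exc:
    print(f"JIT compilation failed, falling back to eager mode: {exc}")
    compiled = model

with torch.no_grad():
    start_logits, end_logits = compiled(inputs["input_ids"], inputs["attention_mask"])
```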
4. Batch Processing¶
What it does: Processes multiple inputs simultaneously for higher throughput.
Implementation details:
- Custom TextDataset class for efficient batch handling
- Dynamic batch size determination based on available memory
- Particularly effective for NER and classification tasks
What you need to add:
- No additional packages required
- Consider adjusting batch sizes based on your specific Jetson model
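A rough sketch of batched classification with a simple text dataset is shown below. This TextDataset is a hypothetical stand-in for the one in the demo script, and the model name and batch size are illustrative:

```python
# Sketch: batched classification with a simple Dataset + DataLoader.
# This TextDataset is a hypothetical stand-in for the one in the demo script.
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class TextDataset(Dataset):
    def __init__(self, texts):
        self.texts = texts
    def __len__(self):
        return len(self.texts)
    def __getitem__(self, idx):
        return self.texts[idx]

name = "distilbert-base-uncased-finetuned-sst-2-english"   # illustrative model
tokenizer = AutoTokenizer.from_pretrained(name)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForSequenceClassification.from_pretrained(name).eval().to(device)

texts = ["Jetson is amazing for edge AI!", "Thermal throttling slows things down."] * 8
loader = DataLoader(TextDataset(texts), batch_size=8)   # tune batch size for your Jetson

with torch.no_grad():
    for batch in loader:
        enc = tokenizer(list(batch), padding=True, truncation=True,
                        return_tensors="pt").to(device)
        preds = model(**enc).logits.argmax(dim=-1)
        print(preds.tolist())
```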
5. GPU Memory Optimization¶
What it does: Carefully manages GPU memory to prevent out-of-memory errors on memory-constrained devices.
Implementation details:
- Implements find_optimal_batch_size() to automatically determine the largest workable batch size
- Uses torch.cuda.empty_cache() to free memory between operations
- Monitors memory usage with the performance_monitor() context manager
What you need to add:
- Optional: Install GPUtil for enhanced GPU monitoring: pip install gputil
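One common way to implement the "largest workable batch size" idea is to grow the batch until an out-of-memory error occurs and then back off. The sketch below is a hypothetical version of find_optimal_batch_size(), not the demo's exact implementation:

```python
# Hypothetical sketch of find_optimal_batch_size(): double the batch size until the GPU
# runs out of memory, then keep the last size that worked. Not the demo's exact code.
import torch

def find_optimal_batch_size(model, make_batch, start=1, max_size=256):
    """make_batch(n) should return keyword inputs for a batch of n examples."""
    size, best = start, start
    while size <= max_size:
        try:
            with torch.no_grad():
                model(**make_batch(size))   # trial forward pass at this batch size
            best = size
            size *= 2
        except RuntimeError as exc:
            if "out of memory" not in str(exc).lower():
                raise                        # re-raise unrelated errors
            break
        finally:
            torch.cuda.empty_cache()         # free cached blocks between attempts
    return best
```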
6. KV Caching for Text Generation¶
What it does: Caches key-value pairs in transformer attention layers to avoid redundant computations during text generation.
Implementation details:
- Enables use_cache=True in the model generation parameters
- Particularly effective for autoregressive generation tasks
- Combined with quantization for maximum efficiency
What you need to add:
- No additional packages required (built into the transformers library)
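A minimal sketch of generation with the KV cache enabled follows; use_cache=True is already the default in transformers and is shown explicitly here, and the model name and generation parameters are illustrative:

```python
# Sketch: GPT-2 generation with the KV cache enabled. use_cache=True is already the
# default in transformers; it is shown explicitly here. Parameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained("gpt2").eval().to(device)

inputs = tokenizer("Edge AI computing with Jetson", return_tensors="pt").to(device)
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_length=50,
        do_sample=True,
        use_cache=True,                       # reuse cached key/value tensors each step
        pad_token_id=tokenizer.eos_token_id,  # avoid a warning for models without a pad token
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))
```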
🚀 Using the Demo¶
The demo can be run from the command line with various options:
```bash
# List available applications
python transformers_llm_demo.py --list

# Run text classification with optimization
python transformers_llm_demo.py --app 1 --text "Jetson is amazing for edge AI!" --optimize

# Run text generation with custom parameters
python transformers_llm_demo.py --app 2 --prompt "Edge AI computing with Jetson" --max-length 100

# Run question answering
python transformers_llm_demo.py --app 3 --question "How many CUDA cores?" --context "The Jetson has 1024 CUDA cores"
```
The script provides detailed performance metrics for each run, including:
- Inference time
- Memory usage
- CPU/GPU utilization
- Temperature monitoring (when available)
📊 Performance Monitoring¶
The demo includes a comprehensive performance monitoring system that tracks:
- Execution time for each operation
- GPU memory allocation and usage
- CPU utilization changes
- GPU load and temperature (when available)
This monitoring helps identify bottlenecks and optimize your models for the specific constraints of Jetson devices.
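As a hypothetical sketch of what a performance_monitor() context manager in this spirit could look like, the example below times a block and reports the change in allocated GPU memory; the demo's actual implementation may track additional metrics such as CPU utilization and temperature:

```python
# Hypothetical sketch of a performance_monitor() context manager: times a block and
# reports the change in allocated GPU memory. The demo's version may track more metrics.
import time
from contextlib import contextmanager

import torch

@contextmanager
def performance_monitor(label):
    use_gpu = torch.cuda.is_available()
    if use_gpu:
        torch.cuda.synchronize()
        mem_before = torch.cuda.memory_allocated()
    start = time.perf_counter()
    try:
        yield
    finally:
        if use_gpu:
            torch.cuda.synchronize()
            mem_delta = (torch.cuda.memory_allocated() - mem_before) / 1024**2
        elapsed = time.perf_counter() - start
        msg = f"[{label}] time: {elapsed:.3f}s"
        if use_gpu:
            msg += f", GPU memory delta: {mem_delta:+.1f} MiB"
        print(msg)

# Usage:
# with performance_monitor("question answering"):
#     result = qa_pipeline(question=..., context=...)
```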