# NLP Applications & LLM Optimization on Jetson
Author: Dr. Kaikai Liu, Ph.D.
Position: Associate Professor, Computer Engineering
Institution: San Jose State University
Contact: kaikai.liu@sjsu.edu
This tutorial covers essential techniques for deploying and optimizing NLP applications on Jetson devices. You'll learn about various optimization strategies, benchmark different NLP tasks, and implement production-ready solutions. All examples and implementations are combined into a single, easy-to-use command-line tool (`jetson_nlp_toolkit.py`) that you can use to experiment with different approaches.
## Why Optimize LLMs on Jetson?
The Jetson Orin Nano has limited power and memory (e.g., 8 GB), so optimizing models for:

- Lower memory usage
- Faster inference latency
- Better energy efficiency

enables real-time NLP applications at the edge.
## Optimization Strategies

### 1. Model Quantization

Quantization reduces the precision of model weights (e.g., FP32 → INT8 or Q4) to shrink model size and improve inference speed.

#### What is Q4_K_M?
- Q4 = 4-bit quantization (roughly 8x smaller than FP32 weights)
- K = llama.cpp's "K-quant" scheme: block-wise grouped quantization that helps preserve accuracy
- M = the "medium" variant, which keeps selected tensors at higher precision for better quality

Q4_K_M is a common choice in llama.cpp for the best quality/speed tradeoff on Jetson.
### 2. Use Smaller or Distilled Models
Distillation creates smaller models (e.g., DistilBERT) by mimicking larger models while reducing parameters.
- Faster and lighter than full LLMs
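As a quick sanity check of the size difference, this hedged sketch simply counts parameters for BERT-base versus DistilBERT (both are standard Hugging Face checkpoints):

```python
# distil_size_check.py -- compare parameter counts of a full vs. distilled encoder
from transformers import AutoModel

for name in ["bert-base-uncased", "distilbert-base-uncased"]:
    model = AutoModel.from_pretrained(name)
    params = sum(p.numel() for p in model.parameters()) / 1e6
    print(f"{name}: {params:.0f}M parameters")
```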
### 3. Use TensorRT or ONNX for Inference

Export HuggingFace or PyTorch models to ONNX and run them with:

- `onnxruntime-gpu`
- TensorRT engines

for lower latency and reduced memory use (see the bonus lab below for a full export walkthrough).
### 4. Offload Selected Layers

For large models, tools like `llama-cpp-python` let you set `n_gpu_layers` to control how many transformer layers run on the GPU versus the CPU.
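A minimal sketch, assuming `llama-cpp-python` is installed and a local GGUF file is available (the model path below is a placeholder):

```python
# gpu_offload_sketch.py -- split transformer layers between GPU and CPU with llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="/models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=20,   # layers offloaded to the Jetson GPU; -1 offloads all of them
    n_ctx=2048,        # context window
)

out = llm("Explain edge AI in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```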
## NLP Application Evaluation Labs

### All-in-One NLP Toolkit
We've created a comprehensive Python script that combines all the NLP applications, optimization techniques, and evaluation methods from this tutorial into a single command-line tool.
#### Installation

```bash
# Clone the repository if you haven't already
git clone https://github.com/yourusername/edgeAI.git
cd edgeAI

# Install dependencies
pip install torch transformers datasets evaluate rouge-score fastapi uvicorn aiohttp psutil matplotlib numpy

# Optional dependencies for specific features
pip install redis llama-cpp-python requests
```
#### Usage
The toolkit provides several commands for different NLP tasks and optimizations:
```bash
python jetson_nlp_toolkit.py [command] [options]
```

Available commands:

- `evaluate`: Run the NLP evaluation suite
- `optimize`: Run optimization benchmarks
- `llm`: Compare LLM inference methods
- `server`: Run the NLP server
- `loadtest`: Run load tests against the NLP server
#### Examples

1. Evaluate NLP tasks:

```bash
# Run full evaluation suite
python jetson_nlp_toolkit.py evaluate --task all

# Evaluate a specific task (sentiment, qa, summarization, ner)
python jetson_nlp_toolkit.py evaluate --task sentiment

# Save results to a custom file
python jetson_nlp_toolkit.py evaluate --task all --output my_results.json
```
2. Benchmark optimization techniques:

```bash
# Compare quantization methods
python jetson_nlp_toolkit.py optimize --method quantization

# Test model pruning
python jetson_nlp_toolkit.py optimize --method pruning --ratio 0.3

# Use a custom model
python jetson_nlp_toolkit.py optimize --method all --model bert-base-uncased
```
3. Compare LLM inference methods:

```bash
# Test all inference methods
python jetson_nlp_toolkit.py llm --method all

# Test a specific method with a custom model
python jetson_nlp_toolkit.py llm --method huggingface --model gpt2

# Test llama.cpp with a custom model path
python jetson_nlp_toolkit.py llm --method llamacpp --model-path /path/to/model.gguf
```
4. Run the NLP server:

```bash
# Start server on the default port (8000)
python jetson_nlp_toolkit.py server

# Specify host and port
python jetson_nlp_toolkit.py server --host 127.0.0.1 --port 5000
```
5. Run load tests:

```bash
# Test the server with default settings
python jetson_nlp_toolkit.py loadtest

# Custom test configuration
python jetson_nlp_toolkit.py loadtest --url http://localhost:8000 --concurrent 20 --requests 500
```
## Lab 1: Multi-Application NLP Benchmark Suite
Objective: Evaluate and compare different NLP applications on Jetson using standardized datasets
### Setup Evaluation Environment

```bash
# Create evaluation container
docker run --rm -it --runtime nvidia \
  -v $(pwd)/nlp_eval:/workspace \
  -v $(pwd)/datasets:/datasets \
  nvcr.io/nvidia/pytorch:24.04-py3 /bin/bash

# Install evaluation dependencies
pip install transformers datasets evaluate rouge-score sacrebleu spacy
python -m spacy download en_core_web_sm
```
### Run Comprehensive Evaluation
Use the `jetson_nlp_toolkit.py` evaluate command:
```bash
# Run all NLP evaluations
python jetson_nlp_toolkit.py evaluate --task all

# Run specific task evaluations
python jetson_nlp_toolkit.py evaluate --task sentiment
python jetson_nlp_toolkit.py evaluate --task qa
python jetson_nlp_toolkit.py evaluate --task summarization
python jetson_nlp_toolkit.py evaluate --task ner

# Save results to a custom file
python jetson_nlp_toolkit.py evaluate --output my_results.json
```
The evaluation suite includes:

- Sentiment Analysis: IMDB dataset with DistilBERT
- Question Answering: SQuAD dataset with DistilBERT
- Text Summarization: CNN/DailyMail with T5-small
- Named Entity Recognition: CoNLL-2003 with BERT-large

Each evaluation measures:

- Accuracy/F1 scores
- Latency and throughput
- GPU memory usage
- CPU utilization
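For reference, the latency and GPU-memory numbers can be reproduced by hand with a few lines of PyTorch. This is a hedged sketch (not the toolkit's exact measurement code) using the standard SST-2 DistilBERT checkpoint:

```python
# latency_sketch.py -- rough per-sample latency and GPU memory for a sentiment model
import time
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1
clf = pipeline("sentiment-analysis",
               model="distilbert-base-uncased-finetuned-sst-2-english",
               device=device)

texts = ["I love this product!"] * 32
start = time.time()
_ = clf(texts, batch_size=8)
elapsed = time.time() - start

print(f"Avg latency: {1000 * elapsed / len(texts):.1f} ms/sample")
if torch.cuda.is_available():
    print(f"GPU memory: {torch.cuda.max_memory_allocated() / 1e6:.0f} MB")
```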
## Lab 2: Advanced Optimization Techniques
Objective: Implement and compare advanced optimization strategies for NLP models on Jetson
### Run Optimization Benchmarks

Use the `jetson_nlp_toolkit.py` optimize command:
```bash
# Run all optimization benchmarks
python jetson_nlp_toolkit.py optimize --method all

# Run specific optimization methods
python jetson_nlp_toolkit.py optimize --method quantization
python jetson_nlp_toolkit.py optimize --method pruning --ratio 0.3
python jetson_nlp_toolkit.py optimize --method distillation

# Benchmark with a custom model
python jetson_nlp_toolkit.py optimize --model "bert-base-uncased" --samples 100

# Save optimization results
python jetson_nlp_toolkit.py optimize --output optimization_results.json
```
The optimization benchmark compares:

- FP32: Full-precision baseline
- FP16: Half precision for GPU acceleration
- INT8 Dynamic: Dynamic quantization for CPU
- TorchScript: Graph optimization
- Model Pruning: Structured weight pruning
- Knowledge Distillation: Teacher-student training

Metrics measured:

- Average latency per sample
- Throughput (samples/second)
- Memory usage
- Model size compression
- Speedup vs. baseline
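To illustrate what the FP16 and TorchScript rows mean in practice, here is a minimal, hedged sketch (independent of the toolkit) that casts a model to half precision on the GPU and traces it with TorchScript; the checkpoint name is just an example:

```python
# fp16_torchscript_sketch.py -- half precision plus TorchScript tracing
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(name)
# torchscript=True makes the model return plain tuples so it can be traced
model = AutoModelForSequenceClassification.from_pretrained(name, torchscript=True).eval()

if torch.cuda.is_available():
    model = model.half().cuda()

inputs = tokenizer("Jetson makes edge AI practical.", return_tensors="pt")
if torch.cuda.is_available():
    inputs = {k: v.cuda() for k, v in inputs.items()}

# Trace with example inputs; the traced graph can be saved and reloaded later
traced = torch.jit.trace(model, (inputs["input_ids"], inputs["attention_mask"]))
with torch.no_grad():
    logits = traced(inputs["input_ids"], inputs["attention_mask"])[0]
print(logits)
```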
## Lab 3: Compare LLM Inference in Containers

### Objective
Evaluate inference speed and memory usage for different LLM deployment methods on Jetson inside Docker containers.
### Set Up a Container for Each Method

#### HuggingFace Container

```bash
docker run --rm -it --runtime nvidia \
  -v $(pwd):/workspace \
  nvcr.io/nvidia/pytorch:24.04-py3 /bin/bash
```

Inside the container:

```bash
pip install transformers accelerate torch
```
#### llama.cpp Container

```bash
docker run --rm -it --runtime nvidia \
  -v $(pwd)/models:/models \
  jetson-llama-cpp /bin/bash
```

(Assumes the container has CUDA and llama.cpp compiled.)
#### Ollama Container

```bash
docker run --rm -it --network host \
  -v ollama:/root/.ollama ollama/ollama
```
### Run LLM Inference Comparison

Use the `jetson_nlp_toolkit.py` llm command:
```bash
# Compare all available LLM inference methods
python jetson_nlp_toolkit.py llm --method all

# Test specific inference methods
python jetson_nlp_toolkit.py llm --method huggingface --model "microsoft/DialoGPT-small"
python jetson_nlp_toolkit.py llm --method llamacpp --model-path "/models/mistral.gguf"
python jetson_nlp_toolkit.py llm --method ollama --model "mistral"

# Custom prompts and settings
python jetson_nlp_toolkit.py llm --prompts "Explain the future of AI in education." --max-tokens 100

# Save comparison results
python jetson_nlp_toolkit.py llm --output llm_comparison_results.json
```
The LLM comparison evaluates:

- HuggingFace Transformers: GPU-optimized inference
- llama.cpp: CPU-optimized quantized models
- Ollama API: Containerized LLM serving

Metrics measured:

- Average latency per prompt
- Tokens generated per second
- Memory usage
- Model loading time
- Response quality
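Tokens-per-second can also be estimated by hand as a cross-check. A minimal, hedged sketch with Hugging Face `generate` (the model name is only an example and is not the toolkit's default):

```python
# tokens_per_sec_sketch.py -- rough decoding-throughput measurement
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # example model; swap in any causal LM that fits in memory
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
if torch.cuda.is_available():
    model = model.half().cuda()

inputs = tokenizer("Explain the future of AI in education.", return_tensors="pt").to(model.device)
start = time.time()
out = model.generate(**inputs, max_new_tokens=100, do_sample=False)
elapsed = time.time() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tokens/sec")
```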
### Record Results

| Method | Latency (s) | Tokens/sec | GPU Mem (MB) |
|---|---|---|---|
| HuggingFace PyTorch | | | |
| llama-cpp-python | | | |
| Ollama REST API | | | |
Use `tegrastats` or `jtop` to observe GPU memory and CPU usage during inference.
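One hedged way to capture these numbers while a benchmark runs (tegrastats flags may vary slightly between JetPack releases):

```bash
# Log tegrastats once per second to a file while the benchmark runs, then stop it
sudo tegrastats --interval 1000 --logfile tegrastats.log &
python jetson_nlp_toolkit.py llm --method all
sudo tegrastats --stop
```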
## Lab Deliverables

### For Lab 1 (Multi-Application Benchmark)

- Completed evaluation results JSON file
- Performance comparison charts for all NLP tasks
- Analysis report identifying the best models for each task on Jetson
- Resource utilization graphs (`tegrastats` screenshots)
### For Lab 2 (Optimization Techniques)
- Quantization comparison table
- Memory usage analysis
- Speedup and compression ratio calculations
- Recommendations for production deployment
### For Lab 3 (LLM Container Comparison)

- Completed benchmark table
- Screenshots of `tegrastats` during inference
- Analysis: which approach is fastest, lightest, and most accurate for Jetson?
## Advanced NLP Optimization Strategies

### 1. Model Pruning for Jetson
```python
# model_pruning.py
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForSequenceClassification


def prune_model(model, pruning_ratio=0.2):
    """Apply L1 unstructured magnitude pruning to every Linear layer."""
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name='weight', amount=pruning_ratio)
            prune.remove(module, 'weight')  # make the pruning permanent
    return model


# Example usage
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
pruned_model = prune_model(model, pruning_ratio=0.3)
```
### 2. Knowledge Distillation
```python
# knowledge_distillation.py
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class DistillationTrainer:
    def __init__(self, teacher_model, student_model, temperature=3.0, alpha=0.7):
        self.teacher = teacher_model
        self.student = student_model
        self.temperature = temperature
        self.alpha = alpha
        self.kl_loss = nn.KLDivLoss(reduction='batchmean')
        self.ce_loss = nn.CrossEntropyLoss()

    def distillation_loss(self, student_logits, teacher_logits, labels):
        """Calculate the combined distillation loss."""
        # Soft targets from the teacher
        teacher_probs = torch.softmax(teacher_logits / self.temperature, dim=1)
        student_log_probs = torch.log_softmax(student_logits / self.temperature, dim=1)

        # Soft-target (distillation) loss, scaled by T^2
        distill_loss = self.kl_loss(student_log_probs, teacher_probs) * (self.temperature ** 2)

        # Hard-target loss
        hard_loss = self.ce_loss(student_logits, labels)

        # Combined loss
        total_loss = self.alpha * distill_loss + (1 - self.alpha) * hard_loss
        return total_loss
```
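A hedged usage sketch for a single distillation step using the class above (the teacher/student checkpoints are illustrative; in practice you would iterate over a DataLoader and step an optimizer on the student):

```python
# distillation_step_sketch.py -- one illustrative distillation step
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

teacher = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2).eval()
student = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

trainer = DistillationTrainer(teacher, student)  # class defined above
batch = tokenizer(["great movie", "awful acting"], return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])

with torch.no_grad():
    teacher_logits = teacher(**batch).logits
student_logits = student(**batch).logits

loss = trainer.distillation_loss(student_logits, teacher_logits, labels)
loss.backward()  # then step an optimizer on the student's parameters
print(f"Distillation loss: {loss.item():.4f}")
```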
### 3. Dynamic Batching for Real-time Inference
```python
# dynamic_batching.py
import asyncio
import time
from collections import deque
from typing import List

import torch


class DynamicBatcher:
    def __init__(self, model, tokenizer, max_batch_size=8, max_wait_time=0.1):
        self.model = model
        self.tokenizer = tokenizer
        self.max_batch_size = max_batch_size
        self.max_wait_time = max_wait_time
        self.request_queue = deque()
        self.processing = False

    async def add_request(self, text: str) -> str:
        """Queue an inference request and await its result."""
        future = asyncio.Future()
        self.request_queue.append((text, future))
        if not self.processing:
            asyncio.create_task(self.process_batch())
        return await future

    async def process_batch(self):
        """Drain the queue, grouping requests into batches."""
        self.processing = True
        while self.request_queue:
            batch, futures = [], []
            start_time = time.time()

            # Collect up to max_batch_size requests, waiting at most max_wait_time
            while (len(batch) < self.max_batch_size and
                   self.request_queue and
                   (time.time() - start_time) < self.max_wait_time):
                text, future = self.request_queue.popleft()
                batch.append(text)
                futures.append(future)
                if not self.request_queue:
                    await asyncio.sleep(0.01)  # brief pause to let more requests arrive

            if batch:
                # Process the batch and deliver results to the waiting futures
                results = await self.inference_batch(batch)
                for future, result in zip(futures, results):
                    future.set_result(result)

        self.processing = False

    async def inference_batch(self, texts: List[str]) -> List[str]:
        """Run batched inference."""
        inputs = self.tokenizer(
            texts,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512
        )
        with torch.no_grad():
            outputs = self.model(**inputs)
            predictions = torch.argmax(outputs.logits, dim=-1)
        return [f"Prediction: {pred.item()}" for pred in predictions]
```
### 4. Real-time Performance Monitoring
```python
# performance_monitor.py
import time
from collections import deque
from threading import Thread

import matplotlib.pyplot as plt
import psutil
import torch


class JetsonNLPMonitor:
    def __init__(self, window_size=100):
        self.window_size = window_size
        self.metrics = {
            'latency': deque(maxlen=window_size),
            'throughput': deque(maxlen=window_size),
            'gpu_memory': deque(maxlen=window_size),
            'cpu_usage': deque(maxlen=window_size),
            'timestamps': deque(maxlen=window_size)
        }
        self.monitoring = False

    def start_monitoring(self):
        """Start background monitoring."""
        self.monitoring = True
        monitor_thread = Thread(target=self._monitor_loop)
        monitor_thread.daemon = True
        monitor_thread.start()

    def stop_monitoring(self):
        """Stop monitoring."""
        self.monitoring = False

    def _monitor_loop(self):
        """Background monitoring loop."""
        while self.monitoring:
            timestamp = time.time()

            # GPU memory allocated by PyTorch
            if torch.cuda.is_available():
                gpu_memory = torch.cuda.memory_allocated() / 1024 / 1024  # MB
            else:
                gpu_memory = 0

            # CPU usage
            cpu_usage = psutil.cpu_percent()

            self.metrics['gpu_memory'].append(gpu_memory)
            self.metrics['cpu_usage'].append(cpu_usage)
            self.metrics['timestamps'].append(timestamp)

            time.sleep(0.1)  # sample every 100 ms

    def log_inference(self, latency, batch_size=1):
        """Log per-inference metrics."""
        self.metrics['latency'].append(latency * 1000)            # convert to ms
        self.metrics['throughput'].append(batch_size / latency)   # samples/sec

    def get_stats(self):
        """Return averaged statistics over the current window."""
        if not self.metrics['latency']:
            return {}
        return {
            'avg_latency_ms': sum(self.metrics['latency']) / len(self.metrics['latency']),
            'avg_throughput': sum(self.metrics['throughput']) / len(self.metrics['throughput']),
            'avg_gpu_memory_mb': sum(self.metrics['gpu_memory']) / len(self.metrics['gpu_memory']),
            'avg_cpu_usage': sum(self.metrics['cpu_usage']) / len(self.metrics['cpu_usage']),
            'total_inferences': len(self.metrics['latency'])
        }

    def plot_metrics(self, save_path="nlp_performance.png"):
        """Plot performance metrics on a 2x2 figure."""
        fig, axes = plt.subplots(2, 2, figsize=(12, 8))

        # Latency
        axes[0, 0].plot(list(self.metrics['latency']))
        axes[0, 0].set_title('Inference Latency (ms)')
        axes[0, 0].set_ylabel('Latency (ms)')

        # Throughput
        axes[0, 1].plot(list(self.metrics['throughput']))
        axes[0, 1].set_title('Throughput (samples/sec)')
        axes[0, 1].set_ylabel('Samples/sec')

        # GPU memory
        axes[1, 0].plot(list(self.metrics['gpu_memory']))
        axes[1, 0].set_title('GPU Memory Usage (MB)')
        axes[1, 0].set_ylabel('Memory (MB)')

        # CPU usage
        axes[1, 1].plot(list(self.metrics['cpu_usage']))
        axes[1, 1].set_title('CPU Usage (%)')
        axes[1, 1].set_ylabel('CPU %')

        plt.tight_layout()
        plt.savefig(save_path)
        plt.show()
        print(f"Performance plots saved to {save_path}")
```
## Bonus Lab: Export HuggingFace → ONNX → TensorRT

- Export:

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased").eval()
dummy = torch.randint(0, 100, (1, 64))  # dummy token IDs with shape (batch, seq_len)
torch.onnx.export(model, (dummy,), "model.onnx",
                  input_names=["input_ids"], output_names=["logits"], opset_version=14)
```
- Convert:

```bash
trtexec --onnx=model.onnx --saveEngine=model.trt
```
- Run the engine with the TensorRT Python bindings or `onnxruntime-gpu`, as in the sketch below.
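A minimal, hedged `onnxruntime-gpu` sketch for the model exported above; it falls back to CPU if the CUDA execution provider is unavailable, and pads to the fixed sequence length used during export:

```python
# ort_infer_sketch.py -- run the exported ONNX model with onnxruntime
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

enc = tokenizer("Edge AI on Jetson", return_tensors="np",
                padding="max_length", truncation=True, max_length=64)
logits = session.run(None, {"input_ids": enc["input_ids"].astype(np.int64)})[0]
print(logits)
```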
## Production Deployment Strategies

### 1. Multi-Stage Docker Optimization
```dockerfile
# Dockerfile.nlp-production
# Multi-stage build for optimized NLP deployment
FROM nvcr.io/nvidia/pytorch:24.04-py3 AS builder

# Install build dependencies
RUN pip install transformers torchaudio torchvision
RUN pip install onnx onnxruntime-gpu tensorrt

# Copy and optimize models
COPY models/ /tmp/models/
COPY scripts/optimize_models.py /tmp/
RUN python /tmp/optimize_models.py

# Production stage
FROM nvcr.io/nvidia/pytorch:24.04-py3

# Install only runtime dependencies
RUN pip install --no-cache-dir \
    transformers==4.36.0 \
    torch==2.1.0 \
    onnxruntime-gpu==1.16.0 \
    fastapi==0.104.0 \
    uvicorn==0.24.0

# Copy optimized models and application code
COPY --from=builder /tmp/optimized_models/ /app/models/
COPY src/ /app/src/

WORKDIR /app
EXPOSE 8000

CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
```
### 2. FastAPI Production Server

Use the `jetson_nlp_toolkit.py` server command:
```bash
# Start the production NLP server
python jetson_nlp_toolkit.py server --host 0.0.0.0 --port 8000

# Start with custom models
python jetson_nlp_toolkit.py server --sentiment-model "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Start with monitoring enabled
python jetson_nlp_toolkit.py server --enable-monitoring
```
The production server includes:

- FastAPI framework: high-performance async API
- Model optimization: FP16 precision and GPU acceleration
- Batch processing: efficient handling of multiple requests
- Health monitoring: system status and performance metrics
- Error handling: robust error management and logging
- WebSocket support: real-time chat interface
#### API Endpoints Available

- `GET /health`: System health and model status
- `POST /sentiment`: Batch sentiment analysis
- `POST /qa`: Question answering
- `POST /summarize`: Text summarization
- `WebSocket /chat`: Real-time chat interface
#### Testing the Server
```bash
# Test health endpoint
curl http://localhost:8000/health

# Test sentiment analysis
curl -X POST "http://localhost:8000/sentiment" \
  -H "Content-Type: application/json" \
  -d '{"texts": ["I love this product!", "This is terrible"], "batch_size": 2}'

# Test question answering
curl -X POST "http://localhost:8000/qa" \
  -H "Content-Type: application/json" \
  -d '{"questions": ["What is AI?"], "contexts": ["Artificial Intelligence is the simulation of human intelligence."]}'
```
### 3. Load Testing & Performance Validation

Use the `jetson_nlp_toolkit.py` loadtest command for comprehensive load testing:
```bash
# Run a load test against the running server
python jetson_nlp_toolkit.py loadtest --url http://localhost:8000

# Custom load test parameters
python jetson_nlp_toolkit.py loadtest --url http://localhost:8000 --concurrent 20 --requests 500

# Test specific endpoints
python jetson_nlp_toolkit.py loadtest --url http://localhost:8000 --endpoint sentiment
python jetson_nlp_toolkit.py loadtest --url http://localhost:8000 --endpoint qa

# Save load test results
python jetson_nlp_toolkit.py loadtest --url http://localhost:8000 --output load_test_results.json
```
The load tester provides:

- Concurrent testing: simulates multiple users
- Endpoint coverage: tests all API endpoints
- Performance metrics: latency, throughput, success rate
- Error analysis: detailed failure reporting
- Results export: JSON format for further analysis
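To make the concurrency model concrete, here is a hedged sketch of the basic idea using `aiohttp` (the endpoint and payload match the `/sentiment` example above; this is not the toolkit's own load tester):

```python
# mini_loadtest_sketch.py -- fire concurrent requests at the sentiment endpoint
import asyncio
import time
import aiohttp

URL = "http://localhost:8000/sentiment"
PAYLOAD = {"texts": ["I love this product!"], "batch_size": 1}

async def one_request(session):
    """Send one POST request and return its latency in seconds."""
    start = time.time()
    async with session.post(URL, json=PAYLOAD) as resp:
        await resp.json()
    return time.time() - start

async def main(concurrent=20):
    async with aiohttp.ClientSession() as session:
        latencies = await asyncio.gather(*(one_request(session) for _ in range(concurrent)))
    print(f"{concurrent} requests, avg latency {1000 * sum(latencies) / len(latencies):.1f} ms")

asyncio.run(main())
```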
---
## Final Challenge: Complete NLP Pipeline on Jetson

### Challenge Objective
Build a complete, production-ready NLP pipeline that processes real-world data and demonstrates all optimization techniques learned in this tutorial.
### Challenge Requirements
#### 1. **Multi-Modal NLP System**
Implement a system that handles:
- **Text Classification** (sentiment analysis on product reviews)
- **Information Extraction** (NER on news articles)
- **Question Answering** (FAQ system for customer support)
- **Text Summarization** (news article summarization)
- **Real-time Chat** (customer service chatbot)
#### 2. **Optimization Implementation**
- Apply **quantization** (FP16 minimum, INT8 preferred)
- Implement **dynamic batching** for throughput optimization
- Use **model pruning** to reduce memory footprint
- Deploy with **TensorRT** optimization where possible
- Implement **caching** for frequently requested content
#### 3. **Production Deployment**
- **Containerized deployment** with multi-stage Docker builds
- **REST API** with proper error handling and logging
- **Load balancing** for high availability
- **Monitoring and metrics** collection
- **Auto-scaling** based on resource utilization
#### 4. **Performance Benchmarking**
- **Latency analysis** (P50, P95, P99 percentiles)
- **Throughput measurement** (requests per second)
- **Resource utilization** (GPU/CPU/memory usage)
- **Accuracy validation** on standard datasets
- **Cost analysis** (inference cost per request)
### Implementation Guide

Use the comprehensive `jetson_nlp_toolkit.py` toolkit to implement the complete challenge:
```bash
# Run the comprehensive evaluation across all NLP tasks
python jetson_nlp_toolkit.py evaluate --task all --output challenge_evaluation.json

# Apply all optimization techniques
python jetson_nlp_toolkit.py optimize --method all --model distilbert-base-uncased-finetuned-sst-2-english

# Deploy the production server with monitoring
python jetson_nlp_toolkit.py server --enable-monitoring --host 0.0.0.0 --port 8000

# Run comprehensive load testing
python jetson_nlp_toolkit.py loadtest --url http://localhost:8000 --concurrent 50 --requests 1000

# Compare LLM inference methods
python jetson_nlp_toolkit.py llm --method all --model microsoft/DialoGPT-small
```
The toolkit provides all the necessary components:

- Multi-task NLP pipeline: sentiment, NER, QA, summarization, chat
- Optimization techniques: quantization, pruning, distillation, TensorRT
- Production deployment: FastAPI server with WebSocket support
- Performance monitoring: real-time metrics and analysis
- Load testing: comprehensive performance validation
- Caching & batching: Redis integration and dynamic batching
### Evaluation Criteria

| Criterion | Weight | Excellent (90-100%) | Good (70-89%) | Satisfactory (50-69%) |
|---|---|---|---|---|
| Functionality | 25% | All 5 NLP tasks working perfectly | 4/5 tasks working | 3/5 tasks working |
| Optimization | 25% | All optimization techniques applied | Most optimizations applied | Basic optimizations |
| Performance | 20% | <50 ms P95 latency, >100 req/s | <100 ms P95, >50 req/s | <200 ms P95, >20 req/s |
| Production Ready | 15% | Full deployment with monitoring | Basic deployment | Local deployment only |
| Code Quality | 10% | Clean, documented, tested | Well-structured | Basic implementation |
| Innovation | 5% | Novel optimizations/features | Creative solutions | Standard implementation |
### Bonus Challenges
- Multi-Language Support: Extend the pipeline to handle multiple languages
- Edge Deployment: Deploy on actual Jetson hardware with resource constraints
- Federated Learning: Implement model updates without centralized data
- Real-time Streaming: Process continuous data streams with low latency
- Custom Models: Train and deploy domain-specific models
## Summary
- Comprehensive NLP Applications: Covered 6 major NLP tasks with Jetson-specific optimizations
- Advanced Optimization Techniques: Quantization, pruning, distillation, and dynamic batching
- Production Deployment: Multi-stage Docker builds, FastAPI servers, and load testing
- Performance Monitoring: Real-time metrics collection and analysis
- Practical Evaluation: Standardized benchmarking on popular datasets
- Complete Pipeline: End-to-end solution from development to production
- All-in-One Toolkit: Unified command-line tool (`jetson_nlp_toolkit.py`) combining all examples and implementations
This tutorial provides a comprehensive foundation for deploying production-ready NLP applications on Jetson devices, balancing performance, accuracy, and resource efficiency. The included all-in-one toolkit makes it easy to experiment with different NLP tasks, optimization techniques, and deployment strategies using a simple command-line interface.