# NLP Applications & LLM Optimization on Jetson
Author: Dr. Kaikai Liu, Ph.D.
Position: Associate Professor, Computer Engineering
Institution: San Jose State University
Contact: kaikai.liu@sjsu.edu
This tutorial covers essential techniques for deploying and optimizing NLP applications on Jetson devices. You'll learn about various optimization strategies, benchmark different NLP tasks, and implement production-ready solutions. All examples and implementations are combined into a single, easy-to-use command-line tool (`jetson_nlp_toolkit.py`) that you can use to experiment with different approaches.
## Why Optimize LLMs on Jetson?
The Jetson Orin Nano has limited power and memory (e.g., 8 GB), so optimizing models for:

- Lower memory usage
- Faster inference latency
- Better energy efficiency

enables real-time NLP applications at the edge.
## Optimization Strategies

### 1. Model Quantization

Quantization reduces the precision of model weights (e.g., FP32 → INT8 or Q4) to shrink model size and improve inference speed.

#### What is Q4_K_M?
- Q4 = 4-bit quantization (roughly 8x smaller than FP32 weights)
- K = llama.cpp's "K-quant" scheme: block-wise grouped quantization that helps preserve accuracy
- M = the "medium" variant, which keeps selected tensors at higher precision for better quality

Q4_K_M is a common choice in llama.cpp for the best quality/speed tradeoff on Jetson.
### 2. Use Smaller or Distilled Models
Distillation creates smaller models (e.g., DistilBERT) by mimicking larger models while reducing parameters.
- Faster and lighter than full LLMs
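As a quick sanity check of the size difference, this hedged sketch simply counts parameters for BERT-base versus DistilBERT (both are standard Hugging Face checkpoints):

```python
# distil_size_check.py -- compare parameter counts of a full vs. distilled encoder
from transformers import AutoModel

for name in ["bert-base-uncased", "distilbert-base-uncased"]:
    model = AutoModel.from_pretrained(name)
    params = sum(p.numel() for p in model.parameters()) / 1e6
    print(f"{name}: {params:.0f}M parameters")
```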
### 3. Use TensorRT or ONNX for Inference

Export HuggingFace or PyTorch models to ONNX and run them with:

- `onnxruntime-gpu`
- TensorRT engines

for lower latency and reduced memory use (see the bonus lab below for a full export walkthrough).
### 4. Offload Selected Layers

For large models, tools like `llama-cpp-python` let you set `n_gpu_layers` to control how many transformer layers run on the GPU versus the CPU.
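A minimal sketch, assuming `llama-cpp-python` is installed and a local GGUF file is available (the model path below is a placeholder):

```python
# gpu_offload_sketch.py -- split transformer layers between GPU and CPU with llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="/models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=20,   # layers offloaded to the Jetson GPU; -1 offloads all of them
    n_ctx=2048,        # context window
)

out = llm("Explain edge AI in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```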
## NLP Application Evaluation Labs

### All-in-One NLP Toolkit
We've created a comprehensive Python script that combines all the NLP applications, optimization techniques, and evaluation methods from this tutorial into a single command-line tool.
#### Installation

```bash
# Clone the repository if you haven't already
git clone https://github.com/yourusername/edgeAI.git
cd edgeAI

# Install dependencies
pip install torch transformers datasets evaluate rouge-score fastapi uvicorn aiohttp psutil matplotlib numpy

# Optional dependencies for specific features
pip install redis llama-cpp-python requests
```
#### Usage
The toolkit provides several commands for different NLP tasks and optimizations:
```bash
python jetson_nlp_toolkit.py [command] [options]
```

Available commands:

- `evaluate`: Run the NLP evaluation suite
- `optimize`: Run optimization benchmarks
- `llm`: Compare LLM inference methods
- `server`: Run the NLP server
- `loadtest`: Run load tests against the NLP server
#### Examples

1. Evaluate NLP tasks:

```bash
# Run full evaluation suite
python jetson_nlp_toolkit.py evaluate --task all

# Evaluate a specific task (sentiment, qa, summarization, ner)
python jetson_nlp_toolkit.py evaluate --task sentiment

# Save results to a custom file
python jetson_nlp_toolkit.py evaluate --task all --output my_results.json
```
2. Benchmark optimization techniques:

```bash
# Compare quantization methods
python jetson_nlp_toolkit.py optimize --method quantization

# Test model pruning
python jetson_nlp_toolkit.py optimize --method pruning --ratio 0.3

# Use a custom model
python jetson_nlp_toolkit.py optimize --method all --model bert-base-uncased
```
3. Compare LLM inference methods:

```bash
# Test all inference methods
python jetson_nlp_toolkit.py llm --method all

# Test a specific method with a custom model
python jetson_nlp_toolkit.py llm --method huggingface --model gpt2

# Test llama.cpp with a custom model path
python jetson_nlp_toolkit.py llm --method llamacpp --model-path /path/to/model.gguf
```
4. Run the NLP server:

```bash
# Start server on the default port (8000)
python jetson_nlp_toolkit.py server

# Specify host and port
python jetson_nlp_toolkit.py server --host 127.0.0.1 --port 5000
```
5. Run load tests:

```bash
# Test the server with default settings
python jetson_nlp_toolkit.py loadtest

# Custom test configuration
python jetson_nlp_toolkit.py loadtest --url http://localhost:8000 --concurrent 20 --requests 500
```
## Lab 1: Multi-Application NLP Benchmark Suite
Objective: Evaluate and compare different NLP applications on Jetson using standardized datasets
### Setup Evaluation Environment

```bash
# Create evaluation container
docker run --rm -it --runtime nvidia \
  -v $(pwd)/nlp_eval:/workspace \
  -v $(pwd)/datasets:/datasets \
  nvcr.io/nvidia/pytorch:24.04-py3 /bin/bash

# Install evaluation dependencies
pip install transformers datasets evaluate rouge-score sacrebleu spacy
python -m spacy download en_core_web_sm
```
### Run Comprehensive Evaluation
Use the `jetson_nlp_toolkit.py` evaluate command:
```bash
# Run all NLP evaluations
python jetson_nlp_toolkit.py evaluate --task all

# Run specific task evaluations
python jetson_nlp_toolkit.py evaluate --task sentiment
python jetson_nlp_toolkit.py evaluate --task qa
python jetson_nlp_toolkit.py evaluate --task summarization
python jetson_nlp_toolkit.py evaluate --task ner

# Save results to a custom file
python jetson_nlp_toolkit.py evaluate --output my_results.json
```
The evaluation suite includes:

- Sentiment Analysis: IMDB dataset with DistilBERT
- Question Answering: SQuAD dataset with DistilBERT
- Text Summarization: CNN/DailyMail with T5-small
- Named Entity Recognition: CoNLL-2003 with BERT-large

Each evaluation measures:

- Accuracy/F1 scores
- Latency and throughput
- GPU memory usage
- CPU utilization
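For reference, the latency and GPU-memory numbers can be reproduced by hand with a few lines of PyTorch. This is a hedged sketch (not the toolkit's exact measurement code) using the standard SST-2 DistilBERT checkpoint:

```python
# latency_sketch.py -- rough per-sample latency and GPU memory for a sentiment model
import time
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1
clf = pipeline("sentiment-analysis",
               model="distilbert-base-uncased-finetuned-sst-2-english",
               device=device)

texts = ["I love this product!"] * 32
start = time.time()
_ = clf(texts, batch_size=8)
elapsed = time.time() - start

print(f"Avg latency: {1000 * elapsed / len(texts):.1f} ms/sample")
if torch.cuda.is_available():
    print(f"GPU memory: {torch.cuda.max_memory_allocated() / 1e6:.0f} MB")
```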
## Lab 2: Advanced Optimization Techniques
Objective: Implement and compare advanced optimization strategies for NLP models on Jetson
### Run Optimization Benchmarks

Use the `jetson_nlp_toolkit.py` optimize command:
```bash
# Run all optimization benchmarks
python jetson_nlp_toolkit.py optimize --method all

# Run specific optimization methods
python jetson_nlp_toolkit.py optimize --method quantization
python jetson_nlp_toolkit.py optimize --method pruning --ratio 0.3
python jetson_nlp_toolkit.py optimize --method distillation

# Benchmark with a custom model
python jetson_nlp_toolkit.py optimize --model "bert-base-uncased" --samples 100

# Save optimization results
python jetson_nlp_toolkit.py optimize --output optimization_results.json
```
The optimization benchmark compares:

- FP32: Full-precision baseline
- FP16: Half precision for GPU acceleration
- INT8 Dynamic: Dynamic quantization for CPU
- TorchScript: Graph optimization
- Model Pruning: Structured weight pruning
- Knowledge Distillation: Teacher-student training

Metrics measured:

- Average latency per sample
- Throughput (samples/second)
- Memory usage
- Model size compression
- Speedup vs. baseline
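To illustrate what the FP16 and TorchScript rows mean in practice, here is a minimal, hedged sketch (independent of the toolkit) that casts a model to half precision on the GPU and traces it with TorchScript; the checkpoint name is just an example:

```python
# fp16_torchscript_sketch.py -- half precision plus TorchScript tracing
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(name)
# torchscript=True makes the model return plain tuples so it can be traced
model = AutoModelForSequenceClassification.from_pretrained(name, torchscript=True).eval()

if torch.cuda.is_available():
    model = model.half().cuda()

inputs = tokenizer("Jetson makes edge AI practical.", return_tensors="pt")
if torch.cuda.is_available():
    inputs = {k: v.cuda() for k, v in inputs.items()}

# Trace with example inputs; the traced graph can be saved and reloaded later
traced = torch.jit.trace(model, (inputs["input_ids"], inputs["attention_mask"]))
with torch.no_grad():
    logits = traced(inputs["input_ids"], inputs["attention_mask"])[0]
print(logits)
```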
## Lab 3: Compare LLM Inference in Containers

### Objective
Evaluate inference speed and memory usage for different LLM deployment methods on Jetson inside Docker containers.
### Set Up a Container for Each Method

#### HuggingFace Container

```bash
docker run --rm -it --runtime nvidia \
  -v $(pwd):/workspace \
  nvcr.io/nvidia/pytorch:24.04-py3 /bin/bash
```

Inside the container:

```bash
pip install transformers accelerate torch
```
#### llama.cpp Container

```bash
docker run --rm -it --runtime nvidia \
  -v $(pwd)/models:/models \
  jetson-llama-cpp /bin/bash
```

(Assumes the container has CUDA and llama.cpp compiled.)
#### Ollama Container

```bash
docker run --rm -it --network host \
  -v ollama:/root/.ollama ollama/ollama
```
### Run LLM Inference Comparison

Use the `jetson_nlp_toolkit.py` llm command:
```bash
# Compare all available LLM inference methods
python jetson_nlp_toolkit.py llm --method all

# Test specific inference methods
python jetson_nlp_toolkit.py llm --method huggingface --model "microsoft/DialoGPT-small"
python jetson_nlp_toolkit.py llm --method llamacpp --model-path "/models/mistral.gguf"
python jetson_nlp_toolkit.py llm --method ollama --model "mistral"

# Custom prompts and settings
python jetson_nlp_toolkit.py llm --prompts "Explain the future of AI in education." --max-tokens 100

# Save comparison results
python jetson_nlp_toolkit.py llm --output llm_comparison_results.json
```
The LLM comparison evaluates:

- HuggingFace Transformers: GPU-optimized inference
- llama.cpp: CPU-optimized quantized models
- Ollama API: Containerized LLM serving

Metrics measured:

- Average latency per prompt
- Tokens generated per second
- Memory usage
- Model loading time
- Response quality
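Tokens-per-second can also be estimated by hand as a cross-check. A minimal, hedged sketch with Hugging Face `generate` (the model name is only an example and is not the toolkit's default):

```python
# tokens_per_sec_sketch.py -- rough decoding-throughput measurement
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # example model; swap in any causal LM that fits in memory
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
if torch.cuda.is_available():
    model = model.half().cuda()

inputs = tokenizer("Explain the future of AI in education.", return_tensors="pt").to(model.device)
start = time.time()
out = model.generate(**inputs, max_new_tokens=100, do_sample=False)
elapsed = time.time() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tokens/sec")
```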
### Record Results

| Method | Latency (s) | Tokens/sec | GPU Mem (MB) |
|---|---|---|---|
| HuggingFace PyTorch | | | |
| llama-cpp-python | | | |
| Ollama REST API | | | |
Use `tegrastats` or `jtop` to observe GPU memory and CPU usage during inference.
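One hedged way to capture these numbers while a benchmark runs (tegrastats flags may vary slightly between JetPack releases):

```bash
# Log tegrastats once per second to a file while the benchmark runs, then stop it
sudo tegrastats --interval 1000 --logfile tegrastats.log &
python jetson_nlp_toolkit.py llm --method all
sudo tegrastats --stop
```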
## Lab Deliverables

### For Lab 1 (Multi-Application Benchmark)

- Completed evaluation results JSON file
- Performance comparison charts for all NLP tasks
- Analysis report identifying the best models for each task on Jetson
- Resource utilization graphs (`tegrastats` screenshots)
### For Lab 2 (Optimization Techniques)
- Quantization comparison table
- Memory usage analysis
- Speedup and compression ratio calculations
- Recommendations for production deployment
### For Lab 3 (LLM Container Comparison)

- Completed benchmark table
- Screenshots of `tegrastats` during inference
- Analysis: which approach is fastest, lightest, and most accurate for Jetson?
## Advanced NLP Optimization Strategies

### 1. Model Pruning for Jetson
```python
# model_pruning.py
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForSequenceClassification


def prune_model(model, pruning_ratio=0.2):
    """Apply L1 unstructured magnitude pruning to every Linear layer."""
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name='weight', amount=pruning_ratio)
            prune.remove(module, 'weight')  # make the pruning permanent
    return model


# Example usage
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
pruned_model = prune_model(model, pruning_ratio=0.3)
```
### 2. Knowledge Distillation
```python
# knowledge_distillation.py
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class DistillationTrainer:
    def __init__(self, teacher_model, student_model, temperature=3.0, alpha=0.7):
        self.teacher = teacher_model
        self.student = student_model
        self.temperature = temperature
        self.alpha = alpha
        self.kl_loss = nn.KLDivLoss(reduction='batchmean')
        self.ce_loss = nn.CrossEntropyLoss()

    def distillation_loss(self, student_logits, teacher_logits, labels):
        """Calculate the combined distillation loss."""
        # Soft targets from the teacher
        teacher_probs = torch.softmax(teacher_logits / self.temperature, dim=1)
        student_log_probs = torch.log_softmax(student_logits / self.temperature, dim=1)

        # Soft-target (distillation) loss, scaled by T^2
        distill_loss = self.kl_loss(student_log_probs, teacher_probs) * (self.temperature ** 2)

        # Hard-target loss
        hard_loss = self.ce_loss(student_logits, labels)

        # Combined loss
        total_loss = self.alpha * distill_loss + (1 - self.alpha) * hard_loss
        return total_loss
```
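A hedged usage sketch for a single distillation step using the class above (the teacher/student checkpoints are illustrative; in practice you would iterate over a DataLoader and step an optimizer on the student):

```python
# distillation_step_sketch.py -- one illustrative distillation step
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

teacher = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2).eval()
student = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

trainer = DistillationTrainer(teacher, student)  # class defined above
batch = tokenizer(["great movie", "awful acting"], return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])

with torch.no_grad():
    teacher_logits = teacher(**batch).logits
student_logits = student(**batch).logits

loss = trainer.distillation_loss(student_logits, teacher_logits, labels)
loss.backward()  # then step an optimizer on the student's parameters
print(f"Distillation loss: {loss.item():.4f}")
```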
### 3. Dynamic Batching for Real-time Inference
```python
# dynamic_batching.py
import asyncio
import time
from collections import deque
from typing import List

import torch


class DynamicBatcher:
    def __init__(self, model, tokenizer, max_batch_size=8, max_wait_time=0.1):
        self.model = model
        self.tokenizer = tokenizer
        self.max_batch_size = max_batch_size
        self.max_wait_time = max_wait_time
        self.request_queue = deque()
        self.processing = False

    async def add_request(self, text: str) -> str:
        """Queue an inference request and await its result."""
        future = asyncio.Future()
        self.request_queue.append((text, future))
        if not self.processing:
            asyncio.create_task(self.process_batch())
        return await future

    async def process_batch(self):
        """Drain the queue, grouping requests into batches."""
        self.processing = True
        while self.request_queue:
            batch, futures = [], []
            start_time = time.time()

            # Collect up to max_batch_size requests, waiting at most max_wait_time
            while (len(batch) < self.max_batch_size and
                   self.request_queue and
                   (time.time() - start_time) < self.max_wait_time):
                text, future = self.request_queue.popleft()
                batch.append(text)
                futures.append(future)
                if not self.request_queue:
                    await asyncio.sleep(0.01)  # brief pause to let more requests arrive

            if batch:
                # Process the batch and deliver results to the waiting futures
                results = await self.inference_batch(batch)
                for future, result in zip(futures, results):
                    future.set_result(result)

        self.processing = False

    async def inference_batch(self, texts: List[str]) -> List[str]:
        """Run batched inference."""
        inputs = self.tokenizer(
            texts,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512
        )
        with torch.no_grad():
            outputs = self.model(**inputs)
            predictions = torch.argmax(outputs.logits, dim=-1)
        return [f"Prediction: {pred.item()}" for pred in predictions]
```
### 4. Real-time Performance Monitoring
```python
# performance_monitor.py
import time
from collections import deque
from threading import Thread

import matplotlib.pyplot as plt
import psutil
import torch


class JetsonNLPMonitor:
    def __init__(self, window_size=100):
        self.window_size = window_size
        self.metrics = {
            'latency': deque(maxlen=window_size),
            'throughput': deque(maxlen=window_size),
            'gpu_memory': deque(maxlen=window_size),
            'cpu_usage': deque(maxlen=window_size),
            'timestamps': deque(maxlen=window_size)
        }
        self.monitoring = False

    def start_monitoring(self):
        """Start background monitoring."""
        self.monitoring = True
        monitor_thread = Thread(target=self._monitor_loop)
        monitor_thread.daemon = True
        monitor_thread.start()

    def stop_monitoring(self):
        """Stop monitoring."""
        self.monitoring = False

    def _monitor_loop(self):
        """Background monitoring loop."""
        while self.monitoring:
            timestamp = time.time()

            # GPU memory allocated by PyTorch
            if torch.cuda.is_available():
                gpu_memory = torch.cuda.memory_allocated() / 1024 / 1024  # MB
            else:
                gpu_memory = 0

            # CPU usage
            cpu_usage = psutil.cpu_percent()

            self.metrics['gpu_memory'].append(gpu_memory)
            self.metrics['cpu_usage'].append(cpu_usage)
            self.metrics['timestamps'].append(timestamp)

            time.sleep(0.1)  # sample every 100 ms

    def log_inference(self, latency, batch_size=1):
        """Log per-inference metrics."""
        self.metrics['latency'].append(latency * 1000)            # convert to ms
        self.metrics['throughput'].append(batch_size / latency)   # samples/sec

    def get_stats(self):
        """Return averaged statistics over the current window."""
        if not self.metrics['latency']:
            return {}
        return {
            'avg_latency_ms': sum(self.metrics['latency']) / len(self.metrics['latency']),
            'avg_throughput': sum(self.metrics['throughput']) / len(self.metrics['throughput']),
            'avg_gpu_memory_mb': sum(self.metrics['gpu_memory']) / len(self.metrics['gpu_memory']),
            'avg_cpu_usage': sum(self.metrics['cpu_usage']) / len(self.metrics['cpu_usage']),
            'total_inferences': len(self.metrics['latency'])
        }

    def plot_metrics(self, save_path="nlp_performance.png"):
        """Plot performance metrics on a 2x2 figure."""
        fig, axes = plt.subplots(2, 2, figsize=(12, 8))

        # Latency
        axes[0, 0].plot(list(self.metrics['latency']))
        axes[0, 0].set_title('Inference Latency (ms)')
        axes[0, 0].set_ylabel('Latency (ms)')

        # Throughput
        axes[0, 1].plot(list(self.metrics['throughput']))
        axes[0, 1].set_title('Throughput (samples/sec)')
        axes[0, 1].set_ylabel('Samples/sec')

        # GPU memory
        axes[1, 0].plot(list(self.metrics['gpu_memory']))
        axes[1, 0].set_title('GPU Memory Usage (MB)')
        axes[1, 0].set_ylabel('Memory (MB)')

        # CPU usage
        axes[1, 1].plot(list(self.metrics['cpu_usage']))
        axes[1, 1].set_title('CPU Usage (%)')
        axes[1, 1].set_ylabel('CPU %')

        plt.tight_layout()
        plt.savefig(save_path)
        plt.show()
        print(f"Performance plots saved to {save_path}")
```
## Bonus Lab: Export HuggingFace → ONNX → TensorRT

- Export:

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased").eval()
dummy = torch.randint(0, 100, (1, 64))  # dummy token IDs with shape (batch, seq_len)
torch.onnx.export(model, (dummy,), "model.onnx",
                  input_names=["input_ids"], output_names=["logits"], opset_version=14)
```
- Convert:

```bash
trtexec --onnx=model.onnx --saveEngine=model.trt
```
- Run the engine with the TensorRT Python bindings or `onnxruntime-gpu`, as in the sketch below.
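A minimal, hedged `onnxruntime-gpu` sketch for the model exported above; it falls back to CPU if the CUDA execution provider is unavailable, and pads to the fixed sequence length used during export:

```python
# ort_infer_sketch.py -- run the exported ONNX model with onnxruntime
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

enc = tokenizer("Edge AI on Jetson", return_tensors="np",
                padding="max_length", truncation=True, max_length=64)
logits = session.run(None, {"input_ids": enc["input_ids"].astype(np.int64)})[0]
print(logits)
```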
## Production Deployment Strategies

### 1. Multi-Stage Docker Optimization
```dockerfile
# Dockerfile.nlp-production
# Multi-stage build for optimized NLP deployment
FROM nvcr.io/nvidia/pytorch:24.04-py3 AS builder

# Install build dependencies
RUN pip install transformers torchaudio torchvision
RUN pip install onnx onnxruntime-gpu tensorrt

# Copy and optimize models
COPY models/ /tmp/models/
COPY scripts/optimize_models.py /tmp/
RUN python /tmp/optimize_models.py

# Production stage
FROM nvcr.io/nvidia/pytorch:24.04-py3

# Install only runtime dependencies
RUN pip install --no-cache-dir \
    transformers==4.36.0 \
    torch==2.1.0 \
    onnxruntime-gpu==1.16.0 \
    fastapi==0.104.0 \
    uvicorn==0.24.0

# Copy optimized models and application code
COPY --from=builder /tmp/optimized_models/ /app/models/
COPY src/ /app/src/

WORKDIR /app
EXPOSE 8000

CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
```
### 2. FastAPI Production Server

Use the `jetson_nlp_toolkit.py` server command:
```bash
# Start the production NLP server
python jetson_nlp_toolkit.py server --host 0.0.0.0 --port 8000

# Start with custom models
python jetson_nlp_toolkit.py server --sentiment-model "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Start with monitoring enabled
python jetson_nlp_toolkit.py server --enable-monitoring
```
The production server includes:

- FastAPI framework: high-performance async API
- Model optimization: FP16 precision and GPU acceleration
- Batch processing: efficient handling of multiple requests
- Health monitoring: system status and performance metrics
- Error handling: robust error management and logging
- WebSocket support: real-time chat interface
#### API Endpoints Available

- `GET /health`: System health and model status
- `POST /sentiment`: Batch sentiment analysis
- `POST /qa`: Question answering
- `POST /summarize`: Text summarization
- `WebSocket /chat`: Real-time chat interface
#### Testing the Server
```bash
# Test health endpoint
curl http://localhost:8000/health

# Test sentiment analysis
curl -X POST "http://localhost:8000/sentiment" \
  -H "Content-Type: application/json" \
  -d '{"texts": ["I love this product!", "This is terrible"], "batch_size": 2}'

# Test question answering
curl -X POST "http://localhost:8000/qa" \
  -H "Content-Type: application/json" \
  -d '{"questions": ["What is AI?"], "contexts": ["Artificial Intelligence is the simulation of human intelligence."]}'
```
### 3. Load Testing & Performance Validation

Use the `jetson_nlp_toolkit.py` loadtest command for comprehensive load testing:
```bash
# Run a load test against the running server
python jetson_nlp_toolkit.py loadtest --url http://localhost:8000

# Custom load test parameters
python jetson_nlp_toolkit.py loadtest --url http://localhost:8000 --concurrent 20 --requests 500

# Test specific endpoints
python jetson_nlp_toolkit.py loadtest --url http://localhost:8000 --endpoint sentiment
python jetson_nlp_toolkit.py loadtest --url http://localhost:8000 --endpoint qa

# Save load test results
python jetson_nlp_toolkit.py loadtest --url http://localhost:8000 --output load_test_results.json
```
The load tester provides:

- Concurrent testing: simulates multiple users
- Endpoint coverage: tests all API endpoints
- Performance metrics: latency, throughput, success rate
- Error analysis: detailed failure reporting
- Results export: JSON format for further analysis
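To make the concurrency model concrete, here is a hedged sketch of the basic idea using `aiohttp` (the endpoint and payload match the `/sentiment` example above; this is not the toolkit's own load tester):

```python
# mini_loadtest_sketch.py -- fire concurrent requests at the sentiment endpoint
import asyncio
import time
import aiohttp

URL = "http://localhost:8000/sentiment"
PAYLOAD = {"texts": ["I love this product!"], "batch_size": 1}

async def one_request(session):
    """Send one POST request and return its latency in seconds."""
    start = time.time()
    async with session.post(URL, json=PAYLOAD) as resp:
        await resp.json()
    return time.time() - start

async def main(concurrent=20):
    async with aiohttp.ClientSession() as session:
        latencies = await asyncio.gather(*(one_request(session) for _ in range(concurrent)))
    print(f"{concurrent} requests, avg latency {1000 * sum(latencies) / len(latencies):.1f} ms")

asyncio.run(main())
```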
---
## Final Challenge: Complete NLP Pipeline on Jetson

### Challenge Objective
Build a complete, production-ready NLP pipeline that processes real-world data and demonstrates all optimization techniques learned in this tutorial.
### Challenge Requirements
#### 1. **Multi-Modal NLP System**
Implement a system that handles:
- **Text Classification** (sentiment analysis on product reviews)
- **Information Extraction** (NER on news articles)
- **Question Answering** (FAQ system for customer support)
- **Text Summarization** (news article summarization)
- **Real-time Chat** (customer service chatbot)
#### 2. **Optimization Implementation**
- Apply **quantization** (FP16 minimum, INT8 preferred)
- Implement **dynamic batching** for throughput optimization
- Use **model pruning** to reduce memory footprint
- Deploy with **TensorRT** optimization where possible
- Implement **caching** for frequently requested content
#### 3. **Production Deployment**
- **Containerized deployment** with multi-stage Docker builds
- **REST API** with proper error handling and logging
- **Load balancing** for high availability
- **Monitoring and metrics** collection
- **Auto-scaling** based on resource utilization
#### 4. **Performance Benchmarking**
- **Latency analysis** (P50, P95, P99 percentiles)
- **Throughput measurement** (requests per second)
- **Resource utilization** (GPU/CPU/memory usage)
- **Accuracy validation** on standard datasets
- **Cost analysis** (inference cost per request)
### Implementation Guide

Use the comprehensive `jetson_nlp_toolkit.py` toolkit to implement the complete challenge:
```bash
# Run the comprehensive evaluation across all NLP tasks
python jetson_nlp_toolkit.py evaluate --task all --output challenge_evaluation.json

# Apply all optimization techniques
python jetson_nlp_toolkit.py optimize --method all --model distilbert-base-uncased-finetuned-sst-2-english

# Deploy the production server with monitoring
python jetson_nlp_toolkit.py server --enable-monitoring --host 0.0.0.0 --port 8000

# Run comprehensive load testing
python jetson_nlp_toolkit.py loadtest --url http://localhost:8000 --concurrent 50 --requests 1000

# Compare LLM inference methods
python jetson_nlp_toolkit.py llm --method all --model microsoft/DialoGPT-small
```
The toolkit provides all the necessary components:

- Multi-task NLP pipeline: sentiment, NER, QA, summarization, chat
- Optimization techniques: quantization, pruning, distillation, TensorRT
- Production deployment: FastAPI server with WebSocket support
- Performance monitoring: real-time metrics and analysis
- Load testing: comprehensive performance validation
- Caching & batching: Redis integration and dynamic batching
### Evaluation Criteria

| Criterion | Weight | Excellent (90-100%) | Good (70-89%) | Satisfactory (50-69%) |
|---|---|---|---|---|
| Functionality | 25% | All 5 NLP tasks working perfectly | 4/5 tasks working | 3/5 tasks working |
| Optimization | 25% | All optimization techniques applied | Most optimizations applied | Basic optimizations |
| Performance | 20% | <50 ms P95 latency, >100 req/s | <100 ms P95, >50 req/s | <200 ms P95, >20 req/s |
| Production Ready | 15% | Full deployment with monitoring | Basic deployment | Local deployment only |
| Code Quality | 10% | Clean, documented, tested | Well-structured | Basic implementation |
| Innovation | 5% | Novel optimizations/features | Creative solutions | Standard implementation |
### Bonus Challenges
- Multi-Language Support: Extend the pipeline to handle multiple languages
- Edge Deployment: Deploy on actual Jetson hardware with resource constraints
- Federated Learning: Implement model updates without centralized data
- Real-time Streaming: Process continuous data streams with low latency
- Custom Models: Train and deploy domain-specific models
## Summary
- Comprehensive NLP Applications: Covered 6 major NLP tasks with Jetson-specific optimizations
- Advanced Optimization Techniques: Quantization, pruning, distillation, and dynamic batching
- Production Deployment: Multi-stage Docker builds, FastAPI servers, and load testing
- Performance Monitoring: Real-time metrics collection and analysis
- Practical Evaluation: Standardized benchmarking on popular datasets
- Complete Pipeline: End-to-end solution from development to production
- All-in-One Toolkit: Unified command-line tool (`jetson_nlp_toolkit.py`) combining all examples and implementations
This tutorial provides a comprehensive foundation for deploying production-ready NLP applications on Jetson devices, balancing performance, accuracy, and resource efficiency. The included all-in-one toolkit makes it easy to experiment with different NLP tasks, optimization techniques, and deployment strategies using a simple command-line interface.