
🧠 NLP Applications & LLM Optimization on Jetson

Author: Dr. Kaikai Liu, Ph.D.
Position: Associate Professor, Computer Engineering
Institution: San Jose State University
Contact: kaikai.liu@sjsu.edu

This tutorial covers essential techniques for deploying and optimizing NLP applications on Jetson devices. You'll learn about various optimization strategies, benchmark different NLP tasks, and implement production-ready solutions. All examples and implementations are combined into a single, easy-to-use command-line tool (jetson_nlp_toolkit.py) that you can use to experiment with different approaches.

🤖 Why Optimize LLMs on Jetson?

The Jetson Orin Nano has a limited power budget and memory (e.g., 8 GB), so optimizing models for:

  • 💾 Lower memory usage
  • ⚡ Faster inference latency
  • 🔌 Better energy efficiency

enables real-time NLP applications at the edge.


🚀 Optimization Strategies

✅ 1. Model Quantization

Quantization reduces the precision of model weights (e.g., FP32 → INT8 or Q4) to shrink size and improve inference speed.

🔍 What is Q4_K_M?

  • Q4 = 4-bit quantization (roughly 8x smaller than FP32 weights)
  • K = k-quant scheme: weights are quantized in groups to preserve accuracy
  • M = "medium" size variant, balancing compression and quality

Q4_K_M is commonly used in llama.cpp as the best quality/speed tradeoff on Jetson.
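For PyTorch models (as opposed to GGUF files, whose Q4_K_M variants are produced by llama.cpp's own quantization tooling), a quick way to try quantization is dynamic INT8 quantization. A minimal sketch, assuming a DistilBERT checkpoint that fits comfortably in memory:

```python
# Minimal sketch: dynamic INT8 quantization of a HuggingFace classifier (CPU inference).
import os
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

# Replace Linear layers with dynamically quantized INT8 versions.
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

# Compare serialized sizes to see the memory savings.
torch.save(model.state_dict(), "model_fp32.pt")
torch.save(quantized.state_dict(), "model_int8.pt")
print(f"FP32: {os.path.getsize('model_fp32.pt') / 1e6:.1f} MB")
print(f"INT8: {os.path.getsize('model_int8.pt') / 1e6:.1f} MB")
```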

✅ 2. Use Smaller or Distilled Models

Distillation trains a smaller student model (e.g., DistilBERT) to mimic a larger teacher model, keeping most of the accuracy with far fewer parameters.

  • Faster and lighter than full LLMs
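As a quick illustration, a distilled checkpoint drops straight into the standard transformers pipeline API; the model name below is just one common example:

```python
# Minimal sketch: run a distilled sentiment model through the transformers pipeline API.
from transformers import pipeline

sentiment = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0,  # Jetson GPU if CUDA is available; use -1 to force CPU
)

print(sentiment("Edge inference on Jetson is surprisingly fast."))
```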

✅ 3. Use TensorRT or ONNX for Inference

Export HuggingFace or PyTorch models to ONNX and use:

  • onnxruntime-gpu
  • TensorRT engines (for low latency and reduced memory use)
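A rough sketch of the ONNX Runtime path, assuming a model.onnx exported with input name input_ids as in the Bonus Lab later in this tutorial:

```python
# Minimal sketch: run an exported ONNX model with onnxruntime-gpu.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],  # falls back to CPU
)

input_ids = np.random.randint(0, 100, size=(1, 64), dtype=np.int64)
logits = session.run(None, {"input_ids": input_ids})[0]
print("Output shape:", logits.shape)
```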

✅ 4. Offload Selected Layers

For large models, tools like llama-cpp-python let you set n_gpu_layers to control how many transformer layers run on the GPU versus the CPU.
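A minimal llama-cpp-python sketch; the GGUF path and layer count are placeholders you would tune to your model and available memory:

```python
# Minimal sketch: partial GPU offload with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=20,   # layers offloaded to the GPU; the rest run on the CPU
    n_ctx=2048,        # context window; smaller values use less memory
)

out = llm("Q: What is edge AI? A:", max_tokens=64)
print(out["choices"][0]["text"])
```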


📊 NLP Application Evaluation Labs

🛠️ All-in-One NLP Toolkit

We've created a comprehensive Python script that combines all the NLP applications, optimization techniques, and evaluation methods from this tutorial into a single command-line tool.

<!-- #### Installation

# Clone the repository if you haven't already
git clone https://github.com/yourusername/edgeAI.git
cd edgeAI

# Install dependencies
pip install torch transformers datasets evaluate rouge-score fastapi uvicorn aiohttp psutil matplotlib numpy

# Optional dependencies for specific features
pip install redis llama-cpp-python requests
``` -->

#### Usage

The toolkit provides several commands for different NLP tasks and optimizations:

```bash
python jetson_nlp_toolkit.py [command] [options]
```

Available commands:

- `evaluate`: Run NLP evaluation suite
- `optimize`: Run optimization benchmarks
- `llm`: Compare LLM inference methods
- `server`: Run NLP server
- `loadtest`: Run load tests against NLP server

Examples

1. Evaluate NLP Tasks:

# Run full evaluation suite
python jetson_nlp_toolkit.py evaluate --task all

# Evaluate specific task (sentiment, qa, summarization, ner)
python jetson_nlp_toolkit.py evaluate --task sentiment

# Save results to custom file
python jetson_nlp_toolkit.py evaluate --task all --output my_results.json

2. Benchmark Optimization Techniques:

# Compare quantization methods
python jetson_nlp_toolkit.py optimize --method quantization

# Test model pruning
python jetson_nlp_toolkit.py optimize --method pruning --ratio 0.3

# Use custom model
python jetson_nlp_toolkit.py optimize --method all --model bert-base-uncased

3. Compare LLM Inference Methods:

# Test all inference methods
python jetson_nlp_toolkit.py llm --method all

# Test specific method with custom model
python jetson_nlp_toolkit.py llm --method huggingface --model gpt2

# Test llama.cpp with custom model path
python jetson_nlp_toolkit.py llm --method llamacpp --model-path /path/to/model.gguf

4. Run NLP Server:

# Start server on default port (8000)
python jetson_nlp_toolkit.py server

# Specify host and port
python jetson_nlp_toolkit.py server --host 127.0.0.1 --port 5000

5. Run Load Tests:

# Test server with default settings
python jetson_nlp_toolkit.py loadtest

# Custom test configuration
python jetson_nlp_toolkit.py loadtest --url http://localhost:8000 --concurrent 20 --requests 500

🧪 Lab 1: Multi-Application NLP Benchmark Suite

Objective: Evaluate and compare different NLP applications on Jetson using standardized datasets

Setup Evaluation Environment

# Create evaluation container
docker run --rm -it --runtime nvidia \
  -v $(pwd)/nlp_eval:/workspace \
  -v $(pwd)/datasets:/datasets \
  nvcr.io/nvidia/pytorch:24.04-py3 /bin/bash

# Install evaluation dependencies
pip install transformers datasets evaluate rouge-score sacrebleu spacy
python -m spacy download en_core_web_sm

Run Comprehensive Evaluation

Use jetson_nlp_toolkit.py to run the NLP evaluations:

# Run all NLP evaluations
python jetson_nlp_toolkit.py evaluate --task all

# Run specific task evaluations
python jetson_nlp_toolkit.py evaluate --task sentiment
python jetson_nlp_toolkit.py evaluate --task qa
python jetson_nlp_toolkit.py evaluate --task summarization
python jetson_nlp_toolkit.py evaluate --task ner

# Save results to custom file
python jetson_nlp_toolkit.py evaluate --output my_results.json

The evaluation suite includes:

- Sentiment Analysis: IMDB dataset with DistilBERT
- Question Answering: SQuAD dataset with DistilBERT
- Text Summarization: CNN/DailyMail with T5-small
- Named Entity Recognition: CoNLL-2003 with BERT-large

Each evaluation measures:

- Accuracy/F1 scores
- Latency and throughput
- GPU memory usage
- CPU utilization


🧪 Lab 2: Advanced Optimization Techniques

Objective: Implement and compare advanced optimization strategies for NLP models on Jetson

Run Optimization Benchmarks

Use jetson_nlp_toolkit.py to compare different optimization methods:



# Run all optimization benchmarks
python jetson_nlp_toolkit.py optimize --method all

# Run specific optimization methods
python jetson_nlp_toolkit.py optimize --method quantization
python jetson_nlp_toolkit.py optimize --method pruning --ratio 0.3
python jetson_nlp_toolkit.py optimize --method distillation

# Benchmark with custom model
python jetson_nlp_toolkit.py optimize --model "bert-base-uncased" --samples 100

# Save optimization results
python jetson_nlp_toolkit.py optimize --output optimization_results.json

The optimization benchmark compares:

- FP32: Full precision baseline
- FP16: Half precision for GPU acceleration
- INT8 Dynamic: Dynamic quantization for CPU
- TorchScript: Graph optimization
- Model Pruning: Structured weight pruning
- Knowledge Distillation: Teacher-student training

Metrics measured:

- Average latency per sample
- Throughput (samples/second)
- Memory usage
- Model size compression
- Speedup vs baseline


🧪 Lab 3: Compare LLM Inference in Containers

🎯 Objective

Evaluate inference speed and memory usage for different LLM deployment methods on Jetson inside Docker containers.

🔧 Setup Container for Each Method

HuggingFace Container:

docker run --rm -it --runtime nvidia \
  -v $(pwd):/workspace \
  nvcr.io/nvidia/pytorch:24.04-py3 /bin/bash

Inside container:

pip install transformers accelerate torch

llama.cpp Container:

docker run --rm -it --runtime nvidia \
  -v $(pwd)/models:/models \
  jetson-llama-cpp /bin/bash

(Assumes container has CUDA + llama.cpp compiled)

Ollama Container:

docker run --rm -it --network host \
  -v ollama:/root/.ollama ollama/ollama
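Once the Ollama container is up (with --network host it listens on port 11434), pull a model with `ollama pull` inside the container and exercise the REST API; the model name here is just an example:

```bash
# Quick generation through the Ollama REST API (pull a model first, e.g. `ollama pull mistral`).
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Explain edge AI in one sentence.",
  "stream": false
}'
# The JSON response includes eval_count and eval_duration (nanoseconds),
# so tokens/sec can be estimated as eval_count / (eval_duration / 1e9).
```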

🔍 Run LLM Inference Comparison

Use jetson_nlp_toolkit.py to compare different LLM inference methods:


# Compare all available LLM inference methods
python jetson_nlp_toolkit.py llm --method all

# Test specific inference methods
python jetson_nlp_toolkit.py llm --method huggingface --model "microsoft/DialoGPT-small"
python jetson_nlp_toolkit.py llm --method llamacpp --model-path "/models/mistral.gguf"
python jetson_nlp_toolkit.py llm --method ollama --model "mistral"

# Custom prompts and settings
python jetson_nlp_toolkit.py llm --prompts "Explain the future of AI in education." --max-tokens 100

# Save comparison results
python jetson_nlp_toolkit.py llm --output llm_comparison_results.json

The LLM comparison evaluates:

- HuggingFace Transformers: GPU-optimized inference
- llama.cpp: CPU-optimized quantized models
- Ollama API: Containerized LLM serving

Metrics measured:

- Average latency per prompt
- Tokens generated per second
- Memory usage
- Model loading time
- Response quality


📊 Record Results

| Method | Latency (s) | Tokens/sec | GPU Mem (MB) |
|---|---|---|---|
| HuggingFace PyTorch | | | |
| llama-cpp-python | | | |
| Ollama REST API | | | |

Use tegrastats or jtop to observe GPU memory and CPU usage during inference.
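One simple way to capture those readings is to log tegrastats to a file for the duration of a benchmark run; a sketch, assuming the --interval/--logfile/--stop flags of recent JetPack releases:

```bash
# Log tegrastats once per second while the LLM benchmark runs, then stop logging.
sudo tegrastats --interval 1000 --logfile tegrastats_llm.log &
python jetson_nlp_toolkit.py llm --method all
sudo tegrastats --stop
```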


📋 Lab Deliverables

For Lab 1 (Multi-Application Benchmark):

  • Completed evaluation results JSON file
  • Performance comparison charts for all NLP tasks
  • Analysis report identifying best models for each task on Jetson
  • Resource utilization graphs (tegrastats screenshots)

For Lab 2 (Optimization Techniques):

  • Quantization comparison table
  • Memory usage analysis
  • Speedup and compression ratio calculations
  • Recommendations for production deployment

For Lab 3 (LLM Container Comparison):

  • Completed benchmark table
  • Screenshots of tegrastats during inference
  • Analysis: Which approach is fastest, lightest, and most accurate for Jetson?

🎯 Advanced NLP Optimization Strategies

1. 🔧 Model Pruning for Jetson

# model_pruning.py
import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForSequenceClassification

def prune_model(model, pruning_ratio=0.2):
    """Apply structured pruning to transformer model"""
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name='weight', amount=pruning_ratio)
            prune.remove(module, 'weight')

    return model

# Example usage
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
pruned_model = prune_model(model, pruning_ratio=0.3)
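To sanity-check the result, you can measure how many Linear weights were actually zeroed; the helper below is a hypothetical add-on, not part of the toolkit:

```python
# Hypothetical helper: report the overall sparsity of Linear layers after pruning.
import torch

def linear_sparsity(model):
    total, zeros = 0, 0
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            total += module.weight.numel()
            zeros += int((module.weight == 0).sum())
    return zeros / max(total, 1)

print(f"Linear-layer sparsity: {linear_sparsity(pruned_model):.1%}")  # ~30% for ratio 0.3
```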

2. 🚀 Knowledge Distillation

# knowledge_distillation.py
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class DistillationTrainer:
    def __init__(self, teacher_model, student_model, temperature=3.0, alpha=0.7):
        self.teacher = teacher_model
        self.student = student_model
        self.temperature = temperature
        self.alpha = alpha
        self.kl_loss = nn.KLDivLoss(reduction='batchmean')
        self.ce_loss = nn.CrossEntropyLoss()

    def distillation_loss(self, student_logits, teacher_logits, labels):
        """Calculate distillation loss"""
        # Soft targets from teacher
        teacher_probs = torch.softmax(teacher_logits / self.temperature, dim=1)
        student_log_probs = torch.log_softmax(student_logits / self.temperature, dim=1)

        # Distillation loss
        distill_loss = self.kl_loss(student_log_probs, teacher_probs) * (self.temperature ** 2)

        # Hard target loss
        hard_loss = self.ce_loss(student_logits, labels)

        # Combined loss
        total_loss = self.alpha * distill_loss + (1 - self.alpha) * hard_loss
        return total_loss
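A minimal sketch of how the loss might be exercised with dummy logits, just to confirm shapes and scales; real training would plug in actual teacher/student forward passes:

```python
# Minimal sketch: evaluate the distillation loss on random logits for a 2-class task.
import torch

trainer = DistillationTrainer(teacher_model=None, student_model=None)  # models unused by the loss itself

teacher_logits = torch.randn(8, 2)        # [batch, num_classes] from the (frozen) teacher
student_logits = torch.randn(8, 2)        # [batch, num_classes] from the student
labels = torch.randint(0, 2, (8,))        # hard labels

loss = trainer.distillation_loss(student_logits, teacher_logits, labels)
print(f"Combined distillation loss: {loss.item():.4f}")
```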

3. 🔄 Dynamic Batching for Real-time Inference

# dynamic_batching.py
import asyncio
import time
import torch
from collections import deque
from typing import List

class DynamicBatcher:
    def __init__(self, model, tokenizer, max_batch_size=8, max_wait_time=0.1):
        self.model = model
        self.tokenizer = tokenizer
        self.max_batch_size = max_batch_size
        self.max_wait_time = max_wait_time
        self.request_queue = deque()
        self.processing = False

    async def add_request(self, text: str) -> str:
        """Add inference request to queue"""
        future = asyncio.Future()
        self.request_queue.append((text, future))

        if not self.processing:
            asyncio.create_task(self.process_batch())

        return await future

    async def process_batch(self):
        """Process requests in batches"""
        self.processing = True

        while self.request_queue:
            batch = []
            futures = []
            start_time = time.time()

            # Collect batch
            while (len(batch) < self.max_batch_size and 
                   self.request_queue and 
                   (time.time() - start_time) < self.max_wait_time):

                text, future = self.request_queue.popleft()
                batch.append(text)
                futures.append(future)

                if not self.request_queue:
                    await asyncio.sleep(0.01)  # Small wait for more requests

            if batch:
                # Process batch
                results = await self.inference_batch(batch)

                # Return results
                for future, result in zip(futures, results):
                    future.set_result(result)

        self.processing = False

    async def inference_batch(self, texts: List[str]) -> List[str]:
        """Run inference on batch"""
        inputs = self.tokenizer(
            texts, 
            return_tensors="pt", 
            padding=True, 
            truncation=True,
            max_length=512
        )

        with torch.no_grad():
            outputs = self.model(**inputs)
            predictions = torch.argmax(outputs.logits, dim=-1)

        return [f"Prediction: {pred.item()}" for pred in predictions]

4. 📊 Real-time Performance Monitoring

# performance_monitor.py
import time
import psutil
import torch
from collections import deque
import matplotlib.pyplot as plt
from threading import Thread

class JetsonNLPMonitor:
    def __init__(self, window_size=100):
        self.window_size = window_size
        self.metrics = {
            'latency': deque(maxlen=window_size),
            'throughput': deque(maxlen=window_size),
            'gpu_memory': deque(maxlen=window_size),
            'cpu_usage': deque(maxlen=window_size),
            'timestamps': deque(maxlen=window_size)
        }
        self.monitoring = False

    def start_monitoring(self):
        """Start background monitoring"""
        self.monitoring = True
        monitor_thread = Thread(target=self._monitor_loop)
        monitor_thread.daemon = True
        monitor_thread.start()

    def stop_monitoring(self):
        """Stop monitoring"""
        self.monitoring = False

    def _monitor_loop(self):
        """Background monitoring loop"""
        while self.monitoring:
            timestamp = time.time()

            # GPU memory
            if torch.cuda.is_available():
                gpu_memory = torch.cuda.memory_allocated() / 1024 / 1024  # MB
            else:
                gpu_memory = 0

            # CPU usage
            cpu_usage = psutil.cpu_percent()

            self.metrics['gpu_memory'].append(gpu_memory)
            self.metrics['cpu_usage'].append(cpu_usage)
            self.metrics['timestamps'].append(timestamp)

            time.sleep(0.1)  # Monitor every 100ms

    def log_inference(self, latency, batch_size=1):
        """Log inference metrics"""
        self.metrics['latency'].append(latency * 1000)  # Convert to ms
        self.metrics['throughput'].append(batch_size / latency)  # samples/sec

    def get_stats(self):
        """Get current statistics"""
        if not self.metrics['latency']:
            return {}

        return {
            'avg_latency_ms': sum(self.metrics['latency']) / len(self.metrics['latency']),
            'avg_throughput': sum(self.metrics['throughput']) / len(self.metrics['throughput']),
            'avg_gpu_memory_mb': sum(self.metrics['gpu_memory']) / len(self.metrics['gpu_memory']),
            'avg_cpu_usage': sum(self.metrics['cpu_usage']) / len(self.metrics['cpu_usage']),
            'total_inferences': len(self.metrics['latency'])
        }

    def plot_metrics(self, save_path="nlp_performance.png"):
        """Plot performance metrics"""
        fig, axes = plt.subplots(2, 2, figsize=(12, 8))

        # Latency
        axes[0, 0].plot(list(self.metrics['latency']))
        axes[0, 0].set_title('Inference Latency (ms)')
        axes[0, 0].set_ylabel('Latency (ms)')

        # Throughput
        axes[0, 1].plot(list(self.metrics['throughput']))
        axes[0, 1].set_title('Throughput (samples/sec)')
        axes[0, 1].set_ylabel('Samples/sec')

        # GPU Memory
        axes[1, 0].plot(list(self.metrics['gpu_memory']))
        axes[1, 0].set_title('GPU Memory Usage (MB)')
        axes[1, 0].set_ylabel('Memory (MB)')

        # CPU Usage
        axes[1, 1].plot(list(self.metrics['cpu_usage']))
        axes[1, 1].set_title('CPU Usage (%)')
        axes[1, 1].set_ylabel('CPU %')

        plt.tight_layout()
        plt.savefig(save_path)
        plt.show()

        print(f"๐Ÿ“Š Performance plots saved to {save_path}")

🧪 Bonus Lab: Export HuggingFace → ONNX → TensorRT

  1. Export:

import torch
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
dummy = torch.randint(0, 100, (1, 64))
torch.onnx.export(model, (dummy,), "model.onnx", input_names=["input_ids"])

  2. Convert:

trtexec --onnx=model.onnx --saveEngine=model.trt

  3. Run the engine using the TensorRT Python bindings or onnxruntime-gpu.

🚀 Production Deployment Strategies

1. 🐳 Multi-Stage Docker Optimization

# Dockerfile.nlp-production
# Multi-stage build for optimized NLP deployment
FROM nvcr.io/nvidia/pytorch:24.04-py3 as builder

# Install build dependencies
RUN pip install transformers torchaudio torchvision
RUN pip install onnx onnxruntime-gpu tensorrt

# Copy and optimize models
COPY models/ /tmp/models/
COPY scripts/optimize_models.py /tmp/
RUN python /tmp/optimize_models.py

# Production stage
FROM nvcr.io/nvidia/pytorch:24.04-py3

# Install only runtime dependencies
RUN pip install --no-cache-dir \
    transformers==4.36.0 \
    torch==2.1.0 \
    onnxruntime-gpu==1.16.0 \
    fastapi==0.104.0 \
    uvicorn==0.24.0

# Copy optimized models
COPY --from=builder /tmp/optimized_models/ /app/models/
COPY src/ /app/src/

WORKDIR /app
EXPOSE 8000

CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]

2. 🌐 FastAPI Production Server

Use jetson_nlp_toolkit.py to deploy a production-ready NLP server:

# Start production NLP server
python jetson_nlp_toolkit.py server --host 0.0.0.0 --port 8000

# Start with custom models
python jetson_nlp_toolkit.py server --sentiment-model "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Start with monitoring enabled
python jetson_nlp_toolkit.py server --enable-monitoring

The production server includes:

- FastAPI Framework: High-performance async API
- Model Optimization: FP16 precision and GPU acceleration
- Batch Processing: Efficient handling of multiple requests
- Health Monitoring: System status and performance metrics
- Error Handling: Robust error management and logging
- WebSocket Support: Real-time chat interface

API Endpoints Available:

  • GET /health: System health and model status
  • POST /sentiment: Batch sentiment analysis
  • POST /qa: Question answering
  • POST /summarize: Text summarization
  • WebSocket /chat: Real-time chat interface

Testing the Server:

# Test health endpoint
curl http://localhost:8000/health

# Test sentiment analysis
curl -X POST "http://localhost:8000/sentiment" \
     -H "Content-Type: application/json" \
     -d '{"texts": ["I love this product!", "This is terrible"], "batch_size": 2}'

# Test question answering
curl -X POST "http://localhost:8000/qa" \
     -H "Content-Type: application/json" \
     -d '{"questions": ["What is AI?"], "contexts": ["Artificial Intelligence is the simulation of human intelligence."]}'

### 3. 📊 Load Testing & Performance Validation

Use jetson_nlp_toolkit.py for comprehensive load testing:

```bash
# Run load test on running server
python jetson_nlp_toolkit.py loadtest --url http://localhost:8000

# Custom load test parameters
python jetson_nlp_toolkit.py loadtest --url http://localhost:8000 --concurrent 20 --requests 500

# Test specific endpoints
python jetson_nlp_toolkit.py loadtest --url http://localhost:8000 --endpoint sentiment
python jetson_nlp_toolkit.py loadtest --url http://localhost:8000 --endpoint qa

# Save load test results
python jetson_nlp_toolkit.py loadtest --url http://localhost:8000 --output load_test_results.json

```

The load tester provides:

- Concurrent Testing: Simulates multiple users
- Endpoint Coverage: Tests all API endpoints
- Performance Metrics: Latency, throughput, success rate
- Error Analysis: Detailed failure reporting
- Results Export: JSON format for further analysis

---

## 🎯 Final Challenge: Complete NLP Pipeline on Jetson

### 🏆 Challenge Objective

Build a complete, production-ready NLP pipeline that processes real-world data and demonstrates all optimization techniques learned in this tutorial.

### 📋 Challenge Requirements

#### 1. **Multi-Modal NLP System**
Implement a system that handles:
- **Text Classification** (sentiment analysis on product reviews)
- **Information Extraction** (NER on news articles)
- **Question Answering** (FAQ system for customer support)
- **Text Summarization** (news article summarization)
- **Real-time Chat** (customer service chatbot)

#### 2. **Optimization Implementation**
- Apply **quantization** (FP16 minimum, INT8 preferred)
- Implement **dynamic batching** for throughput optimization
- Use **model pruning** to reduce memory footprint
- Deploy with **TensorRT** optimization where possible
- Implement **caching** for frequently requested content

#### 3. **Production Deployment**
- **Containerized deployment** with multi-stage Docker builds
- **REST API** with proper error handling and logging
- **Load balancing** for high availability
- **Monitoring and metrics** collection
- **Auto-scaling** based on resource utilization

#### 4. **Performance Benchmarking**
- **Latency analysis** (P50, P95, P99 percentiles)
- **Throughput measurement** (requests per second)
- **Resource utilization** (GPU/CPU/memory usage)
- **Accuracy validation** on standard datasets
- **Cost analysis** (inference cost per request)

### 🛠️ Implementation Guide

Use the comprehensive jetson_nlp_toolkit.py to implement the complete challenge:

```bash
# Run comprehensive evaluation across all NLP tasks
python jetson_nlp_toolkit.py evaluate --task all --output challenge_evaluation.json

# Apply all optimization techniques
python jetson_nlp_toolkit.py optimize --method all --model distilbert-base-uncased-finetuned-sst-2-english

# Deploy production server with monitoring
python jetson_nlp_toolkit.py server --enable-monitoring --host 0.0.0.0 --port 8000

# Run comprehensive load testing
python jetson_nlp_toolkit.py loadtest --url http://localhost:8000 --concurrent 50 --requests 1000

# Compare LLM inference methods
python jetson_nlp_toolkit.py llm --method all --model microsoft/DialoGPT-small

```

The toolkit provides all necessary components:

- Multi-task NLP Pipeline: Sentiment, NER, QA, Summarization, Chat
- Optimization Techniques: Quantization, Pruning, Distillation, TensorRT
- Production Deployment: FastAPI server with WebSocket support
- Performance Monitoring: Real-time metrics and analysis
- Load Testing: Comprehensive performance validation
- Caching & Batching: Redis integration and dynamic batching

📊 Evaluation Criteria

| Criterion | Weight | Excellent (90-100%) | Good (70-89%) | Satisfactory (50-69%) |
|---|---|---|---|---|
| Functionality | 25% | All 5 NLP tasks working perfectly | 4/5 tasks working | 3/5 tasks working |
| Optimization | 25% | All optimization techniques applied | Most optimizations applied | Basic optimizations |
| Performance | 20% | <50ms P95 latency, >100 req/s | <100ms P95, >50 req/s | <200ms P95, >20 req/s |
| Production Ready | 15% | Full deployment with monitoring | Basic deployment | Local deployment only |
| Code Quality | 10% | Clean, documented, tested | Well-structured | Basic implementation |
| Innovation | 5% | Novel optimizations/features | Creative solutions | Standard implementation |

🎯 Bonus Challenges

  1. Multi-Language Support: Extend the pipeline to handle multiple languages
  2. Edge Deployment: Deploy on actual Jetson hardware with resource constraints
  3. Federated Learning: Implement model updates without centralized data
  4. Real-time Streaming: Process continuous data streams with low latency
  5. Custom Models: Train and deploy domain-specific models

📌 Summary

  • Comprehensive NLP Applications: Covered 6 major NLP tasks with Jetson-specific optimizations
  • Advanced Optimization Techniques: Quantization, pruning, distillation, and dynamic batching
  • Production Deployment: Multi-stage Docker builds, FastAPI servers, and load testing
  • Performance Monitoring: Real-time metrics collection and analysis
  • Practical Evaluation: Standardized benchmarking on popular datasets
  • Complete Pipeline: End-to-end solution from development to production
  • All-in-One Toolkit: Unified command-line tool (jetson_nlp_toolkit.py) combining all examples and implementations

This tutorial provides a comprehensive foundation for deploying production-ready NLP applications on Jetson devices, balancing performance, accuracy, and resource efficiency. The included all-in-one toolkit makes it easy to experiment with different NLP tasks, optimization techniques, and deployment strategies using a simple command-line interface.