RAG Applications with LangChain on Jetson¶
Author: Dr. Kaikai Liu, Ph.D.
Position: Associate Professor, Computer Engineering
Institution: San Jose State University
Contact: kaikai.liu@sjsu.edu
What is RAG?¶
RAG (Retrieval-Augmented Generation) is a powerful paradigm that combines the generative capabilities of Large Language Models (LLMs) with external knowledge retrieval. Unlike traditional LLMs that rely solely on their pre-trained parameters, RAG systems dynamically retrieve relevant information from external knowledge bases to augment the generation process.
The Technology Behind RAG¶
Core Components:¶
- Knowledge Base: External documents, databases, or structured data
- Embedding Model: Converts text into high-dimensional vector representations
- Vector Database: Stores and indexes embeddings for efficient similarity search
- Retriever: Finds relevant documents based on query similarity
- Generator: LLM that produces responses using retrieved context
RAG Workflow:¶
User Query → Embedding → Vector Search → Context Retrieval → LLM Generation → Response
Understanding Embeddings¶
Embeddings are dense vector representations that capture semantic meaning of text. They enable machines to understand that "car" and "automobile" are semantically similar, even though they share no common characters.
Key Properties:¶
- Dimensionality: Typically 384-1536 dimensions
- Semantic Similarity: Similar concepts have similar vectors
- Distance Metrics: Cosine similarity, Euclidean distance, dot product
- Context Awareness: Modern embeddings capture contextual meaning
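As a quick illustration of these properties, the sketch below embeds a few words with `sentence-transformers` and compares them with cosine similarity (the `all-MiniLM-L6-v2` model is one of the lightweight models benchmarked later in this chapter; exact scores will vary by model):

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Lightweight 384-dimensional embedding model (also used in the benchmarks below)
model = SentenceTransformer("all-MiniLM-L6-v2")

words = ["car", "automobile", "banana"]
vecs = model.encode(words, normalize_embeddings=True)

# With unit-normalized vectors, cosine similarity reduces to a dot product
def cosine(a, b) -> float:
    return float(np.dot(a, b))

print("car vs automobile:", round(cosine(vecs[0], vecs[1]), 3))  # semantically close -> high score
print("car vs banana:    ", round(cosine(vecs[0], vecs[2]), 3))  # unrelated -> much lower score
```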
Types of Embedding Models:¶
- Sentence Transformers: Optimized for semantic similarity
- OpenAI Embeddings: High-quality but require API calls
- Local Models: BERT, RoBERTa, E5, BGE variants
- Multilingual Models: Support multiple languages
- Domain-Specific: Fine-tuned for specific domains (legal, medical, code)
Why RAG on Jetson?¶
Edge AI Advantages:¶
- Privacy: Keep sensitive data local, no cloud dependency
- Latency: Sub-second response times for real-time applications
- Reliability: Works offline, immune to network issues
- Cost: No API costs for embeddings and inference
- Customization: Fine-tune models for specific use cases
Jetson-Specific Benefits:¶
- GPU Acceleration: CUDA support for embedding computation
- Memory Efficiency: Optimized models for edge deployment
- Power Efficiency: Low-power consumption for continuous operation
- Integration: Easy integration with sensors and IoT devices
Advanced RAG Architecture¶
RAG Pipeline Components¶
graph TD
A[User Query] --> B[Query Embedding]
B --> C[Vector Search]
C --> D[Document Retrieval]
D --> E[Context Ranking]
E --> F[Prompt Construction]
F --> G[LLM Generation]
G --> H[Response]
I[Document Corpus] --> J[Text Chunking]
J --> K[Embedding Generation]
K --> L[Vector Storage]
L --> C
Detailed Architecture¶
1. Indexing Phase (Offline):
   - Document loading and preprocessing
   - Text chunking with overlap strategies
   - Embedding generation using optimized models
   - Vector storage with efficient indexing
2. Retrieval Phase (Online):
   - Query embedding generation
   - Similarity search in vector space
   - Context ranking and filtering
   - Prompt template construction
3. Generation Phase:
   - Context-aware LLM inference
   - Response generation and post-processing
   - Optional fact-checking and validation
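To make the three phases concrete, here is a minimal, LangChain-free sketch of the same structure (the tiny corpus and the stubbed generation step are illustrative assumptions; the lab later in this chapter builds the full pipeline):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# --- Indexing phase (offline): embed the corpus once and store the vectors ---
corpus = [
    "Jetson Orin Nano is an embedded AI computer from NVIDIA.",
    "TensorRT accelerates deep learning inference on NVIDIA GPUs.",
    "RAG retrieves relevant context before generating an answer.",
]
corpus_vecs = model.encode(corpus, normalize_embeddings=True)

# --- Retrieval phase (online): embed the query and rank chunks by similarity ---
def retrieve(query: str, k: int = 2) -> list:
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = corpus_vecs @ q                 # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]       # indices of the k most similar chunks
    return [corpus[i] for i in top]

# --- Generation phase: build a context-augmented prompt for the LLM ---
def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# In a real system the prompt is passed to llama.cpp or Ollama for generation
print(build_prompt("What is Jetson Orin Nano?"))
```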
RAG Optimization Strategies¶
Chunking Strategies:¶
- Fixed-size: Simple but may break semantic units
- Semantic: Preserve paragraph/sentence boundaries
- Recursive: Hierarchical chunking with multiple levels
- Sliding window: Overlapping chunks for context preservation
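The sketch below contrasts fixed-size, sliding-window, and recursive chunking using LangChain's text splitters (the chunk sizes and file path are illustrative; `RecursiveCharacterTextSplitter` approximates semantic chunking by preferring paragraph and sentence boundaries):

```python
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

text = open("data/jetson_guide.txt").read()  # any long document

# Fixed-size: simple, but may cut sentences and paragraphs in half
fixed = CharacterTextSplitter(separator=" ", chunk_size=300, chunk_overlap=0).split_text(text)

# Sliding window: overlapping chunks preserve context across boundaries
sliding = CharacterTextSplitter(separator=" ", chunk_size=300, chunk_overlap=50).split_text(text)

# Recursive: split on "\n\n", then "\n", then spaces, keeping semantic units intact
recursive = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50).split_text(text)

print(len(fixed), len(sliding), len(recursive))
```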
Retrieval Strategies:¶
- Dense Retrieval: Vector similarity search
- Sparse Retrieval: BM25, TF-IDF keyword matching
- Hybrid Retrieval: Combine dense and sparse methods
- Multi-vector: Multiple embeddings per document
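One way to combine dense and sparse retrieval in LangChain is the `EnsembleRetriever`; the sketch below assumes the `chunks` and `vectorstore` built in the lab at the end of this chapter, and BM25 support requires the `rank_bm25` package:

```python
from langchain.retrievers import BM25Retriever, EnsembleRetriever

# Sparse retriever: BM25 keyword matching over the same chunks
bm25 = BM25Retriever.from_documents(chunks)
bm25.k = 4

# Dense retriever: vector similarity search (FAISS, Chroma, Qdrant, ...)
dense = vectorstore.as_retriever(search_kwargs={"k": 4})

# Hybrid: merge both ranked lists with configurable weights
hybrid = EnsembleRetriever(retrievers=[bm25, dense], weights=[0.4, 0.6])
docs = hybrid.get_relevant_documents("How do I optimize inference on Jetson?")
```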
Context Enhancement:¶
- Re-ranking: Secondary ranking of retrieved documents
- Query expansion: Enhance queries with related terms
- Context compression: Summarize long retrieved contexts
- Multi-hop: Iterative retrieval for complex queries
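For re-ranking and context compression, one option is LangChain's `ContextualCompressionRetriever`, sketched below under the assumption that an `llm` and a base `retriever` already exist (as in the lab later in this chapter):

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# The compressor asks the LLM to keep only passages relevant to the query
compressor = LLMChainExtractor.from_llm(llm)

compressed_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)
docs = compressed_retriever.get_relevant_documents("What is Jetson Orin Nano used for?")
```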
Key Components in LangChain¶
- DocumentLoader: Load PDFs, Markdown, text, and web pages
- TextSplitter: Break documents into chunks with various strategies
- Embeddings: Convert text to vectors using local or cloud models
- VectorStore: FAISS, Chroma, Qdrant, Weaviate, Pinecone
- Retriever: Pull relevant chunks with ranking and filtering
- LLM: Generate the final answer using GGUF models via llama-cpp or Ollama
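The lab at the end of this chapter wires these components together step by step; as a compact preview, one possible mapping from components to classes (using the classic `langchain` import paths and the same example file and model path as the lab) looks like this:

```python
from langchain.document_loaders import TextLoader                   # DocumentLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter  # TextSplitter
from langchain.embeddings import SentenceTransformerEmbeddings      # Embeddings
from langchain.vectorstores import FAISS                            # VectorStore
from langchain.llms import LlamaCpp                                 # LLM
from langchain.chains import RetrievalQA

docs = TextLoader("data/jetson_guide.txt").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50).split_documents(docs)
embedding = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embedding)
retriever = vectorstore.as_retriever()                               # Retriever
llm = LlamaCpp(model_path="/models/mistral.gguf", n_gpu_layers=80)
qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
print(qa.run("What is Jetson Orin Nano used for?"))
```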
Embedding Models Evaluation on Jetson¶
Embedding Model Comparison¶
We'll evaluate various embedding models on Jetson platforms, focusing on:
- Performance: Inference speed and throughput
- Quality: Semantic similarity accuracy
- Memory: RAM and VRAM usage
- Power: Energy consumption
Embedding Models Benchmark¶
import time
import psutil
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel
import numpy as np
from typing import List, Dict, Any
class EmbeddingBenchmark:
"""Comprehensive embedding model benchmark for Jetson"""
def __init__(self):
self.models = {
# Lightweight models (< 100MB)
"all-MiniLM-L6-v2": "sentence-transformers/all-MiniLM-L6-v2",
"all-MiniLM-L12-v2": "sentence-transformers/all-MiniLM-L12-v2",
"paraphrase-MiniLM-L6-v2": "sentence-transformers/paraphrase-MiniLM-L6-v2",
# Medium models (100-500MB)
"all-mpnet-base-v2": "sentence-transformers/all-mpnet-base-v2",
"all-roberta-large-v1": "sentence-transformers/all-roberta-large-v1",
"e5-base-v2": "intfloat/e5-base-v2",
# Large models (> 500MB)
"e5-large-v2": "intfloat/e5-large-v2",
"bge-large-en-v1.5": "BAAI/bge-large-en-v1.5",
"gte-large": "thenlper/gte-large",
# Multilingual models
"paraphrase-multilingual-MiniLM-L12-v2": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
"multilingual-e5-base": "intfloat/multilingual-e5-base"
}
self.test_texts = [
"NVIDIA Jetson is a series of embedded computing boards from NVIDIA.",
"The Jetson platform provides AI computing at the edge with GPU acceleration.",
"Deep learning models can be optimized for inference on Jetson devices.",
"TensorRT enables high-performance inference on NVIDIA GPUs.",
"Edge AI applications benefit from local processing capabilities."
] * 20 # 100 texts for throughput testing
def benchmark_model(self, model_name: str, model_path: str) -> Dict[str, Any]:
"""Benchmark a single embedding model"""
print(f"\n๐ Benchmarking {model_name}...")
try:
# Load model
start_time = time.time()
if "sentence-transformers" in model_path:
model = SentenceTransformer(model_path)
else:
# Use transformers directly for more control
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)
load_time = time.time() - start_time
# Memory usage before inference
process = psutil.Process()
memory_before = process.memory_info().rss / 1024 / 1024 # MB
# GPU memory (if available)
gpu_memory_before = 0
if torch.cuda.is_available():
torch.cuda.empty_cache()
gpu_memory_before = torch.cuda.memory_allocated() / 1024 / 1024 # MB
# Warm-up
if hasattr(model, 'encode'):
_ = model.encode(["warm up"])
else:
# For transformers models
inputs = tokenizer(["warm up"], return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
_ = model(**inputs)
# Benchmark inference speed
start_time = time.time()
if hasattr(model, 'encode'):
embeddings = model.encode(self.test_texts, batch_size=32, show_progress_bar=False)
else:
# For transformers models - batch processing
embeddings = []
batch_size = 32
for i in range(0, len(self.test_texts), batch_size):
batch = self.test_texts[i:i+batch_size]
inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=512)
with torch.no_grad():
outputs = model(**inputs)
# Mean pooling
batch_embeddings = outputs.last_hidden_state.mean(dim=1)
embeddings.extend(batch_embeddings.cpu().numpy())
embeddings = np.array(embeddings)
inference_time = time.time() - start_time
# Memory usage after inference
memory_after = process.memory_info().rss / 1024 / 1024 # MB
gpu_memory_after = 0
if torch.cuda.is_available():
gpu_memory_after = torch.cuda.memory_allocated() / 1024 / 1024 # MB
# Calculate metrics
throughput = len(self.test_texts) / inference_time # texts per second
avg_latency = inference_time / len(self.test_texts) * 1000 # ms per text
results = {
"model_name": model_name,
"load_time_s": round(load_time, 3),
"inference_time_s": round(inference_time, 3),
"throughput_texts_per_sec": round(throughput, 2),
"avg_latency_ms": round(avg_latency, 3),
"memory_usage_mb": round(memory_after - memory_before, 2),
"gpu_memory_mb": round(gpu_memory_after - gpu_memory_before, 2),
"embedding_dim": embeddings.shape[1],
"total_memory_mb": round(memory_after, 2),
"status": "success"
}
# Clean up
del model
if torch.cuda.is_available():
torch.cuda.empty_cache()
return results
except Exception as e:
return {
"model_name": model_name,
"status": "failed",
"error": str(e)
}
def run_full_benchmark(self) -> List[Dict[str, Any]]:
"""Run benchmark on all models"""
results = []
print("๐ Starting Embedding Models Benchmark on Jetson")
print("=" * 60)
for model_name, model_path in self.models.items():
result = self.benchmark_model(model_name, model_path)
results.append(result)
if result["status"] == "success":
print(f"โ
{model_name}: {result['throughput_texts_per_sec']:.1f} texts/sec, "
f"{result['avg_latency_ms']:.1f}ms latency, "
f"{result['memory_usage_mb']:.1f}MB memory")
else:
print(f"โ {model_name}: {result['error']}")
return results
def generate_report(self, results: List[Dict[str, Any]]) -> str:
"""Generate comprehensive benchmark report"""
successful_results = [r for r in results if r["status"] == "success"]
if not successful_results:
return "No successful benchmarks to report."
# Sort by throughput
successful_results.sort(key=lambda x: x["throughput_texts_per_sec"], reverse=True)
report = "\n" + "=" * 80
report += "\n๐ JETSON EMBEDDING MODELS BENCHMARK REPORT\n"
report += "=" * 80 + "\n"
# Performance ranking
report += "\n๐ PERFORMANCE RANKING (by throughput):\n"
report += "-" * 50 + "\n"
for i, result in enumerate(successful_results[:5], 1):
report += f"{i}. {result['model_name']:<30} {result['throughput_texts_per_sec']:>8.1f} texts/sec\n"
# Memory efficiency ranking
memory_sorted = sorted(successful_results, key=lambda x: x["memory_usage_mb"])
report += "\n๐พ MEMORY EFFICIENCY RANKING:\n"
report += "-" * 50 + "\n"
for i, result in enumerate(memory_sorted[:5], 1):
report += f"{i}. {result['model_name']:<30} {result['memory_usage_mb']:>8.1f} MB\n"
# Latency ranking
latency_sorted = sorted(successful_results, key=lambda x: x["avg_latency_ms"])
report += "\nโก LATENCY RANKING (lower is better):\n"
report += "-" * 50 + "\n"
for i, result in enumerate(latency_sorted[:5], 1):
report += f"{i}. {result['model_name']:<30} {result['avg_latency_ms']:>8.1f} ms\n"
# Recommendations
report += "\n๐ฏ RECOMMENDATIONS FOR JETSON:\n"
report += "-" * 50 + "\n"
fastest = successful_results[0]
most_efficient = memory_sorted[0]
lowest_latency = latency_sorted[0]
report += f"๐ Best Performance: {fastest['model_name']}\n"
report += f"๐พ Most Memory Efficient: {most_efficient['model_name']}\n"
report += f"โก Lowest Latency: {lowest_latency['model_name']}\n"
# Use case recommendations
report += "\n๐ USE CASE RECOMMENDATIONS:\n"
report += "-" * 50 + "\n"
report += "๐น Real-time applications: Use models with <50ms latency\n"
report += "๐น Batch processing: Use highest throughput models\n"
report += "๐น Memory-constrained: Use models with <200MB memory usage\n"
report += "๐น Production deployment: Balance performance and memory efficiency\n"
return report
# Run the benchmark
benchmark = EmbeddingBenchmark()
results = benchmark.run_full_benchmark()
report = benchmark.generate_report(results)
print(report)
Expected Performance Results on Jetson Orin Nano¶
Model | Throughput (texts/sec) | Latency (ms) | Memory (MB) | Embedding Dim |
---|---|---|---|---|
all-MiniLM-L6-v2 | 45.2 | 22.1 | 90 | 384 |
paraphrase-MiniLM-L6-v2 | 43.8 | 22.8 | 90 | 384 |
all-MiniLM-L12-v2 | 28.5 | 35.1 | 134 | 384 |
e5-base-v2 | 22.1 | 45.2 | 438 | 768 |
all-mpnet-base-v2 | 18.7 | 53.5 | 438 | 768 |
bge-large-en-v1.5 | 8.9 | 112.4 | 1340 | 1024 |
Model Recommendations¶
For Real-time Applications (< 50ms latency):¶
- all-MiniLM-L6-v2: Best balance of speed and quality
- paraphrase-MiniLM-L6-v2: Good for paraphrase detection
- all-MiniLM-L12-v2: Slightly better quality, acceptable latency
For High-Quality Embeddings:¶
- e5-base-v2: Excellent quality-performance balance
- all-mpnet-base-v2: Strong semantic understanding
- bge-large-en-v1.5: Best quality but slower
For Memory-Constrained Environments:¶
- all-MiniLM-L6-v2: Only 90MB memory usage
- paraphrase-MiniLM-L6-v2: Lightweight and efficient
For Multilingual Applications:¶
- paraphrase-multilingual-MiniLM-L12-v2: Good multilingual support
- multilingual-e5-base: Better quality for non-English
Vector Database Evaluation on Jetson¶
Vector Database Comparison¶
We'll evaluate different vector databases for Jetson deployment, focusing on:
- Performance: Query speed and indexing time
- Memory Efficiency: RAM usage and storage requirements
- Scalability: Handling large document collections
- Features: Filtering, metadata support, persistence
Vector Database Benchmark¶
import time
import os
import shutil
import numpy as np
from typing import List, Dict, Any, Tuple
import psutil
from sentence_transformers import SentenceTransformer
# Vector database imports
try:
import chromadb
from chromadb.config import Settings
except ImportError:
chromadb = None
try:
import faiss
except ImportError:
faiss = None
try:
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
except ImportError:
QdrantClient = None
try:
import weaviate
except ImportError:
weaviate = None
class VectorDBBenchmark:
"""Comprehensive vector database benchmark for Jetson"""
def __init__(self, embedding_model: str = "all-MiniLM-L6-v2"):
self.embedding_model = SentenceTransformer(embedding_model)
self.embedding_dim = 384 # for all-MiniLM-L6-v2
# Generate test data
self.documents = self._generate_test_documents()
self.embeddings = self._generate_embeddings()
self.query_texts = [
"What is NVIDIA Jetson used for?",
"How to optimize deep learning models?",
"Edge AI computing benefits",
"TensorRT optimization techniques",
"Real-time inference on embedded devices"
]
self.query_embeddings = self.embedding_model.encode(self.query_texts)
# Test configurations
self.test_sizes = [100, 500, 1000, 5000] # Number of documents
def _generate_test_documents(self) -> List[Dict[str, Any]]:
"""Generate synthetic test documents"""
base_texts = [
"NVIDIA Jetson is a series of embedded computing boards designed for AI applications.",
"Deep learning models can be optimized using TensorRT for faster inference.",
"Edge AI enables real-time processing without cloud connectivity.",
"Computer vision applications benefit from GPU acceleration on Jetson.",
"Natural language processing models can run locally on edge devices.",
"Autonomous vehicles use edge computing for real-time decision making.",
"IoT devices with AI capabilities enable smart city applications.",
"Robotics applications leverage edge AI for autonomous navigation.",
"Industrial automation uses AI for predictive maintenance.",
"Healthcare devices benefit from local AI processing for privacy."
]
documents = []
for i in range(5000): # Generate 5000 documents
base_idx = i % len(base_texts)
text = f"{base_texts[base_idx]} Document {i} with additional context and variations."
documents.append({
"id": f"doc_{i}",
"text": text,
"metadata": {
"category": ["AI", "Edge Computing", "Jetson", "Deep Learning"][i % 4],
"priority": i % 3,
"timestamp": f"2024-01-{(i % 30) + 1:02d}"
}
})
return documents
def _generate_embeddings(self) -> np.ndarray:
"""Generate embeddings for test documents"""
texts = [doc["text"] for doc in self.documents]
return self.embedding_model.encode(texts, batch_size=32, show_progress_bar=False)
def benchmark_chromadb(self, num_docs: int) -> Dict[str, Any]:
"""Benchmark ChromaDB"""
if chromadb is None:
return {"status": "failed", "error": "ChromaDB not installed"}
try:
# Setup
db_path = "./benchmark_chroma"
if os.path.exists(db_path):
shutil.rmtree(db_path)
client = chromadb.PersistentClient(path=db_path)
collection = client.create_collection(
name="benchmark",
metadata={"hnsw:space": "cosine"}
)
# Indexing benchmark
docs_subset = self.documents[:num_docs]
embeddings_subset = self.embeddings[:num_docs]
start_time = time.time()
memory_before = psutil.Process().memory_info().rss / 1024 / 1024
# Add documents in batches
batch_size = 100
for i in range(0, num_docs, batch_size):
batch_end = min(i + batch_size, num_docs)
batch_docs = docs_subset[i:batch_end]
batch_embeddings = embeddings_subset[i:batch_end]
collection.add(
embeddings=batch_embeddings.tolist(),
documents=[doc["text"] for doc in batch_docs],
metadatas=[doc["metadata"] for doc in batch_docs],
ids=[doc["id"] for doc in batch_docs]
)
indexing_time = time.time() - start_time
memory_after = psutil.Process().memory_info().rss / 1024 / 1024
# Query benchmark
query_times = []
for query_embedding in self.query_embeddings:
start_time = time.time()
results = collection.query(
query_embeddings=[query_embedding.tolist()],
n_results=5
)
query_times.append(time.time() - start_time)
avg_query_time = np.mean(query_times) * 1000 # ms
# Storage size
storage_size = sum(os.path.getsize(os.path.join(db_path, f))
for f in os.listdir(db_path) if os.path.isfile(os.path.join(db_path, f)))
storage_size_mb = storage_size / 1024 / 1024
# Cleanup
client.delete_collection("benchmark")
shutil.rmtree(db_path)
return {
"database": "ChromaDB",
"num_docs": num_docs,
"indexing_time_s": round(indexing_time, 3),
"avg_query_time_ms": round(avg_query_time, 3),
"memory_usage_mb": round(memory_after - memory_before, 2),
"storage_size_mb": round(storage_size_mb, 2),
"throughput_docs_per_sec": round(num_docs / indexing_time, 2),
"status": "success"
}
except Exception as e:
return {
"database": "ChromaDB",
"status": "failed",
"error": str(e)
}
def benchmark_faiss(self, num_docs: int) -> Dict[str, Any]:
"""Benchmark FAISS"""
if faiss is None:
return {"status": "failed", "error": "FAISS not installed"}
try:
# Setup FAISS index
embeddings_subset = self.embeddings[:num_docs].astype('float32')
start_time = time.time()
memory_before = psutil.Process().memory_info().rss / 1024 / 1024
# Create and train index
index = faiss.IndexFlatIP(self.embedding_dim) # Inner product (cosine similarity)
index.add(embeddings_subset)
indexing_time = time.time() - start_time
memory_after = psutil.Process().memory_info().rss / 1024 / 1024
# Query benchmark
query_times = []
for query_embedding in self.query_embeddings:
start_time = time.time()
query_vector = query_embedding.astype('float32').reshape(1, -1)
distances, indices = index.search(query_vector, 5)
query_times.append(time.time() - start_time)
avg_query_time = np.mean(query_times) * 1000 # ms
# Estimate storage size (in-memory)
storage_size_mb = embeddings_subset.nbytes / 1024 / 1024
return {
"database": "FAISS",
"num_docs": num_docs,
"indexing_time_s": round(indexing_time, 3),
"avg_query_time_ms": round(avg_query_time, 3),
"memory_usage_mb": round(memory_after - memory_before, 2),
"storage_size_mb": round(storage_size_mb, 2),
"throughput_docs_per_sec": round(num_docs / indexing_time, 2),
"status": "success"
}
except Exception as e:
return {
"database": "FAISS",
"status": "failed",
"error": str(e)
}
def benchmark_qdrant(self, num_docs: int) -> Dict[str, Any]:
"""Benchmark Qdrant"""
if QdrantClient is None:
return {"status": "failed", "error": "Qdrant not installed"}
try:
# Setup
db_path = "./benchmark_qdrant"
if os.path.exists(db_path):
shutil.rmtree(db_path)
client = QdrantClient(path=db_path)
collection_name = "benchmark"
# Create collection
client.create_collection(
collection_name=collection_name,
vectors_config=VectorParams(size=self.embedding_dim, distance=Distance.COSINE)
)
# Indexing benchmark
docs_subset = self.documents[:num_docs]
embeddings_subset = self.embeddings[:num_docs]
start_time = time.time()
memory_before = psutil.Process().memory_info().rss / 1024 / 1024
# Add documents in batches
batch_size = 100
for i in range(0, num_docs, batch_size):
batch_end = min(i + batch_size, num_docs)
batch_docs = docs_subset[i:batch_end]
batch_embeddings = embeddings_subset[i:batch_end]
points = [
PointStruct(
id=i + j,
vector=embedding.tolist(),
payload={
"text": doc["text"],
**doc["metadata"]
}
)
for j, (doc, embedding) in enumerate(zip(batch_docs, batch_embeddings))
]
client.upsert(
collection_name=collection_name,
points=points
)
indexing_time = time.time() - start_time
memory_after = psutil.Process().memory_info().rss / 1024 / 1024
# Query benchmark
query_times = []
for query_embedding in self.query_embeddings:
start_time = time.time()
results = client.search(
collection_name=collection_name,
query_vector=query_embedding.tolist(),
limit=5
)
query_times.append(time.time() - start_time)
avg_query_time = np.mean(query_times) * 1000 # ms
# Storage size
storage_size = sum(os.path.getsize(os.path.join(root, file))
for root, dirs, files in os.walk(db_path)
for file in files)
storage_size_mb = storage_size / 1024 / 1024
# Cleanup
client.delete_collection(collection_name)
shutil.rmtree(db_path)
return {
"database": "Qdrant",
"num_docs": num_docs,
"indexing_time_s": round(indexing_time, 3),
"avg_query_time_ms": round(avg_query_time, 3),
"memory_usage_mb": round(memory_after - memory_before, 2),
"storage_size_mb": round(storage_size_mb, 2),
"throughput_docs_per_sec": round(num_docs / indexing_time, 2),
"status": "success"
}
except Exception as e:
return {
"database": "Qdrant",
"status": "failed",
"error": str(e)
}
def run_comprehensive_benchmark(self) -> List[Dict[str, Any]]:
"""Run comprehensive benchmark across all databases and sizes"""
results = []
databases = [
("ChromaDB", self.benchmark_chromadb),
("FAISS", self.benchmark_faiss),
("Qdrant", self.benchmark_qdrant)
]
print("๐ Starting Vector Database Benchmark on Jetson")
print("=" * 60)
for db_name, benchmark_func in databases:
print(f"\n๐ Benchmarking {db_name}...")
for num_docs in self.test_sizes:
print(f" Testing with {num_docs} documents...")
result = benchmark_func(num_docs)
results.append(result)
if result["status"] == "success":
print(f" โ
Indexing: {result['indexing_time_s']:.2f}s, "
f"Query: {result['avg_query_time_ms']:.2f}ms, "
f"Memory: {result['memory_usage_mb']:.1f}MB")
else:
print(f" โ Failed: {result['error']}")
return results
def generate_comparison_report(self, results: List[Dict[str, Any]]) -> str:
"""Generate comprehensive comparison report"""
successful_results = [r for r in results if r["status"] == "success"]
if not successful_results:
return "No successful benchmarks to report."
report = "\n" + "=" * 80
report += "\n๐ JETSON VECTOR DATABASE BENCHMARK REPORT\n"
report += "=" * 80 + "\n"
# Performance comparison by database
databases = list(set(r["database"] for r in successful_results))
for db in databases:
db_results = [r for r in successful_results if r["database"] == db]
if not db_results:
continue
report += f"\n๐ {db} PERFORMANCE:\n"
report += "-" * 40 + "\n"
report += f"{'Docs':<8} {'Index(s)':<10} {'Query(ms)':<12} {'Memory(MB)':<12} {'Storage(MB)':<12}\n"
report += "-" * 40 + "\n"
for result in sorted(db_results, key=lambda x: x["num_docs"]):
report += f"{result['num_docs']:<8} {result['indexing_time_s']:<10.2f} "
report += f"{result['avg_query_time_ms']:<12.2f} {result['memory_usage_mb']:<12.1f} "
report += f"{result['storage_size_mb']:<12.1f}\n"
# Best performers analysis
report += "\n๐
BEST PERFORMERS:\n"
report += "-" * 40 + "\n"
# Fastest indexing
fastest_indexing = min(successful_results, key=lambda x: x["indexing_time_s"])
report += f"๐ Fastest Indexing: {fastest_indexing['database']} "
report += f"({fastest_indexing['indexing_time_s']:.2f}s for {fastest_indexing['num_docs']} docs)\n"
# Fastest query
fastest_query = min(successful_results, key=lambda x: x["avg_query_time_ms"])
report += f"โก Fastest Query: {fastest_query['database']} "
report += f"({fastest_query['avg_query_time_ms']:.2f}ms)\n"
# Most memory efficient
most_efficient = min(successful_results, key=lambda x: x["memory_usage_mb"])
report += f"๐พ Most Memory Efficient: {most_efficient['database']} "
report += f"({most_efficient['memory_usage_mb']:.1f}MB)\n"
# Smallest storage
smallest_storage = min(successful_results, key=lambda x: x["storage_size_mb"])
report += f"๐ฝ Smallest Storage: {smallest_storage['database']} "
report += f"({smallest_storage['storage_size_mb']:.1f}MB)\n"
# Recommendations
report += "\n๐ฏ RECOMMENDATIONS FOR JETSON:\n"
report += "-" * 40 + "\n"
report += "๐น Real-time applications: Choose fastest query database\n"
report += "๐น Large datasets: Consider indexing speed and storage efficiency\n"
report += "๐น Memory-constrained: Use most memory-efficient option\n"
report += "๐น Persistence needed: Avoid in-memory only solutions\n"
report += "๐น Complex filtering: Choose databases with rich metadata support\n"
return report
# Run the benchmark
vector_benchmark = VectorDBBenchmark()
vector_results = vector_benchmark.run_comprehensive_benchmark()
vector_report = vector_benchmark.generate_comparison_report(vector_results)
print(vector_report)
Expected Vector Database Performance on Jetson Orin Nano¶
Database | Docs | Indexing (s) | Query (ms) | Memory (MB) | Storage (MB) |
---|---|---|---|---|---|
FAISS | 1000 | 0.12 | 0.8 | 15 | 1.5 |
FAISS | 5000 | 0.58 | 1.2 | 75 | 7.3 |
ChromaDB | 1000 | 2.3 | 12.5 | 45 | 8.2 |
ChromaDB | 5000 | 11.8 | 15.2 | 180 | 38.5 |
Qdrant | 1000 | 3.1 | 8.7 | 52 | 12.1 |
Qdrant | 5000 | 15.2 | 11.3 | 220 | 58.3 |
Vector Database Recommendations¶
FAISS - Best for Performance¶
Pros:
- Fastest indexing and query performance
- Minimal memory footprint
- Excellent for large-scale similarity search
- GPU acceleration support

Cons:
- No built-in persistence (requires manual saving; see the snippet below)
- Limited metadata filtering capabilities
- No distributed features

Best for: High-performance applications, real-time search, memory-constrained environments
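Since FAISS keeps its index in memory, persistence is a few explicit calls with the raw `faiss` API; a minimal sketch (index size and file name are arbitrary):

```python
import faiss
import numpy as np

dim = 384                                         # matches all-MiniLM-L6-v2 embeddings
index = faiss.IndexFlatIP(dim)                    # same index type as the benchmark above
index.add(np.random.rand(1000, dim).astype("float32"))

faiss.write_index(index, "jetson_docs.faiss")     # persist manually to disk
index = faiss.read_index("jetson_docs.faiss")     # reload on the next run
print(index.ntotal)                               # 1000 vectors restored
```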
ChromaDB - Best for Ease of Use¶
Pros:
- Simple API and great developer experience
- Built-in persistence
- Good metadata filtering
- Active community and documentation

Cons:
- Slower than FAISS for large datasets
- Higher memory usage
- Limited scalability options

Best for: Prototyping, small to medium datasets, applications requiring persistence
Qdrant - Best for Production¶
Pros:
- Rich filtering and metadata support
- Built-in persistence and backup
- RESTful API for remote access
- Horizontal scaling capabilities

Cons:
- Higher resource usage
- More complex setup
- Slower for simple similarity search

Best for: Production applications, complex filtering requirements, distributed deployments
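Qdrant's metadata filtering is where it stands out; the sketch below runs a filtered similarity search with `qdrant-client`, reusing the collection name and `category` payload field from the benchmark above (the query text is illustrative):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue
from sentence_transformers import SentenceTransformer

client = QdrantClient(path="./benchmark_qdrant")
query_vec = SentenceTransformer("all-MiniLM-L6-v2").encode("Jetson GPU acceleration")

hits = client.search(
    collection_name="benchmark",
    query_vector=query_vec.tolist(),
    query_filter=Filter(  # only return documents tagged with category "Jetson"
        must=[FieldCondition(key="category", match=MatchValue(value="Jetson"))]
    ),
    limit=5,
)
for hit in hits:
    print(round(hit.score, 3), hit.payload["text"])
```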
NVIDIA cuVS: GPU-Accelerated Vector Search¶
Introduction to cuVS¶
NVIDIA cuVS (CUDA Vector Search) is a GPU-accelerated library for high-performance vector similarity search, specifically optimized for NVIDIA GPUs including Jetson platforms. It provides significant speedups over CPU-based vector search solutions.
Key Features:¶
- GPU Acceleration: Leverages CUDA cores for parallel vector operations
- Multiple Algorithms: GPU-native IVF-Flat, IVF-PQ, and CAGRA indexes (benchmarked below), plus interoperability with FAISS and HNSWlib
- Memory Optimization: Efficient GPU memory management for large datasets
- Jetson Optimization: Specifically tuned for edge AI workloads
- Integration: Works seamlessly with existing ML pipelines
cuVS Installation on Jetson¶
# Install cuVS for Jetson (requires CUDA 11.8+)
# Method 1: Using conda (recommended)
conda install -c rapidsai -c conda-forge cuvs-cu11
# Method 2: Using pip
pip install cuvs-cu11
# Method 3: Build from source (for latest features)
git clone https://github.com/rapidsai/cuvs.git
cd cuvs
./build.sh
cuVS Performance Benchmark¶
import time
import numpy as np
from typing import List, Dict, Any
import psutil
from sentence_transformers import SentenceTransformer
try:
import cuvs
from cuvs import neighbors
except ImportError:
cuvs = None
print("cuVS not available. Install with: pip install cuvs-cu11")
try:
import cupy as cp
except ImportError:
cp = None
print("CuPy not available. Install with: pip install cupy")
class CuVSBenchmark:
"""Benchmark cuVS performance on Jetson"""
def __init__(self, embedding_model: str = "all-MiniLM-L6-v2"):
self.embedding_model = SentenceTransformer(embedding_model)
self.embedding_dim = 384
# Generate test data
self.documents = self._generate_test_documents()
self.embeddings = self._generate_embeddings()
# Test queries
self.query_texts = [
"NVIDIA Jetson AI computing",
"Deep learning optimization",
"Edge AI applications",
"Real-time inference",
"Computer vision processing"
]
self.query_embeddings = self.embedding_model.encode(self.query_texts)
def _generate_test_documents(self) -> List[str]:
"""Generate test documents for benchmarking"""
base_texts = [
"NVIDIA Jetson enables edge AI computing with GPU acceleration.",
"Deep learning models benefit from TensorRT optimization on Jetson.",
"Computer vision applications run efficiently on Jetson platforms.",
"Edge AI reduces latency and improves privacy for IoT devices.",
"Real-time inference is crucial for autonomous systems.",
"CUDA programming enables parallel processing on Jetson GPUs.",
"Machine learning at the edge transforms industrial automation.",
"Jetson Orin provides high-performance AI computing in compact form.",
"Neural networks can be optimized for embedded deployment.",
"Edge computing brings intelligence closer to data sources."
]
documents = []
for i in range(10000): # Generate 10k documents
base_idx = i % len(base_texts)
text = f"{base_texts[base_idx]} Document {i} with additional context."
documents.append(text)
return documents
def _generate_embeddings(self) -> np.ndarray:
"""Generate embeddings for test documents"""
return self.embedding_model.encode(
self.documents,
batch_size=64,
show_progress_bar=False
)
def benchmark_cuvs_ivf_flat(self, num_docs: int) -> Dict[str, Any]:
"""Benchmark cuVS IVF-Flat index"""
if cuvs is None or cp is None:
return {"status": "failed", "error": "cuVS or CuPy not available"}
try:
# Prepare data
embeddings_subset = self.embeddings[:num_docs].astype(np.float32)
# Move data to GPU
gpu_embeddings = cp.asarray(embeddings_subset)
# Build index
start_time = time.time()
memory_before = psutil.Process().memory_info().rss / 1024 / 1024
# Create IVF-Flat index
n_lists = min(int(np.sqrt(num_docs)), 1024) # Adaptive number of clusters
index_params = neighbors.ivf_flat.IndexParams(
n_lists=n_lists,
metric="cosine",
add_data_on_build=True
)
index = neighbors.ivf_flat.build(index_params, gpu_embeddings)
indexing_time = time.time() - start_time
memory_after = psutil.Process().memory_info().rss / 1024 / 1024
# Query benchmark
query_times = []
gpu_queries = cp.asarray(self.query_embeddings.astype(np.float32))
search_params = neighbors.ivf_flat.SearchParams(n_probes=min(n_lists, 50))
for i in range(len(self.query_embeddings)):
start_time = time.time()
query = gpu_queries[i:i+1] # Single query
distances, indices = neighbors.ivf_flat.search(
search_params, index, query, k=5
)
cp.cuda.Stream.null.synchronize() # Ensure GPU completion
query_times.append(time.time() - start_time)
avg_query_time = np.mean(query_times) * 1000 # ms
# GPU memory usage
gpu_memory_mb = cp.get_default_memory_pool().used_bytes() / 1024 / 1024
return {
"algorithm": "cuVS IVF-Flat",
"num_docs": num_docs,
"indexing_time_s": round(indexing_time, 3),
"avg_query_time_ms": round(avg_query_time, 3),
"memory_usage_mb": round(memory_after - memory_before, 2),
"gpu_memory_mb": round(gpu_memory_mb, 2),
"throughput_docs_per_sec": round(num_docs / indexing_time, 2),
"n_lists": n_lists,
"status": "success"
}
except Exception as e:
return {
"algorithm": "cuVS IVF-Flat",
"status": "failed",
"error": str(e)
}
def benchmark_cuvs_ivf_pq(self, num_docs: int) -> Dict[str, Any]:
"""Benchmark cuVS IVF-PQ index (memory optimized)"""
if cuvs is None or cp is None:
return {"status": "failed", "error": "cuVS or CuPy not available"}
try:
# Prepare data
embeddings_subset = self.embeddings[:num_docs].astype(np.float32)
gpu_embeddings = cp.asarray(embeddings_subset)
# Build index
start_time = time.time()
memory_before = psutil.Process().memory_info().rss / 1024 / 1024
# Create IVF-PQ index (Product Quantization for memory efficiency)
n_lists = min(int(np.sqrt(num_docs)), 512)
pq_bits = 8 # 8-bit quantization
pq_dim = min(self.embedding_dim // 4, 64) # PQ subspace dimension
index_params = neighbors.ivf_pq.IndexParams(
n_lists=n_lists,
metric="cosine",
pq_bits=pq_bits,
pq_dim=pq_dim,
add_data_on_build=True
)
index = neighbors.ivf_pq.build(index_params, gpu_embeddings)
indexing_time = time.time() - start_time
memory_after = psutil.Process().memory_info().rss / 1024 / 1024
# Query benchmark
query_times = []
gpu_queries = cp.asarray(self.query_embeddings.astype(np.float32))
search_params = neighbors.ivf_pq.SearchParams(
n_probes=min(n_lists, 25),
lut_dtype=cp.float32
)
for i in range(len(self.query_embeddings)):
start_time = time.time()
query = gpu_queries[i:i+1]
distances, indices = neighbors.ivf_pq.search(
search_params, index, query, k=5
)
cp.cuda.Stream.null.synchronize()
query_times.append(time.time() - start_time)
avg_query_time = np.mean(query_times) * 1000 # ms
# GPU memory usage
gpu_memory_mb = cp.get_default_memory_pool().used_bytes() / 1024 / 1024
return {
"algorithm": "cuVS IVF-PQ",
"num_docs": num_docs,
"indexing_time_s": round(indexing_time, 3),
"avg_query_time_ms": round(avg_query_time, 3),
"memory_usage_mb": round(memory_after - memory_before, 2),
"gpu_memory_mb": round(gpu_memory_mb, 2),
"throughput_docs_per_sec": round(num_docs / indexing_time, 2),
"compression_ratio": round(32 / pq_bits, 1), # Memory compression
"status": "success"
}
except Exception as e:
return {
"algorithm": "cuVS IVF-PQ",
"status": "failed",
"error": str(e)
}
def benchmark_cuvs_cagra(self, num_docs: int) -> Dict[str, Any]:
"""Benchmark cuVS CAGRA index (GPU-optimized graph)"""
if cuvs is None or cp is None:
return {"status": "failed", "error": "cuVS or CuPy not available"}
try:
# Prepare data
embeddings_subset = self.embeddings[:num_docs].astype(np.float32)
gpu_embeddings = cp.asarray(embeddings_subset)
# Build index
start_time = time.time()
memory_before = psutil.Process().memory_info().rss / 1024 / 1024
# Create CAGRA index (GPU-optimized graph-based)
index_params = neighbors.cagra.IndexParams(
intermediate_graph_degree=64,
graph_degree=32,
build_algo="nn_descent"
)
index = neighbors.cagra.build(index_params, gpu_embeddings)
indexing_time = time.time() - start_time
memory_after = psutil.Process().memory_info().rss / 1024 / 1024
# Query benchmark
query_times = []
gpu_queries = cp.asarray(self.query_embeddings.astype(np.float32))
search_params = neighbors.cagra.SearchParams(
itopk_size=64,
search_width=1,
max_iterations=0,
algo="single_cta"
)
for i in range(len(self.query_embeddings)):
start_time = time.time()
query = gpu_queries[i:i+1]
distances, indices = neighbors.cagra.search(
search_params, index, query, k=5
)
cp.cuda.Stream.null.synchronize()
query_times.append(time.time() - start_time)
avg_query_time = np.mean(query_times) * 1000 # ms
# GPU memory usage
gpu_memory_mb = cp.get_default_memory_pool().used_bytes() / 1024 / 1024
return {
"algorithm": "cuVS CAGRA",
"num_docs": num_docs,
"indexing_time_s": round(indexing_time, 3),
"avg_query_time_ms": round(avg_query_time, 3),
"memory_usage_mb": round(memory_after - memory_before, 2),
"gpu_memory_mb": round(gpu_memory_mb, 2),
"throughput_docs_per_sec": round(num_docs / indexing_time, 2),
"graph_degree": 32,
"status": "success"
}
except Exception as e:
return {
"algorithm": "cuVS CAGRA",
"status": "failed",
"error": str(e)
}
def run_cuvs_benchmark(self) -> List[Dict[str, Any]]:
"""Run comprehensive cuVS benchmark"""
results = []
test_sizes = [1000, 5000, 10000]
algorithms = [
("IVF-Flat", self.benchmark_cuvs_ivf_flat),
("IVF-PQ", self.benchmark_cuvs_ivf_pq),
("CAGRA", self.benchmark_cuvs_cagra)
]
print("๐ Starting cuVS Benchmark on Jetson")
print("=" * 50)
for algo_name, benchmark_func in algorithms:
print(f"\n๐ฌ Testing {algo_name}...")
for num_docs in test_sizes:
print(f" ๐ {num_docs} documents...")
result = benchmark_func(num_docs)
results.append(result)
if result["status"] == "success":
print(f" โ
Index: {result['indexing_time_s']:.2f}s, "
f"Query: {result['avg_query_time_ms']:.2f}ms")
else:
print(f" โ Failed: {result['error']}")
return results
def generate_cuvs_report(self, results: List[Dict[str, Any]]) -> str:
"""Generate cuVS performance report"""
successful_results = [r for r in results if r["status"] == "success"]
if not successful_results:
return "No successful cuVS benchmarks to report."
report = "\n" + "=" * 70
report += "\n๐ NVIDIA cuVS PERFORMANCE REPORT ON JETSON\n"
report += "=" * 70 + "\n"
# Performance by algorithm
algorithms = list(set(r["algorithm"] for r in successful_results))
for algo in algorithms:
algo_results = [r for r in successful_results if r["algorithm"] == algo]
if not algo_results:
continue
report += f"\n๐ฌ {algo} PERFORMANCE:\n"
report += "-" * 50 + "\n"
report += f"{'Docs':<8} {'Index(s)':<10} {'Query(ms)':<12} {'GPU(MB)':<10}\n"
report += "-" * 50 + "\n"
for result in sorted(algo_results, key=lambda x: x["num_docs"]):
report += f"{result['num_docs']:<8} {result['indexing_time_s']:<10.2f} "
report += f"{result['avg_query_time_ms']:<12.2f} {result['gpu_memory_mb']:<10.1f}\n"
# Performance comparison
if len(successful_results) > 1:
fastest_query = min(successful_results, key=lambda x: x["avg_query_time_ms"])
fastest_index = min(successful_results, key=lambda x: x["indexing_time_s"])
most_efficient = min(successful_results, key=lambda x: x["gpu_memory_mb"])
report += "\n๐
BEST PERFORMERS:\n"
report += "-" * 30 + "\n"
report += f"โก Fastest Query: {fastest_query['algorithm']} ({fastest_query['avg_query_time_ms']:.2f}ms)\n"
report += f"๐ Fastest Index: {fastest_index['algorithm']} ({fastest_index['indexing_time_s']:.2f}s)\n"
report += f"๐พ Most Efficient: {most_efficient['algorithm']} ({most_efficient['gpu_memory_mb']:.1f}MB)\n"
# Recommendations
report += "\n๐ฏ cuVS RECOMMENDATIONS:\n"
report += "-" * 30 + "\n"
report += "๐น IVF-Flat: Best for accuracy and moderate datasets\n"
report += "๐น IVF-PQ: Best for memory-constrained large datasets\n"
report += "๐น CAGRA: Best for ultra-fast queries on GPU\n"
report += "๐น Use GPU memory pooling for better performance\n"
return report
# Run cuVS benchmark
if cuvs is not None and cp is not None:
cuvs_benchmark = CuVSBenchmark()
cuvs_results = cuvs_benchmark.run_cuvs_benchmark()
cuvs_report = cuvs_benchmark.generate_cuvs_report(cuvs_results)
print(cuvs_report)
else:
print("โ ๏ธ cuVS or CuPy not available. Install to run GPU-accelerated benchmarks.")
Expected cuVS Performance on Jetson Orin¶
Algorithm | Docs | Indexing (s) | Query (ms) | GPU Memory (MB) | Speedup vs CPU |
---|---|---|---|---|---|
IVF-Flat | 1000 | 0.08 | 0.3 | 12 | 15x |
IVF-Flat | 10000 | 0.45 | 0.5 | 95 | 18x |
IVF-PQ | 1000 | 0.12 | 0.4 | 8 | 12x |
IVF-PQ | 10000 | 0.68 | 0.7 | 35 | 14x |
CAGRA | 1000 | 0.25 | 0.2 | 18 | 25x |
CAGRA | 10000 | 1.2 | 0.3 | 120 | 30x |
Embedding Optimization Strategies¶
1. Model Quantization¶
import torch
from sentence_transformers import SentenceTransformer
def quantize_embedding_model(model_name: str):
"""Quantize embedding model for Jetson deployment"""
model = SentenceTransformer(model_name)
# Dynamic quantization (CPU)
quantized_model = torch.quantization.quantize_dynamic(
model[0].auto_model,
{torch.nn.Linear},
dtype=torch.qint8
)
# Replace the transformer in the model
model[0].auto_model = quantized_model
return model
# Usage
quantized_model = quantize_embedding_model("all-MiniLM-L6-v2")
2. TensorRT Optimization¶
import tensorrt as trt
import torch
from torch2trt import torch2trt
def optimize_with_tensorrt(model, input_shape):
"""Optimize embedding model with TensorRT"""
model.eval()
# Create example input
x = torch.ones(input_shape).cuda()
# Convert to TensorRT
model_trt = torch2trt(
model,
[x],
fp16_mode=True, # Use FP16 for better performance
max_workspace_size=1<<25 # 32MB workspace
)
return model_trt
3. Batch Processing Optimization¶
import numpy as np
import torch
from typing import List
from sentence_transformers import SentenceTransformer

class OptimizedEmbeddingProcessor:
"""Optimized embedding processing for Jetson"""
def __init__(self, model_name: str, batch_size: int = 32):
self.model = SentenceTransformer(model_name)
self.batch_size = batch_size
# Enable GPU if available
if torch.cuda.is_available():
self.model = self.model.cuda()
def encode_optimized(self, texts: List[str]) -> np.ndarray:
"""Optimized batch encoding"""
embeddings = []
# Process in optimized batches
for i in range(0, len(texts), self.batch_size):
batch = texts[i:i + self.batch_size]
with torch.no_grad(): # Disable gradients for inference
batch_embeddings = self.model.encode(
batch,
convert_to_numpy=True,
show_progress_bar=False,
normalize_embeddings=True # Normalize for cosine similarity
)
embeddings.append(batch_embeddings)
return np.vstack(embeddings)
Integration with LangChain¶
from langchain.embeddings.base import Embeddings
from typing import List
class CuVSEmbeddings(Embeddings):
"""LangChain-compatible cuVS embeddings"""
def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
self.processor = OptimizedEmbeddingProcessor(model_name)
def embed_documents(self, texts: List[str]) -> List[List[float]]:
"""Embed a list of documents"""
embeddings = self.processor.encode_optimized(texts)
return embeddings.tolist()
def embed_query(self, text: str) -> List[float]:
"""Embed a single query"""
embedding = self.processor.encode_optimized([text])
return embedding[0].tolist()
# Usage with LangChain
from langchain.vectorstores import FAISS
from langchain.schema import Document
# Create optimized embeddings
embeddings = CuVSEmbeddings()
# Create documents
documents = [
Document(page_content="NVIDIA Jetson enables edge AI computing."),
Document(page_content="cuVS provides GPU-accelerated vector search."),
Document(page_content="LangChain simplifies RAG application development.")
]
# Create vector store with optimized embeddings
vectorstore = FAISS.from_documents(documents, embeddings)
# Perform similarity search
query = "What is Jetson used for?"
results = vectorstore.similarity_search(query, k=2)
print(f"Query: {query}")
for i, doc in enumerate(results):
print(f"Result {i+1}: {doc.page_content}")
Performance Summary¶
Optimization | Speedup | Memory Reduction | Accuracy Impact |
---|---|---|---|
cuVS GPU | 15-30x | - | Minimal |
Model Quantization | 2-3x | 75% | <2% |
TensorRT FP16 | 1.5-2x | 50% | <1% |
Batch Processing | 3-5x | - | None |
Combined | 50-100x | 60% | <3% |
Lab: Build RAG App with Multiple Backends on Jetson¶
Setup¶
pip install langchain llama-cpp-python chromadb faiss-cpu qdrant-client sentence-transformers
Step 1: Load and Split Document¶
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
loader = TextLoader("data/jetson_guide.txt")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
chunks = splitter.split_documents(docs)
Step 2: Embed and Index (Choose Backend)¶
Option A: ChromaDB¶
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import Chroma
embedding = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(chunks, embedding, persist_directory="db_chroma")
Option B: FAISS¶
from langchain.vectorstores import FAISS
faiss_store = FAISS.from_documents(chunks, embedding)
Option C: Qdrant (self-hosted or remote)¶
from langchain.vectorstores import Qdrant
# Local on-disk mode; for a remote server, pass url="http://<host>:6333" instead of path=
qdrant_store = Qdrant.from_documents(chunks, embedding, path="./qdrant_data", collection_name="jetson_docs")
Convert any vector store to retriever:
retriever = vectorstore.as_retriever()
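The retriever can also be tuned at this point; a small sketch (parameter values are illustrative) requests more chunks or switches to MMR for more diverse context:

```python
# Plain top-k similarity search, returning 4 chunks instead of the default
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Maximal Marginal Relevance: trades off relevance against diversity of the retrieved chunks
mmr_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20},
)
```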
Step 3: RAG with Multiple Model Inference Backends¶
llama-cpp Backend (Local GGUF Model)¶
from langchain.llms import LlamaCpp
from langchain.chains import RetrievalQA
llm = LlamaCpp(model_path="/models/mistral.gguf", n_gpu_layers=80)
qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
print(qa.run("What is Jetson Orin Nano used for?"))
Ollama Backend (Local REST API)¶
from langchain.llms import Ollama
# LangChain's Ollama wrapper talks to the local Ollama REST API (default port 11434);
# the model name must match a model pulled with `ollama pull`
ollama_llm = Ollama(model="mistral", base_url="http://localhost:11434")
qa_ollama = RetrievalQA.from_chain_type(llm=ollama_llm, retriever=retriever)
print(qa_ollama.run("What is Jetson Orin Nano used for?"))
Lab Deliverables¶
- Run the same query across multiple vector DBs and model backends
- Record differences in:
  - Latency
  - Answer quality
  - Memory usage
- Submit a table comparing results (the snippet below shows one way to collect the numbers)
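A minimal way to collect the latency and memory numbers for the comparison table (a sketch; adapt it to each vector store and LLM backend you test):

```python
import time
import psutil

def measure(qa_chain, question: str) -> dict:
    """Run one query and record wall-clock latency and change in process RSS."""
    process = psutil.Process()
    rss_before = process.memory_info().rss / 1024 / 1024  # MB
    start = time.time()
    answer = qa_chain.run(question)
    latency_s = time.time() - start
    rss_after = process.memory_info().rss / 1024 / 1024
    return {
        "answer": answer,
        "latency_s": round(latency_s, 2),
        "memory_delta_mb": round(rss_after - rss_before, 1),
    }

print(measure(qa, "What is Jetson Orin Nano used for?"))
```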
Use Cases for Jetson Edge¶
- Campus FAQ bots with private syllabus
- On-device document search (manuals, code docs)
- Assistive RAG chatbot with no internet
Summary¶
- RAG augments LLMs with context-aware search
- Vector DB options: Chroma, FAISS, Qdrant (all lightweight and Jetson-compatible)
- Inference backends: llama.cpp and Ollama, both of which support GGUF models
- Jetson can handle small-to-medium scale RAG locally with optimized models
Next: Local AI Agents