Skip to content

๐Ÿ›๏ธ Advanced CNNs, Optimization & Image Processing on Jetson

Prerequisite: 04 โ€” Deep Learning & CNNs covers the fundamentals (what a CNN is, layer types, the jetson_cnn_toolkit.py quick-start). This chapter goes deeper: advanced architectures, Jetson-specific optimization (quantization, TensorRT), and real-time image-processing pipelines.

๐ŸŽฏ Learning Objectives

  • Compare advanced CNN architectures (ResNet, MobileNet, EfficientNet) on Jetson
  • Apply Jetson optimizations: FP16/INT8 and the PyTorch โ†’ ONNX โ†’ TensorRT pipeline
  • Build an image-classification + camera pipeline with OpenCV

๐Ÿš€ Environment: the Jetson container

Everything here runs inside the jetson-dev container (PyTorch CUDA, torchvision, OpenCV, TensorRT all preinstalled). Start it and move to the toolkit:

sjsujetsontool shell                       # host folder /Developer is mounted into the container
cd /Developer/edgeAI/edgeLLM               # this repo lives at /Developer/edgeAI
python3 jetson_cnn_toolkit.py --help

The toolkit exposes four models โ€” basiccnn, resnet, mobilenet, efficientnet โ€” and the modes demo | benchmark | train | inference | optimize. We avoid train in labs (it downloads CIFAR-10 and runs for a long time); demo/benchmark need neither training nor data.


๐Ÿ—๏ธ Advanced CNN Architectures

ResNet โ€” Residual Networks

ResNet introduced skip connections so gradients flow directly through deep networks, solving the vanishing-gradient/degradation problem. - Residual block: output = F(x) + x (identity shortcut) - Bottleneck design: 1ร—1 convs cut compute while preserving capacity - In the toolkit: CustomResNet (built from ResidualBlock), ~11M params.

MobileNet โ€” efficient by design

MobileNet replaces standard convolutions with depthwise-separable convolutions (a depthwise conv + a 1ร—1 pointwise conv), cutting compute ~8โ€“9ร—. - Width multiplier scales channels (0.25โ€“1.0) to trade accuracy for speed - Best speed-to-size ratio on the Orin Nano โ†’ the go-to edge architecture - In the toolkit: MobileNet, ~2.2M params.

EfficientNet โ€” compound scaling

EfficientNet compound-scales depth, width, and resolution together, using MBConv blocks with squeeze-and-excitation attention. - Strong accuracy per FLOP; the base B0 is edge-friendly - In the toolkit: EfficientNet (B0-style), ~3.5M params.

See the trade-offs yourself

python3 jetson_cnn_toolkit.py --mode demo --model all --device cuda
  basiccnn     | params=  1,147,914 | out=(32, 10) |   2.9 ms/batch | 10965 img/s
  resnet       | params= 11,181,642 | out=(32, 10) |   8.6 ms/batch |  3707 img/s
  mobilenet    | params=  2,155,338 | out=(32, 10) |   6.0 ms/batch |  5333 img/s
  efficientnet | params=  3,472,714 | out=(32, 10) |   9.6 ms/batch |  3330 img/s
Note how mobilenet delivers far more img/s per parameter than resnet โ€” the essence of edge-efficient design.


โš™๏ธ Jetson-Specific Optimization

1. Precision: FP16 and INT8

Lower precision means less memory and faster math on Jetson's Tensor Cores: - FP16 (half precision): ~2ร— speedup, negligible accuracy loss โ€” the easy default. - INT8 (quantization): ~4ร— smaller, fastest, needs calibration data to preserve accuracy.

2. Benchmark a model's inference

benchmark mode times inference without training (uses synthetic data unless --dataset cifar10):

python3 jetson_cnn_toolkit.py --mode benchmark --model mobilenet --dataset custom
python3 jetson_cnn_toolkit.py --mode benchmark --model all --dataset custom   # compare all
Results (FPS, latency, memory) are written to outputs/benchmark_results.json with a chart in outputs/benchmark_comparison.png.

3. TensorRT optimization

optimize mode exports your trained model to a TensorRT engine. It needs a weights file (from --mode train) and a Jetson with TensorRT:

python3 jetson_cnn_toolkit.py --mode optimize --model resnet \
    --weights outputs/resnet_cifar10_best.pth --precision fp16

Requires a trained .pth. If you haven't trained, use the manual PyTorch โ†’ ONNX โ†’ TensorRT workflow below, which works with pretrained torchvision weights.


โšก TensorRT: PyTorch โ†’ ONNX โ†’ TensorRT

TensorRT is NVIDIA's inference optimizer/runtime. The standard path converts a model to ONNX, then builds a TensorRT engine.

1. Export a pretrained model to ONNX (run in the container):

import torch, torchvision.models as models
model = models.resnet18(weights="IMAGENET1K_V1").eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "resnet18.onnx", opset_version=13,
                  input_names=["input"], output_names=["output"])
print("exported resnet18.onnx")

2. Build a TensorRT engine with trtexec:

# FP16 engine
trtexec --onnx=resnet18.onnx --saveEngine=resnet18_fp16.trt --fp16
# this also prints throughput/latency for the optimized engine

3. Compare the PyTorch vs TensorRT latency โ€” TensorRT typically gives 2โ€“5ร— lower latency on Jetson.


๐Ÿ–ผ๏ธ Image Processing Tools on Jetson

Tool / Library Purpose
OpenCV (cv2) Real-time image/video processing, camera capture
Pillow (PIL) Image loading and conversion
PyTorch / TensorRT CNN inference
v4l2-ctl Inspect/configure cameras
GStreamer Hardware-accelerated media/camera pipelines

Classify an image with a pretrained CNN

A complete runnable example (uses torchvision ImageNet weights โ€” no training):

import torch, torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.mobilenet_v2(weights="IMAGENET1K_V1").eval().cuda()
tf = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                T.Normalize([0.485,0.456,0.406], [0.229,0.224,0.225])])
x = tf(Image.open("cat.jpg").convert("RGB")).unsqueeze(0).cuda()
with torch.no_grad():
    probs = model(x).softmax(1)
top = probs.topk(3)
print("top-3 class ids:", top.indices[0].tolist(), "scores:", top.values[0].tolist())

Capture from a camera with OpenCV

import cv2
cap = cv2.VideoCapture(0)              # USB camera; CSI cameras use a GStreamer pipeline
ok, frame = cap.read()
if ok:
    cv2.imwrite("frame.jpg", frame)
    print("captured", frame.shape)
cap.release()

๐Ÿงช Lab: Architecture comparison & optimization

Goal: quantify the accuracy/speed/size trade-offs and accelerate a model โ€” without long training.

  1. Compare architectures (no training):

    python3 jetson_cnn_toolkit.py --mode demo --model all --device cuda
    
    Record params and img/s for each. Which gives the best img/s per million params?

  2. Benchmark a chosen model and inspect the JSON/chart:

    python3 jetson_cnn_toolkit.py --mode benchmark --model mobilenet --dataset custom
    

  3. Accelerate with TensorRT using the PyTorch โ†’ ONNX โ†’ trtexec --fp16 workflow above on resnet18, and compare the reported latency to plain PyTorch.

  4. (Optional, slow) Train basiccnn on CIFAR-10 for a few epochs and run --mode inference on the saved weights.

Deliverables

  • A table of model ยท params ยท img/s (from step 1)
  • benchmark_results.json + the comparison chart (step 2)
  • PyTorch-vs-TensorRT latency numbers (step 3)
  • One paragraph: which model would you deploy on the Orin Nano, and why?

๐Ÿ“Œ Summary

  • Architecture matters: depthwise-separable (MobileNet) and compound scaling (EfficientNet) buy speed/size on the edge; skip connections (ResNet) buy depth.
  • Optimize for Jetson: FP16 is the easy 2ร— win; INT8/TensorRT go further with calibration.
  • jetson_cnn_toolkit.py gives you demo/benchmark for instant, training-free experiments, plus train/inference/optimize for the full pipeline.
  • OpenCV + TensorRT turn a trained model into a real-time camera classifier.

โ†’ Next: Transformers & NLP on Jetson