π§ CUDA Programming Fundamentals and Jetson CUDA examples¶
Author: Dr. Kaikai Liu, Ph.D.
Position: Associate Professor, Computer Engineering
Institution: San Jose State University
Contact: kaikai.liu@sjsu.edu
πΉ What is CUDA?¶
CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model that enables developers to harness GPU power for general-purpose computing. Unlike traditional CPU programming, CUDA allows you to write programs that execute thousands of threads simultaneously on the GPU.
Key CUDA Concepts:¶
Concept | Description | Example |
---|---|---|
Kernel | Function that runs on GPU | __global__ void add_vectors(float* a, float* b, float* c) |
Thread | Basic execution unit | Each thread processes one array element |
Block | Group of threads (up to 1024) | 256 threads per block |
Grid | Collection of blocks | 1000 blocks in a grid |
Warp | 32 threads executed together | Hardware scheduling unit |
CUDA Memory Hierarchy:¶
βββββββββββββββββββββββββββββββββββββββββββ
β Global Memory β β Largest, slowest (LPDDR5)
β (8GB) β
βββββββββββββββββββββββββββββββββββββββββββ€
β L2 Cache (2MB) β β Shared across SMs
βββββββββββββββββββββββββββββββββββββββββββ€
β SM 0 β SM 1 β ... β SM 15 β
β βββββββ β βββββββ β β βββββββ β
β βL1/ β β βL1/ β β β βL1/ β β β Fast, per-SM
β βShr β β βShr β β β βShr β β
β β64KB β β β64KB β β β β64KB β β
β βββββββ β βββββββ β β βββββββ β
β Regs β Regs β β Regs β β Fastest, per-thread
β 64K β 64K β β 64K β
βββββββββββββββββββββββββββββββββββββββββββ
π― Jetson Orin Nano CUDA Specifications¶
Feature | Jetson Orin Nano | Desktop RTX 4060 | Data Center A100 |
---|---|---|---|
CUDA Cores | 512 | 3072 | 6912 |
Tensor Cores | 16 (3rd Gen) | 128 (3rd Gen) | 432 (3rd Gen) |
SMs | 16 | 24 | 108 |
Memory | 8GB LPDDR5 | 8GB GDDR6 | 40/80GB HBM2e |
Memory Bandwidth | 102 GB/s | 272 GB/s | 1555 GB/s |
Power | 7-25W | 115W | 400W |
CUDA Compute | 8.7 | 8.9 | 8.0 |
Key Insight: Jetson Orin Nano provides excellent CUDA capability per watt, making it ideal for edge AI applications where power efficiency is critical.
π§ CUDA Development on Jetson¶
Jetson supports native CUDA (C/C++) for fine-grained control. Ideal for performance-critical compute kernels.
NVIDIA CUDA Compiler (NVCC): NVCC is a compiler driver provided by NVIDIA for compiling CUDA C/C++ programs. It's a toolchain that manages the compilation process, generating binary executables containing both host (CPU) code and device (GPU) code.
Install CUDA nvcc: Even though JetPack 6.2 (L4T 36.4.3) includes CUDA 12.6, the nvcc command is not installed by default on Jetson devices starting from JetPack 6.x. CUDA is split into host and device components. On Jetson, only the runtime components of CUDA are installed by default (for deploying and running models). The full CUDA toolkit (including nvcc, compiler, samples, etc.) is now optional.
We already installed the full CUDA toolkit in our provided Jetson image. You can check the nvcc
command by running:
sjsujetson@sjsujetson-01:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:14:07_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0
~/.bashrc
export PATH=/usr/local/cuda-12.6/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.6/lib64:$LD_LIBRARY_PATH
π§ͺ Run CUDA Samples¶
To build cuda samples, we need to use the new version of cmake (already downloaded in /Developer
)
sjsujetson@sjsujetson-01:~$ export PATH=/Developer/cmake-3.28.3-linux-aarch64/bin:$PATH
sjsujetson@sjsujetson-01:~$ cmake --version
cmake version 3.28.3
CMake suite maintained and supported by Kitware (kitware.com/cmake).
cuda-samples
folder is also available in the /Developer
folder, you can run cudda samples inside the build folder:
sjsujetson@sjsujetson-01:/Developer/cuda-samples/build$ ./0_Introduction/vectorAdd/vectorAdd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
sjsujetson@sjsujetson-01:/Developer/cuda-samples/build$ ./1_Utilities/deviceQuery/deviceQuery
...
Result = PASS
sjsujetson@sjsujetson-01:/Developer/cuda-samples/build$ ./0_Introduction/matrixMul/matrixMul
sjsujetson@sjsujetson-01:/Developer/cuda-samples/build$ ./0_Introduction/asyncAPI/asyncAPI #Demonstrates overlapping data transfers and kernel execution using CUDA streams.
sjsujetson@sjsujetson-01:/Developer/cuda-samples/build$ ./0_Introduction/UnifiedMemoryStreams/UnifiedMemoryStreams #Demonstrates using Unified Memory with async memory prefetching
sjsujetson@sjsujetson-01:/Developer/cuda-samples/build$ ./2_Concepts_and_Techniques/convolutionTexture/convolutionTexture #Demonstrates using CUDAβs texture memory to do efficient image convolution.
sjsujetson@sjsujetson-01:/Developer/cuda-samples/build$ ./2_Concepts_and_Techniques/histogram/histogram #Computes histograms with/without shared memory.
If you made changes to the sample code, and want to rebuild it, you can use
sjsujetson@sjsujetson-01:/Developer/cuda-samples/build$ cmake ../Samples -DCMAKE_CUDA_ARCHITECTURES="72;87"
-- Configuring done (14.4s)
-- Generating done (0.7s)
-- Build files have been written to: /Developer/cuda-samples/build
sjsujetson@sjsujetson-01:/Developer/cuda-samples/build$ make -j$(nproc)
You can get the cuda architecture from
deviceQuery | grep "CUDA Capability"
CUDA Capability Major/Minor version number: 8.7
NVIDIA CUDA Profiling Tools¶
1. NVIDIA Nsight Systems - Installation in Container¶
NVIDIA Nsight installation: Nsight system is expected to be install manually if required. Please run below within the container:
$ sjsujetsontool shell #enter into container
root@sjsujetson-01:/Developer/cuda-samples/build# apt update
\
#download the nsight package from https://repo.download.nvidia.com/jetson/#Jetpack%206.1/6.2/6.2.1
root@sjsujetson-01:/Developer# wget https://repo.download.nvidia.com/jetson/common/pool/main/n/nsight-systems-2024.5.4/nsight-systems-2024.5.4_2024.5.4.34-245434855735v0_arm64.deb
#inspect the package
root@sjsujetson-01:/Developer# dpkg -c nsight-systems-2024.5.4_2024.5.4.34-245434855735v0_arm64.deb
root@sjsujetson-01:/Developer# apt install -y ./nsight-systems-2024.5.4_2024.5.4.34-245434855735v0_arm64.deb
root@sjsujetson-01:/Developer# nsys --version
NVIDIA Nsight Systems version 2024.5.4.34-245434855735v0
root@sjsujetson-01:/Developer# which nsys
/usr/local/bin/nsys
2. NVIDIA Nsight Systems - System-wide profiling¶
# Profile a CUDA application
root@sjsujetson-01:/Developer/cuda-samples/build# nsys profile --stats=true ./0_Introduction/vectorAdd/vectorAdd
....
Generated:
/Developer/cuda-samples/build/report2.nsys-rep
/Developer/cuda-samples/build/report2.sqlite
report2.nsys-rep
is the Main profiling report (for Nsight Systems UI or CLI import/export); report2.sqlite
is the Internal database format (used for CLI stats and custom scripts)
If you have Nsight Systems GUI installed on your host PC (x86) or remote desktop session, you can Open GUI on Host Machine: nsys-ui /path/to/report2.nsys-rep
.
You can also get high-level summaries without the GUI:
root@sjsujetson-01:/Developer/cuda-samples/build# nsys stats --force-export=true /Developer/cuda-samples/build/report2.nsys-rep
Generating SQLite file /Developer/cuda-samples/build/report2.sqlite from /Developer/cuda-samples/build/report2.nsys-rep
Processing [/Developer/cuda-samples/build/report2.sqlite] with [/opt/nvidia/nsight-systems/2024.5.4/host-linux-armv8/reports/nvtx_sum.py]...
** OS Runtime Summary (osrt_sum):
** CUDA API Summary (cuda_api_sum):
** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):
** CUDA GPU MemOps Summary (by Time) (cuda_gpu_mem_time_sum):
** CUDA GPU MemOps Summary (by Size) (cuda_gpu_mem_size_sum):
root@sjsujetson-01:/Developer/cuda-samples/build# nsys stats --format csv --output report2_summary --force-export=true /Developer/cuda-samples/build/report2.nsys-rep
root@sjsujetson-01:/Developer/cuda-samples/build# ls *.csv
report2_summary_cuda_api_sum.csv
report2_summary_cuda_gpu_kern_sum.csv
report2_summary_cuda_gpu_mem_size_sum.csv
report2_summary_cuda_gpu_mem_time_sum.csv
report2_summary_dx11_pix_sum.csv
report2_summary_dx12_gpu_marker_sum.csv
report2_summary_dx12_pix_sum.csv
report2_summary_nvtx_sum.csv
report2_summary_openacc_sum.csv
report2_summary_opengl_khr_gpu_range_sum.csv
report2_summary_opengl_khr_range_sum.csv
report2_summary_openmp_sum.csv
report2_summary_osrt_sum.csv
report2_summary_syscall_sum.csv
report2_summary_um_cpu_page_faults_sum.csv
report2_summary_um_sum.csv
report2_summary_um_total_sum.csv
report2_summary_vulkan_gpu_marker_sum.csv
report2_summary_vulkan_marker_sum.csv
report2_summary_wddm_queue_sum.csv
Generate timeline view
root@sjsujetson-01:/Developer/cuda-samples/build# nsys profile -o my_profile ./0_Introduction/vectorAdd/vectorAdd
Collecting data...
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
Generating '/tmp/nsys-report-d475.qdstrm'
[1/1] [========================100%] my_profile.nsys-rep
Generated:
/Developer/cuda-samples/build/my_profile.nsys-rep
nsys-ui
3. NVIDIA Nsight Compute - Kernel-level profiling¶
Install Nvidia Nsight Compute, ncu = Nsight Compute CLI profiler for GPU kernel-level profiling (e.g., occupancy, memory throughput, instruction-level analysis). Itβs part of the Nsight Compute package, separate from nsys (Nsight Systems).
root@sjsujetson-01:/Developer# wget https://repo.download.nvidia.com/jetson/dgpu-rm/pool/main/n/nsight-compute/nsight-compute-2024.3.1_2024.3.1.2-1_arm64.deb
root@sjsujetson-01:/Developer# apt install ./nsight-compute-2024.3.1_2024.3.1.2-1_arm64.deb
root@sjsujetson-01:/Developer# ls /opt/nvidia/nsight-compute/2024.3.1/
docs extras host ncu ncu-ui sections target
root@sjsujetson-01:/Developer# export PATH=/opt/nvidia/nsight-compute/2024.3.1:$PATH
root@sjsujetson-01:/Developer# ncu --version
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2024 NVIDIA Corporation
Version 2024.3.1.0 (build 34702747) (public-release)
make SMS="all" debug=1
), then run the following
# Detailed kernel analysis
root@sjsujetson-01:/Developer/cuda-samples/build# ncu --set full ./0_Introduction/vectorAdd/vectorAdd
# Memory throughput analysis
ncu --metrics dram__throughput.avg.pct_of_peak_sustained_elapsed ./0_Introduction/vectorAdd/vectorAdd
π CUDA Programming Model on Jetson Orin Nano¶
π Complete CUDA Programming Examples¶
1. Vector Addition Example¶
#create vector_add.cu - Complete CUDA Vector Addition Example
root@sjsujetson-01:/Developer/mycuda_samples# nano vector_add.cu
root@sjsujetson-01:/Developer/mycuda_samples# nvcc -o vector_add vector_add.cu -arch=sm_87
root@sjsujetson-01:/Developer/mycuda_samples# ls
vector_add vector_add.cu vector_multiply.cu
root@sjsujetson-01:/Developer/mycuda_samples# ./vector_add
Vector addition completed in 72848 microseconds
Result: PASSED
2. Vector Multiplication Example¶
# add vector_multiply.cu - Element-wise Vector Multiplication
root@sjsujetson-01:/Developer/mycuda_samples# nvcc -o vector_multiply vector_multiply.cu -arch=sm_87
root@sjsujetson-01:/Developer/mycuda_samples# ./vector_multiply
Result: PASSED
3. Matrix Multiplication Example¶
#create matrix_multiply.cu - CUDA Matrix Multiplication
root@sjsujetson-01:/Developer/mycuda_samples# nano matrix_multiply.cu
root@sjsujetson-01:/Developer/mycuda_samples# nvcc -O2 -lineinfo -arch=sm_87 -o matrix_multiply matrix_multiply.cu
./matrix_multiply
Result: PASSED
4. Advanced Memory Management Example¶
# create memory_management.cu - Advanced CUDA Memory Management
root@sjsujetson-01:/Developer/mycuda_samples# nvcc -O2 -lineinfo -arch=sm_87 -o memory_management memory_management.cu
π οΈ Build Instructions¶
1. Create Makefile¶
# Makefile for CUDA examples
NVCC = nvcc
CUDA_FLAGS = -std=c++11 -O3 -arch=sm_87 # sm_87 for Jetson Orin, adjust for your device
CXX_FLAGS = -std=c++11 -O3
# Targets
all: vector_add vector_multiply matrix_multiply memory_demo
vector_add: vector_add.cu
$(NVCC) $(CUDA_FLAGS) -o vector_add vector_add.cu
vector_multiply: vector_multiply.cu
$(NVCC) $(CUDA_FLAGS) -o vector_multiply vector_multiply.cu
matrix_multiply: matrix_multiply.cu
$(NVCC) $(CUDA_FLAGS) -o matrix_multiply matrix_multiply.cu
memory_demo: memory_management.cu
$(NVCC) $(CUDA_FLAGS) -o memory_demo memory_management.cu
clean:
rm -f vector_add vector_multiply matrix_multiply memory_demo
# Debug builds with device debugging enabled
debug: CUDA_FLAGS += -g -G
debug: all
# Profile builds optimized for profiling
profile: CUDA_FLAGS += -lineinfo
profile: all
.PHONY: all clean debug profile
2. Build Commands¶
root@sjsujetson-01:/Developer/mycuda_samples# make
nvcc -std=c++11 -O3 -arch=sm_87 -o vector_add vector_add.cu
nvcc -std=c++11 -O3 -arch=sm_87 -o vector_multiply vector_multiply.cu
nvcc -std=c++11 -O3 -arch=sm_87 -o matrix_multiply matrix_multiply.cu
nvcc -std=c++11 -O3 -arch=sm_87 -o memory_demo memory_management.cu
# Build all examples
make all
# Build individual examples
make vector_add
make matrix_multiply
# Build with debug information for profiling
make profile
# Clean build artifacts
make clean
3. Alternative CMake Build¶
# CMakeLists.txt
cmake_minimum_required(VERSION 3.18)
project(CudaExamples LANGUAGES CXX CUDA)
set(CMAKE_CXX_STANDARD 11)
set(CMAKE_CUDA_STANDARD 11)
# Find CUDA
find_package(CUDA REQUIRED)
# Set CUDA architecture (adjust for your GPU)
set(CMAKE_CUDA_ARCHITECTURES 87) # For Jetson Orin
# Add executables
add_executable(vector_add vector_add.cu)
add_executable(vector_multiply vector_multiply.cu)
add_executable(matrix_multiply matrix_multiply.cu)
add_executable(memory_demo memory_management.cu)
# Set properties
set_property(TARGET vector_add PROPERTY CUDA_SEPARABLE_COMPILATION ON)
set_property(TARGET vector_multiply PROPERTY CUDA_SEPARABLE_COMPILATION ON)
set_property(TARGET matrix_multiply PROPERTY CUDA_SEPARABLE_COMPILATION ON)
set_property(TARGET memory_demo PROPERTY CUDA_SEPARABLE_COMPILATION ON)
# CMake build commands
mkdir build && cd build
cmake ..
make -j4
π Profiling with Nsight Systems¶
1. Install Nsight Systems¶
Check the previous sections for the Nsight installation for Jetson
# Verify installation
root@sjsujetson-01:/Developer/mycuda_samples# nsys --version
NVIDIA Nsight Systems version 2024.5.4.34-245434855735v0
2. Basic Profiling Commands¶
# Profile vector addition with detailed GPU metrics
root@sjsujetson-01:/Developer/mycuda_samples# nsys profile --stats=true --force-overwrite=true -o vector_add_profile ./vector_add
Generated:
/Developer/mycuda_samples/vector_add_profile.nsys-rep
/Developer/mycuda_samples/vector_add_profile.sqlite
# Profile with CUDA API tracing
root@sjsujetson-01:/Developer/mycuda_samples# nsys profile --trace=cuda,nvtx --stats=true -o detailed_profile ./matrix_multiply
Generated:
/Developer/mycuda_samples/detailed_profile.nsys-rep
/Developer/mycuda_samples/detailed_profile.sqlite
# Profile with system-wide metrics
root@sjsujetson-01:/Developer/mycuda_samples# nsys profile --trace=cuda,osrt,nvtx --stats=true --force-overwrite=true -o system_profile ./vector_multiply
Generated:
/Developer/mycuda_samples/system_profile.nsys-rep
/Developer/mycuda_samples/system_profile.sqlite
root@sjsujetson-01:/Developer/mycuda_samples# ls
Makefile memory_management vector_add
detailed_profile.nsys-rep memory_management.cu vector_add.cu
detailed_profile.sqlite simple vector_add_profile.nsys-rep
matrix_multiply simple.cu vector_add_profile.sqlite
matrix_multiply.cu system_profile.nsys-rep vector_multiply
memory_demo system_profile.sqlite vector_multiply.cu
3. Advanced Profiling Options¶
# Profile with custom duration and sampling
root@sjsujetson-01:/Developer/mycuda_samples# nsys profile --duration=30 --sample=cpu --trace=cuda,osrt -o long_profile ./memory_demo
Collecting data...
[Memory Manager] Starting async transfer demo...
[Memory Manager] Starting unified memory demo...
First 5 squared values: 0 1 4 9 16
[Memory Manager] Running bandwidth test...
Host-to-Device Bandwidth: 7.44411 GB/s
Generating '/tmp/nsys-report-48f4.qdstrm'
[1/1] [========================100%] long_profile.nsys-rep
Generated:
/Developer/mycuda_samples/long_profile.nsys-rep
# Profile with environment variables for detailed CUDA info
root@sjsujetson-01:/Developer/mycuda_samples# CUDA_LAUNCH_BLOCKING=1 nsys profile --trace=cuda,nvtx --stats=true -o blocking_profile ./matrix_multiply
Generated:
/Developer/mycuda_samples/blocking_profile.nsys-rep
/Developer/mycuda_samples/blocking_profile.sqlite
# Profile with memory transfer analysis
root@sjsujetson-01:/Developer/mycuda_samples# nsys profile --trace=cuda,osrt --cuda-memory-usage=true -o memory_profile ./memory_demo
Collecting data...
[Memory Manager] Starting async transfer demo...
[Memory Manager] Starting unified memory demo...
First 5 squared values: 0 1 4 9 16
[Memory Manager] Running bandwidth test...
Host-to-Device Bandwidth: 7.35592 GB/s
Generating '/tmp/nsys-report-8aab.qdstrm'
[1/1] [========================100%] memory_profile.nsys-rep
Generated:
/Developer/mycuda_samples/memory_profile.nsys-rep
4. Analyzing Profile Results¶
# Generate text report
nsys stats vector_add_profile.nsys-rep
# Export to SQLite for custom analysis
nsys export --type=sqlite vector_add_profile.nsys-rep
# View in Nsight Systems GUI (if available)
nsight-sys vector_add_profile.nsys-rep
5. Code Instrumentation for Better Profiling¶
This part of the code is not tested.
// Add NVTX markers for better profiling visibility
#include <nvtx3/nvToolsExt.h>
void instrumented_vector_add() {
// Mark the beginning of memory allocation
nvtxRangePush("Memory Allocation");
float *d_A, *d_B, *d_C;
cudaMalloc(&d_A, bytes);
cudaMalloc(&d_B, bytes);
cudaMalloc(&d_C, bytes);
nvtxRangePop();
// Mark data transfer
nvtxRangePush("Host to Device Transfer");
cudaMemcpy(d_A, h_A.data(), bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B.data(), bytes, cudaMemcpyHostToDevice);
nvtxRangePop();
// Mark kernel execution
nvtxRangePush("Kernel Execution");
vector_add_kernel<<<blocks, threads_per_block>>>(d_A, d_B, d_C, N);
cudaDeviceSynchronize();
nvtxRangePop();
// Mark result transfer
nvtxRangePush("Device to Host Transfer");
cudaMemcpy(h_C.data(), d_C, bytes, cudaMemcpyDeviceToHost);
nvtxRangePop();
}
6. Performance Optimization Tips¶
# Check GPU utilization
nsys stats --report=gpukernsum profile.nsys-rep
# Analyze memory bandwidth utilization
nsys stats --report=gpumemtimesum profile.nsys-rep
# Check for optimization opportunities
nsys stats --report=cudaapisum profile.nsys-rep
7. Automated Profiling Script¶
#!/bin/bash
# profile_all.sh - Automated profiling script
echo "Building examples..."
make clean && make profile
echo "Profiling vector addition..."
nsys profile --stats=true --force-overwrite=true -o vector_add_profile ./vector_add
echo "Profiling matrix multiplication..."
nsys profile --stats=true --force-overwrite=true -o matrix_multiply_profile ./matrix_multiply
echo "Generating reports..."
nsys stats vector_add_profile.nsys-rep > vector_add_report.txt
nsys stats matrix_multiply_profile.nsys-rep > matrix_multiply_report.txt
echo "Profiling complete. Check *_report.txt files for results."
π― Performance Benchmarking¶
Run and Compare Results¶
# Make executable and run
chmod +x profile_all.sh
./profile_all.sh
# Compare CPU vs GPU performance
time ./vector_add # GPU version
time ./vector_add_cpu # CPU version (if implemented)
# Monitor GPU usage during execution
#watch -n 1 nvidia-smi #nvidia-smi is not available in jetson