Contents — CUDA C++ Programming Guide
URL Source: https://docs.nvidia.com/cuda/archive/13.0.2/cuda-c-programming-guide/contents.html
Published Time: Thu, 04 Dec 2025 03:37:46 GMT
Contents
- 24. Unified Memory Programming
- 24.1. Unified Memory Introduction
- 24.1.1. System Requirements for Unified Memory
- 24.1.2. Programming Model
- 24.1.2.1. Allocation APIs for System-Allocated Memory
- 24.1.2.2. Allocation API for CUDA Managed Memory: cudaMallocManaged()
- 24.1.2.3. Global-Scope Managed Variables Using __managed__
- 24.1.2.4. Difference between Unified Memory and Mapped Memory
- 24.1.2.5. Pointer Attributes
- 24.1.2.6. Runtime detection of Unified Memory Support Level
- 24.1.2.7. GPU Memory Oversubscription
- 24.1.2.8. Performance Hints
- 24.2. Unified memory on devices with full CUDA Unified Memory support
- 24.3. Unified memory on devices without full CUDA Unified Memory support
- 24.3.1. Unified memory on devices with only CUDA Managed Memory support
- 24.3.2. Unified memory on Windows or devices with compute capability 5.x
- 24.3.2.1. Data Migration and Coherency
- 24.3.2.2. GPU Memory Oversubscription
- 24.3.2.3. Multi-GPU
- 24.3.2.4. Coherency and Concurrency
- 24.3.2.4.1. GPU Exclusive Access To Managed Memory
- 24.3.2.4.2. Explicit Synchronization and Logical GPU Activity
- 24.3.2.4.3. Managing Data Visibility and Concurrent CPU + GPU Access with Streams
- 24.3.2.4.4. Stream Association Examples
- 24.3.2.4.5. Stream Attach With Multithreaded Host Programs
- 24.3.2.4.6. Advanced Topic: Modular Programs and Data Access Constraints
- 24.3.2.4.7. Memcpy()/Memset() Behavior With Stream-associated Unified Memory
Copyright © 2007-2025, NVIDIA Corporation & affiliates. All rights reserved.
Last updated on Nov 02, 2025.
- 1. Overview
- 2. What Is the CUDA C Programming Guide?
- 3. Introduction
- 3.1. The Benefits of Using GPUs
- 3.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model
- 3.3. A Scalable Programming Model
- 4. Changelog
- 5. Programming Model
- 5.1. Kernels
- 5.2. Thread Hierarchy
- 5.2.1. Thread Block Clusters
- 5.2.2. Blocks as Clusters
- 5.3. Memory Hierarchy
- 5.4. Heterogeneous Programming
- 5.5. Asynchronous SIMT Programming Model
- 5.5.1. Asynchronous Operations
- 5.6. Compute Capability
- 6. Programming Interface
- 6.1. Compilation with NVCC
- 6.1.1. Compilation Workflow
- 6.1.1.1. Offline Compilation
- 6.1.1.2. Just-in-Time Compilation
- 6.1.2. Binary Compatibility
- 6.1.3. PTX Compatibility
- 6.1.4. Application Compatibility
- 6.1.5. C++ Compatibility
- 6.1.6. 64-Bit Compatibility
- 6.2. CUDA Runtime
- 6.2.1. Initialization
- 6.2.2. Device Memory
- 6.2.3. Device Memory L2 Access Management
- 6.2.3.1. L2 Cache Set-Aside for Persisting Accesses
- 6.2.3.2. L2 Policy for Persisting Accesses
- 6.2.3.3. L2 Access Properties
- 6.2.3.4. L2 Persistence Example
- 6.2.3.5. Reset L2 Access to Normal
- 6.2.3.6. Manage Utilization of L2 Set-Aside Cache
- 6.2.3.7. Query L2 Cache Properties
- 6.2.3.8. Control L2 Cache Set-Aside Size for Persisting Memory Access
- 6.2.4. Shared Memory
- 6.2.5. Distributed Shared Memory
- 6.2.6. Page-Locked Host Memory
- 6.2.6.1. Portable Memory
- 6.2.6.2. Write-Combining Memory
- 6.2.6.3. Mapped Memory
- 6.2.7. Memory Synchronization Domains
- 6.2.7.1. Memory Fence Interference
- 6.2.7.2. Isolating Traffic with Domains
- 6.2.7.3. Using Domains in CUDA
- 6.2.8. Asynchronous Concurrent Execution
- 6.2.8.1. Concurrent Execution between Host and Device
- 6.2.8.2. Concurrent Kernel Execution
- 6.2.8.3. Overlap of Data Transfer and Kernel Execution
- 6.2.8.4. Concurrent Data Transfers
- 6.2.8.5. Streams
- 6.2.8.5.1. Creation and Destruction of Streams
- 6.2.8.5.2. Default Stream
- 6.2.8.5.3. Explicit Synchronization
- 6.2.8.5.4. Implicit Synchronization
- 6.2.8.5.5. Overlapping Behavior
- 6.2.8.5.6. Host Functions (Callbacks)
- 6.2.8.5.7. Stream Priorities
- 6.2.8.6. Programmatic Dependent Launch and Synchronization
- 6.2.8.6.1. Background
- 6.2.8.6.2. API Description
- 6.2.8.6.3. Use in CUDA Graphs
- 6.2.8.7. CUDA Graphs
- 6.2.8.7.1. Graph Structure
- 6.2.8.7.1.1. Node Types
- 6.2.8.7.1.2. Edge Data
- 6.2.8.7.2. Creating a Graph Using Graph APIs
- 6.2.8.7.3. Creating a Graph Using Stream Capture
- 6.2.8.7.3.1. Cross-stream Dependencies and Events
- 6.2.8.7.3.2. Prohibited and Unhandled Operations
- 6.2.8.7.3.3. Invalidation
- 6.2.8.7.4. CUDA User Objects
- 6.2.8.7.5. Updating Instantiated Graphs
- 6.2.8.7.5.1. Graph Update Limitations
- 6.2.8.7.5.2. Whole Graph Update
- 6.2.8.7.5.3. Individual Node Update
- 6.2.8.7.5.4. Individual Node Enable
- 6.2.8.7.6. Using Graph APIs
- 6.2.8.7.7. Device Graph Launch
- 6.2.8.7.7.1. Device Graph Creation
- 6.2.8.7.7.1.1. Device Graph Requirements
- 6.2.8.7.7.1.2. Device Graph Upload
- 6.2.8.7.7.1.3. Device Graph Update
- 6.2.8.7.7.2. Device Launch
- 6.2.8.7.7.2.1. Device Launch Modes
- 6.2.8.7.7.2.1.1. Fire and Forget Launch
- 6.2.8.7.7.2.1.2. Graph Execution Environments
- 6.2.8.7.7.2.1.3. Tail Launch
- 6.2.8.7.7.2.1.3.1. Tail Self-launch
- 6.2.8.7.7.2.1.4. Sibling Launch
- 6.2.8.7.8. Conditional Graph Nodes
- 6.2.8.7.8.1. Conditional Handles
- 6.2.8.7.8.2. Conditional Node Body Graph Requirements
- 6.2.8.7.8.3. Conditional IF Nodes
- 6.2.8.7.8.4. Conditional WHILE Nodes
- 6.2.8.7.8.5. Conditional SWITCH Nodes
- 6.2.8.8. Events
- 6.2.8.8.1. Creation and Destruction of Events
- 6.2.8.8.2. Elapsed Time
- 6.2.8.9. Synchronous Calls
- 6.2.9. Multi-Device System
- 6.2.9.1. Device Enumeration
- 6.2.9.2. Device Selection
- 6.2.9.3. Stream and Event Behavior
- 6.2.9.4. Peer-to-Peer Memory Access
- 6.2.9.4.1. IOMMU on Linux
- 6.2.9.5. Peer-to-Peer Memory Copy
- 6.2.10. Unified Virtual Address Space
- 6.2.11. Interprocess Communication
- 6.2.12. Error Checking
- 6.2.13. Call Stack
- 6.2.14. Texture and Surface Memory
- 6.2.14.1. Texture Memory
- 6.2.14.1.1. Texture Object API
- 6.2.14.1.2. 16-Bit Floating-Point Textures
- 6.2.14.1.3. Layered Textures
- 6.2.14.1.4. Cubemap Textures
- 6.2.14.1.5. Cubemap Layered Textures
- 6.2.14.1.6. Texture Gather
- 6.2.14.2. Surface Memory
- 6.2.14.2.1. Surface Object API
- 6.2.14.2.2. Cubemap Surfaces
- 6.2.14.2.3. Cubemap Layered Surfaces
- 6.2.14.3. CUDA Arrays
- 6.2.14.4. Read/Write Coherency
- 6.2.15. Graphics Interoperability
- 6.2.15.1. OpenGL Interoperability
- 6.2.15.2. Direct3D Interoperability
- 6.2.15.2.1. Direct3D 9 Version
- 6.2.15.2.2. Direct3D 10 Version
- 6.2.15.2.3. Direct3D 11 Version
- 6.2.15.3. SLI Interoperability
- 6.2.16. External Resource Interoperability
- 6.2.16.1. Vulkan Interoperability
- 6.2.16.1.1. Matching Device UUIDs
- 6.2.16.1.2. Importing Memory Objects
- 6.2.16.1.3. Mapping Buffers onto Imported Memory Objects
- 6.2.16.1.4. Mapping Mipmapped Arrays onto Imported Memory Objects
- 6.2.16.1.5. Importing Synchronization Objects
- 6.2.16.1.6. Signaling/Waiting on Imported Synchronization Objects
- 6.2.16.2. OpenGL Interoperability
- 6.2.16.3. Direct3D 12 Interoperability
- 6.2.16.3.1. Matching Device LUIDs
- 6.2.16.3.2. Importing Memory Objects
- 6.2.16.3.3. Mapping Buffers onto Imported Memory Objects
- 6.2.16.3.4. Mapping Mipmapped Arrays onto Imported Memory Objects
- 6.2.16.3.5. Importing Synchronization Objects
- 6.2.16.3.6. Signaling/Waiting on Imported Synchronization Objects
- 6.2.16.4. Direct3D 11 Interoperability
- 6.2.16.4.1. Matching Device LUIDs
- 6.2.16.4.2. Importing Memory Objects
- 6.2.16.4.3. Mapping Buffers onto Imported Memory Objects
- 6.2.16.4.4. Mapping Mipmapped Arrays onto Imported Memory Objects
- 6.2.16.4.5. Importing Synchronization Objects
- 6.2.16.4.6. Signaling/Waiting on Imported Synchronization Objects
- 6.2.16.5. NVIDIA Software Communication Interface Interoperability (NVSCI)
- 6.2.16.5.1. Importing Memory Objects
- 6.2.16.5.2. Mapping Buffers onto Imported Memory Objects
- 6.2.16.5.3. Mapping Mipmapped Arrays onto Imported Memory Objects
- 6.2.16.5.4. Importing Synchronization Objects
- 6.2.16.5.5. Signaling/Waiting on Imported Synchronization Objects
- 6.3. Versioning and Compatibility
- 6.4. Compute Modes
- 6.5. Mode Switches
- 6.6. Tesla Compute Cluster Mode for Windows
- 7. Hardware Implementation
- 7.1. SIMT Architecture
- 7.2. Hardware Multithreading
- 8. Performance Guidelines
- 8.1. Overall Performance Optimization Strategies
- 8.2. Maximize Utilization
- 8.2.1. Application Level
- 8.2.2. Device Level
- 8.2.3. Multiprocessor Level
- 8.2.3.1. Occupancy Calculator
- 8.3. Maximize Memory Throughput
- 8.3.1. Data Transfer between Host and Device
- 8.3.2. Device Memory Accesses
- 8.4. Maximize Instruction Throughput
- 8.5. Minimize Memory Thrashing
- 9. CUDA-Enabled GPUs
- 10. C++ Language Extensions
- 10.1. Function Execution Space Specifiers
- 10.1.1. __global__
- 10.1.2. __device__
- 10.1.3. __host__
- 10.1.4. Undefined behavior
- 10.1.5. __noinline__ and __forceinline__
- 10.1.6. __inline_hint__
- 10.2. Variable Memory Space Specifiers
- 10.2.1. __device__
- 10.2.2. __constant__
- 10.2.3. __shared__
- 10.2.4. __grid_constant__
- 10.2.5. __managed__
- 10.2.6. __restrict__
- 10.3. Built-in Vector Types
- 10.3.1. char, short, int, long, longlong, float, double
- 10.3.2. dim3
- 10.4. Built-in Variables
- 10.4.1. gridDim
- 10.4.2. blockIdx
- 10.4.3. blockDim
- 10.4.4. threadIdx
- 10.4.5. warpSize
- 10.5. Memory Fence Functions
- 10.6. Synchronization Functions
- 10.7. Mathematical Functions
- 10.8. Texture Functions
- 10.8.1. Texture Object API
- 10.8.1.1. tex1Dfetch()
- 10.8.1.2. tex1D()
- 10.8.1.3. tex1DLod()
- 10.8.1.4. tex1DGrad()
- 10.8.1.5. tex2D()
- 10.8.1.6. tex2D() for sparse CUDA arrays
- 10.8.1.7. tex2Dgather()
- 10.8.1.8. tex2Dgather() for sparse CUDA arrays
- 10.8.1.9. tex2DGrad()
- 10.8.1.10. tex2DGrad() for sparse CUDA arrays
- 10.8.1.11. tex2DLod()
- 10.8.1.12. tex2DLod() for sparse CUDA arrays
- 10.8.1.13. tex3D()
- 10.8.1.14. tex3D() for sparse CUDA arrays
- 10.8.1.15. tex3DLod()
- 10.8.1.16. tex3DLod() for sparse CUDA arrays
- 10.8.1.17. tex3DGrad()
- 10.8.1.18. tex3DGrad() for sparse CUDA arrays
- 10.8.1.19. tex1DLayered()
- 10.8.1.20. tex1DLayeredLod()
- 10.8.1.21. tex1DLayeredGrad()
- 10.8.1.22. tex2DLayered()
- 10.8.1.23. tex2DLayered() for Sparse CUDA Arrays
- 10.8.1.24. tex2DLayeredLod()
- 10.8.1.25. tex2DLayeredLod() for sparse CUDA arrays
- 10.8.1.26. tex2DLayeredGrad()
- 10.8.1.27. tex2DLayeredGrad() for sparse CUDA arrays
- 10.8.1.28. texCubemap()
- 10.8.1.29. texCubemapGrad()
- 10.8.1.30. texCubemapLod()
- 10.8.1.31. texCubemapLayered()
- 10.8.1.32. texCubemapLayeredGrad()
- 10.8.1.33. texCubemapLayeredLod()
- 10.9. Surface Functions
- 10.9.1. Surface Object API
- 10.9.1.1. surf1Dread()
- 10.9.1.2. surf1Dwrite()
- 10.9.1.3. surf2Dread()
- 10.9.1.4. surf2Dwrite()
- 10.9.1.5. surf3Dread()
- 10.9.1.6. surf3Dwrite()
- 10.9.1.7. surf1DLayeredread()
- 10.9.1.8. surf1DLayeredwrite()
- 10.9.1.9. surf2DLayeredread()
- 10.9.1.10. surf2DLayeredwrite()
- 10.9.1.11. surfCubemapread()
- 10.9.1.12. surfCubemapwrite()
- 10.9.1.13. surfCubemapLayeredread()
- 10.9.1.14. surfCubemapLayeredwrite()
- 10.10. Read-Only Data Cache Load Function
- 10.11. Load Functions Using Cache Hints
- 10.12. Store Functions Using Cache Hints
- 10.13. Time Function
- 10.14. Atomic Functions
- 10.14.1. Arithmetic Functions
- 10.14.1.1. atomicAdd()
- 10.14.1.2. atomicSub()
- 10.14.1.3. atomicExch()
- 10.14.1.4. atomicMin()
- 10.14.1.5. atomicMax()
- 10.14.1.6. atomicInc()
- 10.14.1.7. atomicDec()
- 10.14.1.8. atomicCAS()
- 10.14.1.9. __nv_atomic_exchange()
- 10.14.1.10. __nv_atomic_exchange_n()
- 10.14.1.11. __nv_atomic_compare_exchange()
- 10.14.1.12. __nv_atomic_compare_exchange_n()
- 10.14.1.13. __nv_atomic_fetch_add() and __nv_atomic_add()
- 10.14.1.14. __nv_atomic_fetch_sub() and __nv_atomic_sub()
- 10.14.1.15. __nv_atomic_fetch_min() and __nv_atomic_min()
- 10.14.1.16. __nv_atomic_fetch_max() and __nv_atomic_max()
- 10.14.2. Bitwise Functions
- 10.14.2.1. atomicAnd()
- 10.14.2.2. atomicOr()
- 10.14.2.3. atomicXor()
- 10.14.2.4. __nv_atomic_fetch_or() and __nv_atomic_or()
- 10.14.2.5. __nv_atomic_fetch_xor() and __nv_atomic_xor()
- 10.14.2.6. __nv_atomic_fetch_and() and __nv_atomic_and()
- 10.14.3. Other atomic functions
- 10.14.3.1. __nv_atomic_load()
- 10.14.3.2. __nv_atomic_load_n()
- 10.14.3.3. __nv_atomic_store()
- 10.14.3.4. __nv_atomic_store_n()
- 10.14.3.5. __nv_atomic_thread_fence()
- 10.15. Address Space Predicate Functions
- 10.15.1. __isGlobal()
- 10.15.2. __isShared()
- 10.15.3. __isConstant()
- 10.15.4. __isGridConstant()
- 10.15.5. __isLocal()
- 10.16. Address Space Conversion Functions
- 10.16.1. __cvta_generic_to_global()
- 10.16.2. __cvta_generic_to_shared()
- 10.16.3. __cvta_generic_to_constant()
- 10.16.4. __cvta_generic_to_local()
- 10.16.5. __cvta_global_to_generic()
- 10.16.6. __cvta_shared_to_generic()
- 10.16.7. __cvta_constant_to_generic()
- 10.16.8. __cvta_local_to_generic()
- 10.17. Alloca Function
- 10.17.1. Synopsis
- 10.17.2. Description
- 10.17.3. Example
- 10.18. Compiler Optimization Hint Functions
- 10.18.1. __builtin_assume_aligned()
- 10.18.2. __builtin_assume()
- 10.18.3. __assume()
- 10.18.4. __builtin_expect()
- 10.18.5. __builtin_unreachable()
- 10.18.6. Restrictions
- 10.19. Warp Vote Functions
- 10.20. Warp Match Functions
- 10.20.1. Synopsis
- 10.20.2. Description
- 10.21. Warp Reduce Functions
- 10.21.1. Synopsis
- 10.21.2. Description
- 10.22. Warp Shuffle Functions
- 10.22.1. Synopsis
- 10.22.2. Description
- 10.22.3. Examples
- 10.22.3.1. Broadcast of a single value across a warp
- 10.22.3.2. Inclusive plus-scan across sub-partitions of 8 threads
- 10.22.3.3. Reduction across a warp
- 10.23. Nanosleep Function
- 10.23.1. Synopsis
- 10.23.2. Description
- 10.23.3. Example
- 10.24. Warp Matrix Functions
- 10.24.1. Description
- 10.24.2. Alternate Floating Point
- 10.24.3. Double Precision
- 10.24.4. Sub-byte Operations
- 10.24.5. Restrictions
- 10.24.6. Element Types and Matrix Sizes
- 10.24.7. Example
- 10.25. DPX
- 10.25.1. Examples
- 10.26. Asynchronous Barrier
- 10.26.1. Simple Synchronization Pattern
- 10.26.2. Temporal Splitting and Five Stages of Synchronization
- 10.26.3. Bootstrap Initialization, Expected Arrival Count, and Participation
- 10.26.4. A Barrier’s Phase: Arrival, Countdown, Completion, and Reset
- 10.26.5. Spatial Partitioning (also known as Warp Specialization)
- 10.26.6. Early Exit (Dropping out of Participation)
- 10.26.7. Completion Function
- 10.26.8. Memory Barrier Primitives Interface
- 10.26.8.1. Data Types
- 10.26.8.2. Memory Barrier Primitives API
- 10.27. Asynchronous Data Copies
- 10.27.1. memcpy_async API
- 10.27.2. Copy and Compute Pattern - Staging Data Through Shared Memory
- 10.27.3. Without memcpy_async
- 10.27.4. With memcpy_async
- 10.27.5. Asynchronous Data Copies using cuda::barrier
- 10.27.6. Performance Guidance for memcpy_async
- 10.27.6.1. Alignment
- 10.27.6.2. Trivially copyable
- 10.27.6.3. Warp Entanglement - Commit
- 10.27.6.4. Warp Entanglement - Wait
- 10.27.6.5. Warp Entanglement - Arrive-On
- 10.27.6.6. Keep Commit and Arrive-On Operations Converged
- 10.28. Asynchronous Data Copies using cuda::pipeline
- 10.28.1. Single-Stage Asynchronous Data Copies using cuda::pipeline
- 10.28.2. Multi-Stage Asynchronous Data Copies using cuda::pipeline
- 10.28.3. Pipeline Interface
- 10.28.4. Pipeline Primitives Interface
- 10.28.4.1. memcpy_async Primitive
- 10.28.4.2. Commit Primitive
- 10.28.4.3. Wait Primitive
- 10.28.4.4. Arrive On Barrier Primitive
- 10.29. Asynchronous Data Copies using the Tensor Memory Accelerator (TMA)
- 10.29.1. Using TMA to transfer one-dimensional arrays
- 10.29.2. Using TMA to transfer multi-dimensional arrays
- 10.29.2.1. Multi-dimensional TMA PTX wrappers
- 10.29.3. TMA Swizzle
- 10.29.3.1. Example ‘Matrix Transpose’
- 10.29.3.2. The Swizzle Modes
- 10.30. Encoding a Tensor Map on Device
- 10.30.1. Device-side Encoding and Modification of a Tensor Map
- 10.30.2. Usage of a Modified Tensor Map
- 10.30.3. Creating a Template Tensor Map Value Using the Driver API
- 10.31. Profiler Counter Function
- 10.32. Assertion
- 10.33. Trap function
- 10.34. Breakpoint Function
- 10.35. Formatted Output
- 10.35.1. Format Specifiers
- 10.35.2. Limitations
- 10.35.3. Associated Host-Side API
- 10.35.4. Examples
- 10.36. Dynamic Global Memory Allocation and Operations
- 10.36.1. Heap Memory Allocation
- 10.36.2. Interoperability with Host Memory API
- 10.36.3. Examples
- 10.36.3.1. Per Thread Allocation
- 10.36.3.2. Per Thread Block Allocation
- 10.36.3.3. Allocation Persisting Between Kernel Launches
- 10.37. Execution Configuration
- 10.38. Launch Bounds
- 10.39. Maximum Number of Registers per Thread
- 10.40. #pragma unroll
- 10.41. SIMD Video Instructions
- 10.42. Diagnostic Pragmas
- 10.43. Custom ABI Pragmas
- 10.44. CUDA C++ Memory Model
- 10.45. CUDA C++ Execution Model
- 11. Cooperative Groups
- 11.1. Introduction
- 11.2. What’s New in Cooperative Groups
- 11.2.1. CUDA 13.0
- 11.2.2. CUDA 12.2
- 11.2.3. CUDA 12.1
- 11.2.4. CUDA 12.0
- 11.3. Programming Model Concept
- 11.3.1. Composition Example
- 11.4. Group Types
- 11.4.1. Implicit Groups
- 11.4.1.1. Thread Block Group
- 11.4.1.2. Cluster Group
- 11.4.1.3. Grid Group
- 11.4.2. Explicit Groups
- 11.4.2.1. Thread Block Tile
- 11.4.2.1.1. Warp-Synchronous Code Pattern
- 11.4.2.1.2. Single Thread Group
- 11.4.2.2. Coalesced Groups
- 11.4.2.2.1. Discovery Pattern
- 11.5. Group Partitioning
- 11.5.1. tiled_partition
- 11.5.2. labeled_partition
- 11.5.3. binary_partition
- 11.6. Group Collectives
- 11.6.1. Synchronization
- 11.6.1.1. barrier_arrive and barrier_wait
- 11.6.1.2. sync
- 11.6.2. Data Transfer
- 11.6.2.1. memcpy_async
- 11.6.2.2. wait and wait_prior
- 11.6.3. Data Manipulation
- 11.6.3.1. reduce
- 11.6.3.2. Reduce Operators
- 11.6.3.3. inclusive_scan and exclusive_scan
- 11.6.4. Execution control
- 11.6.4.1. invoke_one and invoke_one_broadcast
- 11.7. Grid Synchronization
- 12. Cluster Launch Control
- 12.1. Introduction
- 12.2. Cluster Launch Control API Details
- 12.2.1. Thread block cancellation steps
- 12.2.2. Thread block cancellation constraints
- 12.2.3. Kernel Example: Vector-Scalar Multiplication
- 12.2.4. Cluster Launch Control for Thread Block Clusters
- 13. CUDA Dynamic Parallelism
- 13.1. Introduction
- 13.1.1. Overview
- 13.1.2. Glossary
- 13.2. Execution Environment and Memory Model
- 13.2.1. Execution Environment
- 13.2.1.1. Parent and Child Grids
- 13.2.1.2. Scope of CUDA Primitives
- 13.2.1.3. Synchronization
- 13.2.1.4. Streams and Events
- 13.2.1.5. Ordering and Concurrency
- 13.2.1.6. Device Management
- 13.2.2. Memory Model
- 13.2.2.1. Coherence and Consistency
- 13.2.2.1.1. Global Memory
- 13.2.2.1.2. Zero Copy Memory
- 13.2.2.1.3. Constant Memory
- 13.2.2.1.4. Shared and Local Memory
- 13.2.2.1.5. Local Memory
- 13.2.2.1.6. Texture Memory
- 13.3. Programming Interface
- 13.3.1. CUDA C++ Reference
- 13.3.1.1. Device-Side Kernel Launch
- 13.3.1.1.1. Launches are Asynchronous
- 13.3.1.1.2. Launch Environment Configuration
- 13.3.1.2. Streams
- 13.3.1.2.1. The Implicit (NULL) Stream
- 13.3.1.2.2. The Fire-and-Forget Stream
- 13.3.1.2.3. The Tail Launch Stream
- 13.3.1.3. Events
- 13.3.1.4. Synchronization
- 13.3.1.5. Device Management
- 13.3.1.6. Memory Declarations
- 13.3.1.6.1. Device and Constant Memory
- 13.3.1.6.2. Textures and Surfaces
- 13.3.1.6.3. Shared Memory Variable Declarations
- 13.3.1.6.4. Symbol Addresses
- 13.3.1.7. API Errors and Launch Failures
- 13.3.1.7.1. Launch Setup APIs
- 13.3.1.8. API Reference
- 13.3.2. Device-side Launch from PTX
- 13.3.2.1. Kernel Launch APIs
- 13.3.2.1.1. cudaLaunchDevice
- 13.3.2.1.2. cudaGetParameterBuffer
- 13.3.2.2. Parameter Buffer Layout
- 13.3.3. Toolkit Support for Dynamic Parallelism
- 13.3.3.1. Including Device Runtime API in CUDA Code
- 13.3.3.2. Compiling and Linking
- 13.4. Programming Guidelines
- 13.4.1. Basics
- 13.4.2. Performance
- 13.4.2.1. Dynamic-parallelism-enabled Kernel Overhead
- 13.4.3. Implementation Restrictions and Limitations
- 13.4.3.1. Runtime
- 13.4.3.1.1. Memory Footprint
- 13.4.3.1.2. Pending Kernel Launches
- 13.4.3.1.3. Configuration Options
- 13.4.3.1.4. Memory Allocation and Lifetime
- 13.4.3.1.5. SM Id and Warp Id
- 13.4.3.1.6. ECC Errors
- 13.5. CDP2 vs CDP1
- 13.5.1. Differences Between CDP1 and CDP2
- 13.5.2. Compatibility and Interoperability
- 13.6. Legacy CUDA Dynamic Parallelism (CDP1)
- 13.6.1. Execution Environment and Memory Model (CDP1)
- 13.6.1.1. Execution Environment (CDP1)
- 13.6.1.1.1. Parent and Child Grids (CDP1)
- 13.6.1.1.2. Scope of CUDA Primitives (CDP1)
- 13.6.1.1.3. Synchronization (CDP1)
- 13.6.1.1.4. Streams and Events (CDP1)
- 13.6.1.1.5. Ordering and Concurrency (CDP1)
- 13.6.1.1.6. Device Management (CDP1)
- 13.6.1.2. Memory Model (CDP1)
- 13.6.1.2.1. Coherence and Consistency (CDP1)
- 13.6.1.2.1.1. Global Memory (CDP1)
- 13.6.1.2.1.2. Zero Copy Memory (CDP1)
- 13.6.1.2.1.3. Constant Memory (CDP1)
- 13.6.1.2.1.4. Shared and Local Memory (CDP1)
- 13.6.1.2.1.5. Local Memory (CDP1)
- 13.6.1.2.1.6. Texture Memory (CDP1)
- 13.6.2. Programming Interface (CDP1)
- 13.6.2.1. CUDA C++ Reference (CDP1)
- 13.6.2.1.1. Device-Side Kernel Launch (CDP1)
- 13.6.2.1.1.1. Launches are Asynchronous (CDP1)
- 13.6.2.1.1.2. Launch Environment Configuration (CDP1)
- 13.6.2.1.2. Streams (CDP1)
- 13.6.2.1.2.1. The Implicit (NULL) Stream (CDP1)
- 13.6.2.1.3. Events (CDP1)
- 13.6.2.1.4. Synchronization (CDP1)
- 13.6.2.1.4.1. Block Wide Synchronization (CDP1)
- 13.6.2.1.5. Device Management (CDP1)
- 13.6.2.1.6. Memory Declarations (CDP1)
- 13.6.2.1.6.1. Device and Constant Memory (CDP1)
- 13.6.2.1.6.2. Textures and Surfaces (CDP1)
- 13.6.2.1.6.3. Shared Memory Variable Declarations (CDP1)
- 13.6.2.1.6.4. Symbol Addresses (CDP1)
- 13.6.2.1.7. API Errors and Launch Failures (CDP1)
- 13.6.2.1.7.1. Launch Setup APIs (CDP1)
- 13.6.2.1.8. API Reference (CDP1)
- 13.6.2.2. Device-side Launch from PTX (CDP1)
- 13.6.2.2.1. Kernel Launch APIs (CDP1)
- 13.6.2.2.1.1. cudaLaunchDevice (CDP1)
- 13.6.2.2.1.2. cudaGetParameterBuffer (CDP1)
- 13.6.2.2.2. Parameter Buffer Layout (CDP1)
- 13.6.2.3. Toolkit Support for Dynamic Parallelism (CDP1)
- 13.6.2.3.1. Including Device Runtime API in CUDA Code (CDP1)
- 13.6.2.3.2. Compiling and Linking (CDP1)
- 13.6.3. Programming Guidelines (CDP1)
- 13.6.3.1. Basics (CDP1)
- 13.6.3.2. Performance (CDP1)
- 13.6.3.2.1. Synchronization (CDP1)
- 13.6.3.2.2. Dynamic-parallelism-enabled Kernel Overhead (CDP1)
- 13.6.3.3. Implementation Restrictions and Limitations (CDP1)
- 13.6.3.3.1. Runtime (CDP1)
- 13.6.3.3.1.1. Memory Footprint (CDP1)
- 13.6.3.3.1.2. Nesting and Synchronization Depth (CDP1)
- 13.6.3.3.1.3. Pending Kernel Launches (CDP1)
- 13.6.3.3.1.4. Configuration Options (CDP1)
- 13.6.3.3.1.5. Memory Allocation and Lifetime (CDP1)
- 13.6.3.3.1.6. SM Id and Warp Id (CDP1)
- 13.6.3.3.1.7. ECC Errors (CDP1)
- 14. Virtual Memory Management
- 14.1. Introduction
- 14.2. Query for Support
- 14.3. Allocating Physical Memory
- 14.3.1. Shareable Memory Allocations
- 14.3.2. Memory Type
- 14.3.2.1. Compressible Memory
- 14.4. Reserving a Virtual Address Range
- 14.5. Virtual Aliasing Support
- 14.6. Mapping Memory
- 14.7. Controlling Access Rights
- 14.8. Fabric Memory
- 14.8.1. Query for Support
- 14.9. Multicast Support
- 14.9.1. Query for Support
- 14.9.2. Allocating Multicast Objects
- 14.9.3. Add Devices to Multicast Objects
- 14.9.4. Bind Memory to Multicast Objects
- 14.9.5. Use Multicast Mappings
- 15. Stream Ordered Memory Allocator
- 15.1. Introduction
- 15.2. Query for Support
- 15.3. API Fundamentals (cudaMallocAsync and cudaFreeAsync)
- 15.4. Memory Pools and the cudaMemPool_t
- 15.5. Default/Implicit Pools
- 15.6. Explicit Pools
- 15.7. Physical Page Caching Behavior
- 15.8. Resource Usage Statistics
- 15.9. Memory Reuse Policies
- 15.9.1. cudaMemPoolReuseFollowEventDependencies
- 15.9.2. cudaMemPoolReuseAllowOpportunistic
- 15.9.3. cudaMemPoolReuseAllowInternalDependencies
- 15.9.4. Disabling Reuse Policies
- 15.10. Device Accessibility for Multi-GPU Support
- 15.11. IPC Memory Pools
- 15.11.1. Creating and Sharing IPC Memory Pools
- 15.11.2. Set Access in the Importing Process
- 15.11.3. Creating and Sharing Allocations from an Exported Pool
- 15.11.4. IPC Export Pool Limitations
- 15.11.5. IPC Import Pool Limitations
- 15.12. Synchronization API Actions
- 15.13. Addendums
- 15.13.1. cudaMemcpyAsync Current Context/Device Sensitivity
- 15.13.2. cuPointerGetAttribute Query
- 15.13.3. cuGraphAddMemsetNode
- 15.13.4. Pointer Attributes
- 15.13.5. CPU Virtual Memory
- 16. Graph Memory Nodes
- 16.1. Introduction
- 16.2. Support and Compatibility
- 16.3. API Fundamentals
- 16.3.1. Graph Node APIs
- 16.3.2. Stream Capture
- 16.3.3. Accessing and Freeing Graph Memory Outside of the Allocating Graph
- 16.3.4. cudaGraphInstantiateFlagAutoFreeOnLaunch
- 16.4. Optimized Memory Reuse
- 16.4.1. Address Reuse within a Graph
- 16.4.2. Physical Memory Management and Sharing
- 16.5. Performance Considerations
- 16.5.1. First Launch / cudaGraphUpload
- 16.6. Physical Memory Footprint
- 16.7. Peer Access
- 16.7.1. Peer Access with Graph Node APIs
- 16.7.2. Peer Access with Stream Capture
- 16.8. Memory Nodes in Child Graphs
- 17. Mathematical Functions
- 17.1. Standard Functions
- 17.2. Intrinsic Functions
- 18. C++ Language Support
- 18.1. C++11 Language Features
- 18.2. C++14 Language Features
- 18.3. C++17 Language Features
- 18.4. C++20 Language Features
- 18.5. Restrictions
- 18.5.1. Host Compiler Extensions
- 18.5.2. Preprocessor Symbols
- 18.5.2.1. __CUDA_ARCH__
- 18.5.3. Qualifiers
- 18.5.3.1. Device Memory Space Specifiers
- 18.5.3.2. __managed__ Memory Space Specifier
- 18.5.3.3. Volatile Qualifier
- 18.5.4. Pointers
- 18.5.5. Operators
- 18.5.5.1. Assignment Operator
- 18.5.5.2. Address Operator
- 18.5.6. Run Time Type Information (RTTI)
- 18.5.7. Exception Handling
- 18.5.8. Standard Library
- 18.5.9. Namespace Reservations
- 18.5.10. Functions
- 18.5.10.1. External Linkage
- 18.5.10.2. Implicitly-declared and non-virtual explicitly-defaulted functions
- 18.5.10.3. Function Parameters
- 18.5.10.3.1. __global__ Function Argument Processing
- 18.5.10.3.2. Toolkit and Driver Compatibility
- 18.5.10.3.3. Link Compatibility across Toolkit Revisions
- 18.5.10.4. Static Variables within Function
- 18.5.10.5. Function Pointers
- 18.5.10.6. Function Recursion
- 18.5.10.7. Friend Functions
- 18.5.10.8. Operator Function
- 18.5.10.9. Allocation and Deallocation Functions
- 18.5.11. Classes
- 18.5.11.1. Data Members
- 18.5.11.2. Function Members
- 18.5.11.3. Virtual Functions
- 18.5.11.4. Virtual Base Classes
- 18.5.11.5. Anonymous Unions
- 18.5.11.6. Windows-Specific
- 18.5.12. Templates
- 18.5.13. Trigraphs and Digraphs
- 18.5.14. Const-qualified variables
- 18.5.15. Long Double
- 18.5.16. Deprecation Annotation
- 18.5.17. Noreturn Annotation
- 18.5.18. [[likely]] / [[unlikely]] Standard Attributes
- 18.5.19. const and pure GNU Attributes
- 18.5.20. __nv_pure__ Attribute
- 18.5.21. Intel Host Compiler Specific
- 18.5.22. C++11 Features
- 18.5.22.1. Lambda Expressions
- 18.5.22.2. std::initializer_list
- 18.5.22.3. Rvalue references
- 18.5.22.4. Constexpr functions and function templates
- 18.5.22.5. Constexpr variables
- 18.5.22.6. Inline namespaces
- 18.5.22.6.1. Inline unnamed namespaces
- 18.5.22.7. thread_local
- 18.5.22.8. __global__ functions and function templates
- 18.5.22.9. __managed__ and __shared__ variables
- 18.5.22.10. Defaulted functions
- 18.5.23. C++14 Features
- 18.5.23.1. Functions with deduced return type
- 18.5.23.2. Variable templates
- 18.5.24. C++17 Features
- 18.5.24.1. Inline Variable
- 18.5.24.2. Structured Binding
- 18.5.25. C++20 Features
- 18.5.25.1. Module support
- 18.5.25.2. Coroutine support
- 18.5.25.3. Three-way comparison operator
- 18.5.25.4. Consteval functions
- 18.6. Polymorphic Function Wrappers
- 18.7. Extended Lambdas
- 18.7.1. Extended Lambda Type Traits
- 18.7.2. Extended Lambda Restrictions
- 18.7.3. Notes on __host__ __device__ lambdas
- 18.7.4. *this Capture By Value
- 18.7.5. Additional Notes
- 18.8. Relaxed Constexpr (--expt-relaxed-constexpr)
- 18.9. Code Samples
- 18.9.1. Data Aggregation Class
- 18.9.2. Derived Class
- 18.9.3. Class Template
- 18.9.4. Function Template
- 18.9.5. Functor Class
- 19. Texture Fetching
- 19.1. Nearest-Point Sampling
- 19.2. Linear Filtering
- 19.3. Table Lookup
- 20. Compute Capabilities
- 20.1. Feature Availability
- 20.1.1. Architecture-Specific Features
- 20.1.2. Family-Specific Features
- 20.1.3. Feature Set Compiler Targets
- 20.2. Features and Technical Specifications
- 20.3. Floating-Point Standard
- 20.4. Compute Capability 5.x
- 20.4.1. Architecture
- 20.4.2. Global Memory
- 20.4.3. Shared Memory
- 20.5. Compute Capability 6.x
- 20.5.1. Architecture
- 20.5.2. Global Memory
- 20.5.3. Shared Memory
- 20.6. Compute Capability 7.x
- 20.6.1. Architecture
- 20.6.2. Independent Thread Scheduling
- 20.6.3. Global Memory
- 20.6.4. Shared Memory
- 20.7. Compute Capability 8.x
- 20.7.1. Architecture
- 20.7.2. Global Memory
- 20.7.3. Shared Memory
- 20.8. Compute Capability 9.0
- 20.8.1. Architecture
- 20.8.2. Global Memory
- 20.8.3. Shared Memory
- 20.8.4. Features Accelerating Specialized Computations
- 20.9. Compute Capability 10.0
- 20.9.1. Architecture
- 20.9.2. Global Memory
- 20.9.3. Shared Memory
- 20.9.4. Features Accelerating Specialized Computations
- 20.10. Compute Capability 12.0
- 20.10.1. Architecture
- 20.10.2. Global Memory
- 20.10.3. Shared Memory
- 20.10.4. Features Accelerating Specialized Computations
- 21. Driver API
- 21.1. Context
- 21.2. Module
- 21.3. Kernel Execution
- 21.4. Interoperability between Runtime and Driver APIs
- 21.5. Driver Entry Point Access
- 21.5.1. Introduction
- 21.5.2. Driver Function Typedefs
- 21.5.3. Driver Function Retrieval
- 21.5.3.1. Using the Driver API
- 21.5.3.2. Using the Runtime API
- 21.5.3.3. Retrieve Per-thread Default Stream Versions
- 21.5.3.4. Access New CUDA features
- 21.5.4. Guidelines for cuGetProcAddress
- 21.5.4.1. Guidelines for Runtime API Usage
- 21.5.5. Determining cuGetProcAddress Failure Reasons
- 22. CUDA Environment Variables
- 23. Error Log Management
- 23.1. Background
- 23.2. Activation
- 23.3. Output
- 23.4. API Description
- 23.5. Limitations and Known Issues
- 24. Unified Memory Programming
- 24.1. Unified Memory Introduction
- 24.1.1. System Requirements for Unified Memory
- 24.1.2. Programming Model
- 24.1.2.1. Allocation APIs for System-Allocated Memory
- 24.1.2.2. Allocation API for CUDA Managed Memory: cudaMallocManaged()
- 24.1.2.3. Global-Scope Managed Variables Using __managed__
- 24.1.2.4. Difference between Unified Memory and Mapped Memory
- 24.1.2.5. Pointer Attributes
- 24.1.2.6. Runtime detection of Unified Memory Support Level
- 24.1.2.7. GPU Memory Oversubscription
- 24.1.2.8. Performance Hints
- 24.1.2.8.1. Data Prefetching
- 24.1.2.8.2. Data Usage Hints
- 24.1.2.8.3. Querying Data Usage Attributes on Managed Memory
- 24.2. Unified memory on devices with full CUDA Unified Memory support
- 24.2.1. System-Allocated Memory: in-depth examples
- 24.2.1.1. File-backed Unified Memory
- 24.2.1.2. Inter-Process Communication (IPC) with Unified Memory
- 24.2.2. Performance Tuning
- 24.2.2.1. Memory Paging and Page Sizes
- 24.2.2.1.1. Choosing the right page size
- 24.2.2.1.2. CPU and GPU page tables: hardware coherency vs. software coherency
- 24.2.2.2. Direct Unified Memory Access from host
- 24.2.2.3. Host Native Atomics
- 24.2.2.4. Atomic accesses & synchronization primitives
- 24.2.2.5. Memcpy()/Memset() Behavior With Unified Memory
- 24.3. Unified memory on devices without full CUDA Unified Memory support
- 24.3.1. Unified memory on devices with only CUDA Managed Memory support
- 24.3.2. Unified memory on Windows or devices with compute capability 5.x
- 24.3.2.1. Data Migration and Coherency
- 24.3.2.2. GPU Memory Oversubscription
- 24.3.2.3. Multi-GPU
- 24.3.2.4. Coherency and Concurrency
- 24.3.2.4.1. GPU Exclusive Access To Managed Memory
- 24.3.2.4.2. Explicit Synchronization and Logical GPU Activity
- 24.3.2.4.3. Managing Data Visibility and Concurrent CPU + GPU Access with Streams
- 24.3.2.4.4. Stream Association Examples
- 24.3.2.4.5. Stream Attach With Multithreaded Host Programs
- 24.3.2.4.6. Advanced Topic: Modular Programs and Data Access Constraints
- 24.3.2.4.7. Memcpy()/Memset() Behavior With Stream-associated Unified Memory
- 25. Lazy Loading
- 25.1. What is Lazy Loading?
- 25.2. Lazy Loading version support
- 25.2.1. Driver
- 25.2.2. Toolkit
- 25.2.3. Compiler
- 25.3. Triggering loading of kernels in lazy mode
- 25.3.1. CUDA Driver API
- 25.3.2. CUDA Runtime API
- 25.4. Querying whether Lazy Loading is Turned On
- 25.5. Possible Issues when Adopting Lazy Loading
- 25.5.1. Concurrent Execution
- 25.5.2. Allocators
- 25.5.3. Autotuning
- 26. Extended GPU Memory
- 26.1. Preliminaries
- 26.1.1. EGM Platforms: System topology
- 26.1.2. Socket Identifiers: What are they? How to access them?
- 26.1.3. Allocators and EGM support
- 26.1.4. Memory management extensions to current APIs
- 26.2. Using the EGM Interface
- 26.2.1. Single-Node, Single-GPU
- 26.2.2. Single-Node, Multi-GPU
- 26.2.2.1. Using VMM APIs
- 26.2.2.2. Using CUDA Memory Pool
- 26.2.3. Multi-Node, Single-GPU
- 27. Notices
- 27.1. Notice
- 27.2. OpenCL
- 27.3. Trademarks