Contents — CUDA C++ Programming Guide
URL Source: https://docs.nvidia.com/cuda/archive/13.0.2/cuda-c-programming-guide/contents.html
Published Time: Thu, 04 Dec 2025 03:37:46 GMT
Contents
- 24. Unified Memory Programming
- 24.1. Unified Memory Introduction
- 24.1.1. System Requirements for Unified Memory
- 24.1.2. Programming Model
- 24.1.2.1. Allocation APIs for System-Allocated Memory
- 24.1.2.2. Allocation API for CUDA Managed Memory: cudaMallocManaged()
- 24.1.2.3. Global-Scope Managed Variables Using __managed__
- 24.1.2.4. Difference between Unified Memory and Mapped Memory
- 24.1.2.5. Pointer Attributes
- 24.1.2.6. Runtime detection of Unified Memory Support Level
- 24.1.2.7. GPU Memory Oversubscription
- 24.1.2.8. Performance Hints
- 24.2. Unified memory on devices with full CUDA Unified Memory support
- 24.3. Unified memory on devices without full CUDA Unified Memory support
- 24.3.1. Unified memory on devices with only CUDA Managed Memory support
- 24.3.2. Unified memory on Windows or devices with compute capability 5.x
- 24.3.2.1. Data Migration and Coherency
- 24.3.2.2. GPU Memory Oversubscription
- 24.3.2.3. Multi-GPU
- 24.3.2.4. Coherency and Concurrency
- 24.3.2.4.1. GPU Exclusive Access To Managed Memory
- 24.3.2.4.2. Explicit Synchronization and Logical GPU Activity
- 24.3.2.4.3. Managing Data Visibility and Concurrent CPU + GPU Access with Streams
- 24.3.2.4.4. Stream Association Examples
- 24.3.2.4.5. Stream Attach With Multithreaded Host Programs
- 24.3.2.4.6. Advanced Topic: Modular Programs and Data Access Constraints
- 24.3.2.4.7. Memcpy()/Memset() Behavior With Stream-associated Unified Memory
Copyright © 2007-2025, NVIDIA Corporation & affiliates. All rights reserved.
Last updated on Nov 02, 2025.
- 1. Overview
- 2. What Is the CUDA C Programming Guide?
- 3. Introduction
- 3.1. The Benefits of Using GPUs
- 3.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model
- 3.3. A Scalable Programming Model
- 4. Changelog
- 5. Programming Model
- 5.1. Kernels
- 5.2. Thread Hierarchy
- 5.2.1. Thread Block Clusters
- 5.2.2. Blocks as Clusters
- 5.3. Memory Hierarchy
- 5.4. Heterogeneous Programming
- 5.5. Asynchronous SIMT Programming Model
- 5.5.1. Asynchronous Operations
- 5.6. Compute Capability
- 6. Programming Interface
- 6.1. Compilation with NVCC
- 6.1.1. Compilation Workflow
- 6.1.1.1. Offline Compilation
- 6.1.1.2. Just-in-Time Compilation
- 6.1.2. Binary Compatibility
- 6.1.3. PTX Compatibility
- 6.1.4. Application Compatibility
- 6.1.5. C++ Compatibility
- 6.1.6. 64-Bit Compatibility
- 6.2. CUDA Runtime
- 6.2.1. Initialization
- 6.2.2. Device Memory
- 6.2.3. Device Memory L2 Access Management
- 6.2.3.1. L2 Cache Set-Aside for Persisting Accesses
- 6.2.3.2. L2 Policy for Persisting Accesses
- 6.2.3.3. L2 Access Properties
- 6.2.3.4. L2 Persistence Example
- 6.2.3.5. Reset L2 Access to Normal
- 6.2.3.6. Manage Utilization of L2 Set-Aside Cache
- 6.2.3.7. Query L2 Cache Properties
- 6.2.3.8. Control L2 Cache Set-Aside Size for Persisting Memory Access
- 6.2.4. Shared Memory
- 6.2.5. Distributed Shared Memory
- 6.2.6. Page-Locked Host Memory
- 6.2.6.1. Portable Memory
- 6.2.6.2. Write-Combining Memory
- 6.2.6.3. Mapped Memory
- 6.2.7. Memory Synchronization Domains
- 6.2.7.1. Memory Fence Interference
- 6.2.7.2. Isolating Traffic with Domains
- 6.2.7.3. Using Domains in CUDA
- 6.2.8. Asynchronous Concurrent Execution
- 6.2.8.1. Concurrent Execution between Host and Device
- 6.2.8.2. Concurrent Kernel Execution
- 6.2.8.3. Overlap of Data Transfer and Kernel Execution
- 6.2.8.4. Concurrent Data Transfers
- 6.2.8.5. Streams
- 6.2.8.5.1. Creation and Destruction of Streams
- 6.2.8.5.2. Default Stream
- 6.2.8.5.3. Explicit Synchronization
- 6.2.8.5.4. Implicit Synchronization
- 6.2.8.5.5. Overlapping Behavior
- 6.2.8.5.6. Host Functions (Callbacks)
- 6.2.8.5.7. Stream Priorities
- 6.2.8.6. Programmatic Dependent Launch and Synchronization
- 6.2.8.6.1. Background
- 6.2.8.6.2. API Description
- 6.2.8.6.3. Use in CUDA Graphs
- 6.2.8.7. CUDA Graphs
- 6.2.8.7.1. Graph Structure
- 6.2.8.7.1.1. Node Types
- 6.2.8.7.1.2. Edge Data
- 6.2.8.7.2. Creating a Graph Using Graph APIs
- 6.2.8.7.3. Creating a Graph Using Stream Capture
- 6.2.8.7.3.1. Cross-stream Dependencies and Events
- 6.2.8.7.3.2. Prohibited and Unhandled Operations
- 6.2.8.7.3.3. Invalidation
- 6.2.8.7.4. CUDA User Objects
- 6.2.8.7.5. Updating Instantiated Graphs
- 6.2.8.7.5.1. Graph Update Limitations
- 6.2.8.7.5.2. Whole Graph Update
- 6.2.8.7.5.3. Individual Node Update
- 6.2.8.7.5.4. Individual Node Enable
- 6.2.8.7.6. Using Graph APIs
- 6.2.8.7.7. Device Graph Launch
- 6.2.8.7.7.1. Device Graph Creation
- 6.2.8.7.7.1.1. Device Graph Requirements
- 6.2.8.7.7.1.2. Device Graph Upload
- 6.2.8.7.7.1.3. Device Graph Update
- 6.2.8.7.7.2. Device Launch
- 6.2.8.7.7.2.1. Device Launch Modes
- 6.2.8.7.7.2.1.1. Fire and Forget Launch
- 6.2.8.7.7.2.1.2. Graph Execution Environments
- 6.2.8.7.7.2.1.3. Tail Launch
- 6.2.8.7.7.2.1.3.1. Tail Self-launch
- 6.2.8.7.7.2.1.4. Sibling Launch
- 6.2.8.7.8. Conditional Graph Nodes
- 6.2.8.7.8.1. Conditional Handles
- 6.2.8.7.8.2. Conditional Node Body Graph Requirements
- 6.2.8.7.8.3. Conditional IF Nodes
- 6.2.8.7.8.4. Conditional WHILE Nodes
- 6.2.8.7.8.5. Conditional SWITCH Nodes
- 6.2.8.8. Events
- 6.2.8.8.1. Creation and Destruction of Events
- 6.2.8.8.2. Elapsed Time
- 6.2.8.9. Synchronous Calls
- 6.2.9. Multi-Device System
- 6.2.9.1. Device Enumeration
- 6.2.9.2. Device Selection
- 6.2.9.3. Stream and Event Behavior
- 6.2.9.4. Peer-to-Peer Memory Access
- 6.2.9.4.1. IOMMU on Linux
- 6.2.9.5. Peer-to-Peer Memory Copy
- 6.2.10. Unified Virtual Address Space
- 6.2.11. Interprocess Communication
- 6.2.12. Error Checking
- 6.2.13. Call Stack
- 6.2.14. Texture and Surface Memory
- 6.2.14.1. Texture Memory
- 6.2.14.1.1. Texture Object API
- 6.2.14.1.2. 16-Bit Floating-Point Textures
- 6.2.14.1.3. Layered Textures
- 6.2.14.1.4. Cubemap Textures
- 6.2.14.1.5. Cubemap Layered Textures
- 6.2.14.1.6. Texture Gather
- 6.2.14.2. Surface Memory
- 6.2.14.2.1. Surface Object API
- 6.2.14.2.2. Cubemap Surfaces
- 6.2.14.2.3. Cubemap Layered Surfaces
- 6.2.14.3. CUDA Arrays
- 6.2.14.4. Read/Write Coherency
- 6.2.15. Graphics Interoperability
- 6.2.15.1. OpenGL Interoperability
- 6.2.15.2. Direct3D Interoperability
- 6.2.15.2.1. Direct3D 9 Version
- 6.2.15.2.2. Direct3D 10 Version
- 6.2.15.2.3. Direct3D 11 Version
- 6.2.15.3. SLI Interoperability
- 6.2.16. External Resource Interoperability
- 6.2.16.1. Vulkan Interoperability
- 6.2.16.1.1. Matching Device UUIDs
- 6.2.16.1.2. Importing Memory Objects
- 6.2.16.1.3. Mapping Buffers onto Imported Memory Objects
- 6.2.16.1.4. Mapping Mipmapped Arrays onto Imported Memory Objects
- 6.2.16.1.5. Importing Synchronization Objects
- 6.2.16.1.6. Signaling/Waiting on Imported Synchronization Objects
- 6.2.16.2. OpenGL Interoperability
- 6.2.16.3. Direct3D 12 Interoperability
- 6.2.16.3.1. Matching Device LUIDs
- 6.2.16.3.2. Importing Memory Objects
- 6.2.16.3.3. Mapping Buffers onto Imported Memory Objects
- 6.2.16.3.4. Mapping Mipmapped Arrays onto Imported Memory Objects
- 6.2.16.3.5. Importing Synchronization Objects
- 6.2.16.3.6. Signaling/Waiting on Imported Synchronization Objects
- 6.2.16.4. Direct3D 11 Interoperability
- 6.2.16.4.1. Matching Device LUIDs
- 6.2.16.4.2. Importing Memory Objects
- 6.2.16.4.3. Mapping Buffers onto Imported Memory Objects
- 6.2.16.4.4. Mapping Mipmapped Arrays onto Imported Memory Objects
- 6.2.16.4.5. Importing Synchronization Objects
- 6.2.16.4.6. Signaling/Waiting on Imported Synchronization Objects
- 6.2.16.5. NVIDIA Software Communication Interface Interoperability (NVSCI)
- 6.2.16.5.1. Importing Memory Objects
- 6.2.16.5.2. Mapping Buffers onto Imported Memory Objects
- 6.2.16.5.3. Mapping Mipmapped Arrays onto Imported Memory Objects
- 6.2.16.5.4. Importing Synchronization Objects
- 6.2.16.5.5. Signaling/Waiting on Imported Synchronization Objects
- 6.3. Versioning and Compatibility
- 6.4. Compute Modes
- 6.5. Mode Switches
- 6.6. Tesla Compute Cluster Mode for Windows
- 7. Hardware Implementation
- 7.1. SIMT Architecture
- 7.2. Hardware Multithreading
- 8. Performance Guidelines
- 8.1. Overall Performance Optimization Strategies
- 8.2. Maximize Utilization
- 8.2.1. Application Level
- 8.2.2. Device Level
- 8.2.3. Multiprocessor Level
- 8.2.3.1. Occupancy Calculator
- 8.3. Maximize Memory Throughput
- 8.3.1. Data Transfer between Host and Device
- 8.3.2. Device Memory Accesses
- 8.4. Maximize Instruction Throughput
- 8.5. Minimize Memory Thrashing
- 9. CUDA-Enabled GPUs
- 10. C++ Language Extensions
- 10.1. Function Execution Space Specifiers
- 10.1.1. __global__
- 10.1.2. __device__
- 10.1.3. __host__
- 10.1.4. Undefined behavior
- 10.1.5. __noinline__ and __forceinline__
- 10.1.6. __inline_hint__
- 10.2. Variable Memory Space Specifiers
- 10.2.1. __device__
- 10.2.2. __constant__
- 10.2.3. __shared__
- 10.2.4. __grid_constant__
- 10.2.5. __managed__
- 10.2.6. __restrict__
- 10.3. Built-in Vector Types
- 10.3.1. char, short, int, long, longlong, float, double
- 10.3.2. dim3
- 10.4. Built-in Variables
- 10.4.1. gridDim
- 10.4.2. blockIdx
- 10.4.3. blockDim
- 10.4.4. threadIdx
- 10.4.5. warpSize
- 10.5. Memory Fence Functions
- 10.6. Synchronization Functions
- 10.7. Mathematical Functions
- 10.8. Texture Functions
- 10.8.1. Texture Object API
- 10.8.1.1. tex1Dfetch()
- 10.8.1.2. tex1D()
- 10.8.1.3. tex1DLod()
- 10.8.1.4. tex1DGrad()
- 10.8.1.5. tex2D()
- 10.8.1.6. tex2D() for sparse CUDA arrays
- 10.8.1.7. tex2Dgather()
- 10.8.1.8. tex2Dgather() for sparse CUDA arrays
- 10.8.1.9. tex2DGrad()
- 10.8.1.10. tex2DGrad() for sparse CUDA arrays
- 10.8.1.11. tex2DLod()
- 10.8.1.12. tex2DLod() for sparse CUDA arrays
- 10.8.1.13. tex3D()
- 10.8.1.14. tex3D() for sparse CUDA arrays
- 10.8.1.15. tex3DLod()
- 10.8.1.16. tex3DLod() for sparse CUDA arrays
- 10.8.1.17. tex3DGrad()
- 10.8.1.18. tex3DGrad() for sparse CUDA arrays
- 10.8.1.19. tex1DLayered()
- 10.8.1.20. tex1DLayeredLod()
- 10.8.1.21. tex1DLayeredGrad()
- 10.8.1.22. tex2DLayered()
- 10.8.1.23. tex2DLayered() for Sparse CUDA Arrays
- 10.8.1.24. tex2DLayeredLod()
- 10.8.1.25. tex2DLayeredLod() for sparse CUDA arrays
- 10.8.1.26. tex2DLayeredGrad()
- 10.8.1.27. tex2DLayeredGrad() for sparse CUDA arrays
- 10.8.1.28. texCubemap()
- 10.8.1.29. texCubemapGrad()
- 10.8.1.30. texCubemapLod()
- 10.8.1.31. texCubemapLayered()
- 10.8.1.32. texCubemapLayeredGrad()
- 10.8.1.33. texCubemapLayeredLod()
- 10.9. Surface Functions
- 10.9.1. Surface Object API
- 10.9.1.1. surf1Dread()
- 10.9.1.2. surf1Dwrite()
- 10.9.1.3. surf2Dread()
- 10.9.1.4. surf2Dwrite()
- 10.9.1.5. surf3Dread()
- 10.9.1.6. surf3Dwrite()
- 10.9.1.7. surf1DLayeredread()
- 10.9.1.8. surf1DLayeredwrite()
- 10.9.1.9. surf2DLayeredread()
- 10.9.1.10. surf2DLayeredwrite()
- 10.9.1.11. surfCubemapread()
- 10.9.1.12. surfCubemapwrite()
- 10.9.1.13. surfCubemapLayeredread()
- 10.9.1.14. surfCubemapLayeredwrite()
- 10.10. Read-Only Data Cache Load Function
- 10.11. Load Functions Using Cache Hints
- 10.12. Store Functions Using Cache Hints
- 10.13. Time Function
- 10.14. Atomic Functions
- 10.14.1. Arithmetic Functions
- 10.14.1.1. atomicAdd()
- 10.14.1.2. atomicSub()
- 10.14.1.3. atomicExch()
- 10.14.1.4. atomicMin()
- 10.14.1.5. atomicMax()
- 10.14.1.6. atomicInc()
- 10.14.1.7. atomicDec()
- 10.14.1.8. atomicCAS()
- 10.14.1.9. __nv_atomic_exchange()
- 10.14.1.10. __nv_atomic_exchange_n()
- 10.14.1.11. __nv_atomic_compare_exchange()
- 10.14.1.12. __nv_atomic_compare_exchange_n()
- 10.14.1.13. __nv_atomic_fetch_add() and __nv_atomic_add()
- 10.14.1.14. __nv_atomic_fetch_sub() and __nv_atomic_sub()
- 10.14.1.15. __nv_atomic_fetch_min() and __nv_atomic_min()
- 10.14.1.16. __nv_atomic_fetch_max() and __nv_atomic_max()
- 10.14.2. Bitwise Functions
- 10.14.2.1. atomicAnd()
- 10.14.2.2. atomicOr()
- 10.14.2.3. atomicXor()
- 10.14.2.4. __nv_atomic_fetch_or() and __nv_atomic_or()
- 10.14.2.5. __nv_atomic_fetch_xor() and __nv_atomic_xor()
- 10.14.2.6. __nv_atomic_fetch_and() and __nv_atomic_and()
- 10.14.3. Other atomic functions
- 10.14.3.1. __nv_atomic_load()
- 10.14.3.2. __nv_atomic_load_n()
- 10.14.3.3. __nv_atomic_store()
- 10.14.3.4. __nv_atomic_store_n()
- 10.14.3.5. __nv_atomic_thread_fence()
- 10.15. Address Space Predicate Functions
- 10.15.1. __isGlobal()
- 10.15.2. __isShared()
- 10.15.3. __isConstant()
- 10.15.4. __isGridConstant()
- 10.15.5. __isLocal()
- 10.16. Address Space Conversion Functions
- 10.16.1. __cvta_generic_to_global()
- 10.16.2. __cvta_generic_to_shared()
- 10.16.3. __cvta_generic_to_constant()
- 10.16.4. __cvta_generic_to_local()
- 10.16.5. __cvta_global_to_generic()
- 10.16.6. __cvta_shared_to_generic()
- 10.16.7. __cvta_constant_to_generic()
- 10.16.8. __cvta_local_to_generic()
- 10.17. Alloca Function
- 10.17.1. Synopsis
- 10.17.2. Description
- 10.17.3. Example
- 10.18. Compiler Optimization Hint Functions
- 10.18.1. __builtin_assume_aligned()
- 10.18.2. __builtin_assume()
- 10.18.3. __assume()
- 10.18.4. __builtin_expect()
- 10.18.5. __builtin_unreachable()
- 10.18.6. Restrictions
- 10.19. Warp Vote Functions
- 10.20. Warp Match Functions
- 10.20.1. Synopsis
- 10.20.2. Description
- 10.21. Warp Reduce Functions
- 10.21.1. Synopsis
- 10.21.2. Description
- 10.22. Warp Shuffle Functions
- 10.22.1. Synopsis
- 10.22.2. Description
- 10.22.3. Examples
- 10.22.3.1. Broadcast of a single value across a warp
- 10.22.3.2. Inclusive plus-scan across sub-partitions of 8 threads
- 10.22.3.3. Reduction across a warp
- 10.23. Nanosleep Function
- 10.23.1. Synopsis
- 10.23.2. Description
- 10.23.3. Example
- 10.24. Warp Matrix Functions
- 10.24.1. Description
- 10.24.2. Alternate Floating Point
- 10.24.3. Double Precision
- 10.24.4. Sub-byte Operations
- 10.24.5. Restrictions
- 10.24.6. Element Types and Matrix Sizes
- 10.24.7. Example
- 10.25. DPX
- 10.25.1. Examples
- 10.26. Asynchronous Barrier
- 10.26.1. Simple Synchronization Pattern
- 10.26.2. Temporal Splitting and Five Stages of Synchronization
- 10.26.3. Bootstrap Initialization, Expected Arrival Count, and Participation
- 10.26.4. A Barrier’s Phase: Arrival, Countdown, Completion, and Reset
- 10.26.5. Spatial Partitioning (also known as Warp Specialization)
- 10.26.6. Early Exit (Dropping out of Participation)
- 10.26.7. Completion Function
- 10.26.8. Memory Barrier Primitives Interface
- 10.26.8.1. Data Types
- 10.26.8.2. Memory Barrier Primitives API
- 10.27. Asynchronous Data Copies
- 10.27.1. memcpy_async API
- 10.27.2. Copy and Compute Pattern - Staging Data Through Shared Memory
- 10.27.3. Without memcpy_async
- 10.27.4. With memcpy_async
- 10.27.5. Asynchronous Data Copies using cuda::barrier
- 10.27.6. Performance Guidance for memcpy_async
- 10.27.6.1. Alignment
- 10.27.6.2. Trivially copyable
- 10.27.6.3. Warp Entanglement - Commit
- 10.27.6.4. Warp Entanglement - Wait
- 10.27.6.5. Warp Entanglement - Arrive-On
- 10.27.6.6. Keep Commit and Arrive-On Operations Converged
- 10.28. Asynchronous Data Copies using cuda::pipeline
- 10.28.1. Single-Stage Asynchronous Data Copies using cuda::pipeline
- 10.28.2. Multi-Stage Asynchronous Data Copies using cuda::pipeline
- 10.28.3. Pipeline Interface
- 10.28.4. Pipeline Primitives Interface
- 10.28.4.1. memcpy_async Primitive
- 10.28.4.2. Commit Primitive
- 10.28.4.3. Wait Primitive
- 10.28.4.4. Arrive On Barrier Primitive
- 10.29. Asynchronous Data Copies using the Tensor Memory Accelerator (TMA)
- 10.29.1. Using TMA to transfer one-dimensional arrays
- 10.29.2. Using TMA to transfer multi-dimensional arrays
- 10.29.2.1. Multi-dimensional TMA PTX wrappers
- 10.29.3. TMA Swizzle
- 10.29.3.1. Example ‘Matrix Transpose’
- 10.29.3.2. The Swizzle Modes
- 10.30. Encoding a Tensor Map on Device
- 10.30.1. Device-side Encoding and Modification of a Tensor Map
- 10.30.2. Usage of a Modified Tensor Map
- 10.30.3. Creating a Template Tensor Map Value Using the Driver API
- 10.31. Profiler Counter Function
- 10.32. Assertion
- 10.33. Trap function
- 10.34. Breakpoint Function
- 10.35. Formatted Output
- 10.35.1. Format Specifiers
- 10.35.2. Limitations
- 10.35.3. Associated Host-Side API
- 10.35.4. Examples
- 10.36. Dynamic Global Memory Allocation and Operations
- 10.36.1. Heap Memory Allocation
- 10.36.2. Interoperability with Host Memory API
- 10.36.3. Examples
- 10.36.3.1. Per Thread Allocation
- 10.36.3.2. Per Thread Block Allocation
- 10.36.3.3. Allocation Persisting Between Kernel Launches
- 10.37. Execution Configuration
- 10.38. Launch Bounds
- 10.39. Maximum Number of Registers per Thread
- 10.40. #pragma unroll
- 10.41. SIMD Video Instructions
- 10.42. Diagnostic Pragmas
- 10.43. Custom ABI Pragmas
- 10.44. CUDA C++ Memory Model
- 10.45. CUDA C++ Execution Model
- 11. Cooperative Groups
- 11.1. Introduction
- 11.2. What’s New in Cooperative Groups
- 11.2.1. CUDA 13.0
- 11.2.2. CUDA 12.2
- 11.2.3. CUDA 12.1
- 11.2.4. CUDA 12.0
- 11.3. Programming Model Concept
- 11.3.1. Composition Example
- 11.4. Group Types
- 11.4.1. Implicit Groups
- 11.4.1.1. Thread Block Group
- 11.4.1.2. Cluster Group
- 11.4.1.3. Grid Group
- 11.4.2. Explicit Groups
- 11.4.2.1. Thread Block Tile
- 11.4.2.1.1. Warp-Synchronous Code Pattern
- 11.4.2.1.2. Single Thread Group
- 11.4.2.2. Coalesced Groups
- 11.4.2.2.1. Discovery Pattern
- 11.5. Group Partitioning
- 11.5.1. tiled_partition
- 11.5.2. labeled_partition
- 11.5.3. binary_partition
- 11.6. Group Collectives
- 11.6.1. Synchronization
- 11.6.1.1. barrier_arrive and barrier_wait
- 11.6.1.2. sync
- 11.6.2. Data Transfer
- 11.6.2.1. memcpy_async
- 11.6.2.2. wait and wait_prior
- 11.6.3. Data Manipulation
- 11.6.3.1. reduce
- 11.6.3.2. Reduce Operators
- 11.6.3.3. inclusive_scan and exclusive_scan
- 11.6.4. Execution control
- 11.6.4.1. invoke_one and invoke_one_broadcast
- 11.7. Grid Synchronization
- 12. Cluster Launch Control
- 12.1. Introduction
- 12.2. Cluster Launch Control API Details
- 12.2.1. Thread block cancellation steps
- 12.2.2. Thread block cancellation constraints
- 12.2.3. Kernel Example: Vector-Scalar Multiplication
- 12.2.4. Cluster Launch Control for Thread Block Clusters
- 13. CUDA Dynamic Parallelism
- 13.1. Introduction
- 13.1.1. Overview
- 13.1.2. Glossary
- 13.2. Execution Environment and Memory Model
- 13.2.1. Execution Environment
- 13.2.1.1. Parent and Child Grids
- 13.2.1.2. Scope of CUDA Primitives
- 13.2.1.3. Synchronization
- 13.2.1.4. Streams and Events
- 13.2.1.5. Ordering and Concurrency
- 13.2.1.6. Device Management
- 13.2.2. Memory Model
- 13.2.2.1. Coherence and Consistency
- 13.2.2.1.1. Global Memory
- 13.2.2.1.2. Zero Copy Memory
- 13.2.2.1.3. Constant Memory
- 13.2.2.1.4. Shared and Local Memory
- 13.2.2.1.5. Local Memory
- 13.2.2.1.6. Texture Memory
- 13.3. Programming Interface
- 13.3.1. CUDA C++ Reference
- 13.3.1.1. Device-Side Kernel Launch
- 13.3.1.1.1. Launches are Asynchronous
- 13.3.1.1.2. Launch Environment Configuration
- 13.3.1.2. Streams
- 13.3.1.2.1. The Implicit (NULL) Stream
- 13.3.1.2.2. The Fire-and-Forget Stream
- 13.3.1.2.3. The Tail Launch Stream
- 13.3.1.3. Events
- 13.3.1.4. Synchronization
- 13.3.1.5. Device Management
- 13.3.1.6. Memory Declarations
- 13.3.1.6.1. Device and Constant Memory
- 13.3.1.6.2. Textures and Surfaces
- 13.3.1.6.3. Shared Memory Variable Declarations
- 13.3.1.6.4. Symbol Addresses
- 13.3.1.7. API Errors and Launch Failures
- 13.3.1.7.1. Launch Setup APIs
- 13.3.1.8. API Reference
- 13.3.2. Device-side Launch from PTX
- 13.3.2.1. Kernel Launch APIs
- 13.3.2.1.1. cudaLaunchDevice
- 13.3.2.1.2. cudaGetParameterBuffer
- 13.3.2.2. Parameter Buffer Layout
- 13.3.3. Toolkit Support for Dynamic Parallelism
- 13.3.3.1. Including Device Runtime API in CUDA Code
- 13.3.3.2. Compiling and Linking
- 13.4. Programming Guidelines
- 13.4.1. Basics
- 13.4.2. Performance
- 13.4.2.1. Dynamic-parallelism-enabled Kernel Overhead
- 13.4.3. Implementation Restrictions and Limitations
- 13.4.3.1. Runtime
- 13.4.3.1.1. Memory Footprint
- 13.4.3.1.2. Pending Kernel Launches
- 13.4.3.1.3. Configuration Options
- 13.4.3.1.4. Memory Allocation and Lifetime
- 13.4.3.1.5. SM Id and Warp Id
- 13.4.3.1.6. ECC Errors
- 13.5. CDP2 vs CDP1
- 13.5.1. Differences Between CDP1 and CDP2
- 13.5.2. Compatibility and Interoperability
- 13.6. Legacy CUDA Dynamic Parallelism (CDP1)
- 13.6.1. Execution Environment and Memory Model (CDP1)
- 13.6.1.1. Execution Environment (CDP1)
- 13.6.1.1.1. Parent and Child Grids (CDP1)
- 13.6.1.1.2. Scope of CUDA Primitives (CDP1)
- 13.6.1.1.3. Synchronization (CDP1)
- 13.6.1.1.4. Streams and Events (CDP1)
- 13.6.1.1.5. Ordering and Concurrency (CDP1)
- 13.6.1.1.6. Device Management (CDP1)
- 13.6.1.2. Memory Model (CDP1)
- 13.6.1.2.1. Coherence and Consistency (CDP1)
- 13.6.1.2.1.1. Global Memory (CDP1)
- 13.6.1.2.1.2. Zero Copy Memory (CDP1)
- 13.6.1.2.1.3. Constant Memory (CDP1)
- 13.6.1.2.1.4. Shared and Local Memory (CDP1)
- 13.6.1.2.1.5. Local Memory (CDP1)
- 13.6.1.2.1.6. Texture Memory (CDP1)
- 13.6.2. Programming Interface (CDP1)
- 13.6.2.1. CUDA C++ Reference (CDP1)
- 13.6.2.1.1. Device-Side Kernel Launch (CDP1)
- 13.6.2.1.1.1. Launches are Asynchronous (CDP1)
- 13.6.2.1.1.2. Launch Environment Configuration (CDP1)
- 13.6.2.1.2. Streams (CDP1)
- 13.6.2.1.2.1. The Implicit (NULL) Stream (CDP1)
- 13.6.2.1.3. Events (CDP1)
- 13.6.2.1.4. Synchronization (CDP1)
- 13.6.2.1.4.1. Block Wide Synchronization (CDP1)
- 13.6.2.1.5. Device Management (CDP1)
- 13.6.2.1.6. Memory Declarations (CDP1)
- 13.6.2.1.6.1. Device and Constant Memory (CDP1)
- 13.6.2.1.6.2. Textures and Surfaces (CDP1)
- 13.6.2.1.6.3. Shared Memory Variable Declarations (CDP1)
- 13.6.2.1.6.4. Symbol Addresses (CDP1)
- 13.6.2.1.7. API Errors and Launch Failures (CDP1)
- 13.6.2.1.7.1. Launch Setup APIs (CDP1)
- 13.6.2.1.8. API Reference (CDP1)
- 13.6.2.2. Device-side Launch from PTX (CDP1)
- 13.6.2.2.1. Kernel Launch APIs (CDP1)
- 13.6.2.2.1.1. cudaLaunchDevice (CDP1)
- 13.6.2.2.1.2. cudaGetParameterBuffer (CDP1)
- 13.6.2.2.2. Parameter Buffer Layout (CDP1)
- 13.6.2.3. Toolkit Support for Dynamic Parallelism (CDP1)
- 13.6.2.3.1. Including Device Runtime API in CUDA Code (CDP1)
- 13.6.2.3.2. Compiling and Linking (CDP1)
- 13.6.3. Programming Guidelines (CDP1)
- 13.6.3.1. Basics (CDP1)
- 13.6.3.2. Performance (CDP1)
- 13.6.3.2.1. Synchronization (CDP1)
- 13.6.3.2.2. Dynamic-parallelism-enabled Kernel Overhead (CDP1)
- 13.6.3.3. Implementation Restrictions and Limitations (CDP1)
- 13.6.3.3.1. Runtime (CDP1)
- 13.6.3.3.1.1. Memory Footprint (CDP1)
- 13.6.3.3.1.2. Nesting and Synchronization Depth (CDP1)
- 13.6.3.3.1.3. Pending Kernel Launches (CDP1)
- 13.6.3.3.1.4. Configuration Options (CDP1)
- 13.6.3.3.1.5. Memory Allocation and Lifetime (CDP1)
- 13.6.3.3.1.6. SM Id and Warp Id (CDP1)
- 13.6.3.3.1.7. ECC Errors (CDP1)
- 14. Virtual Memory Management
- 14.1. Introduction
- 14.2. Query for Support
- 14.3. Allocating Physical Memory
- 14.3.1. Shareable Memory Allocations
- 14.3.2. Memory Type
- 14.3.2.1. Compressible Memory
- 14.4. Reserving a Virtual Address Range
- 14.5. Virtual Aliasing Support
- 14.6. Mapping Memory
- 14.7. Controlling Access Rights
- 14.8. Fabric Memory
- 14.8.1. Query for Support
- 14.9. Multicast Support
- 14.9.1. Query for Support
- 14.9.2. Allocating Multicast Objects
- 14.9.3. Add Devices to Multicast Objects
- 14.9.4. Bind Memory to Multicast Objects
- 14.9.5. Use Multicast Mappings
- 15. Stream Ordered Memory Allocator
- 15.1. Introduction
- 15.2. Query for Support
- 15.3. API Fundamentals (cudaMallocAsync and cudaFreeAsync)
- 15.4. Memory Pools and the cudaMemPool_t
- 15.5. Default/Implicit Pools
- 15.6. Explicit Pools
- 15.7. Physical Page Caching Behavior
- 15.8. Resource Usage Statistics
- 15.9. Memory Reuse Policies
- 15.9.1. cudaMemPoolReuseFollowEventDependencies
- 15.9.2. cudaMemPoolReuseAllowOpportunistic
- 15.9.3. cudaMemPoolReuseAllowInternalDependencies
- 15.9.4. Disabling Reuse Policies
- 15.10. Device Accessibility for Multi-GPU Support
- 15.11. IPC Memory Pools
- 15.11.1. Creating and Sharing IPC Memory Pools
- 15.11.2. Set Access in the Importing Process
- 15.11.3. Creating and Sharing Allocations from an Exported Pool
- 15.11.4. IPC Export Pool Limitations
- 15.11.5. IPC Import Pool Limitations
- 15.12. Synchronization API Actions
- 15.13. Addendums
- 15.13.1. cudaMemcpyAsync Current Context/Device Sensitivity
- 15.13.2. cuPointerGetAttribute Query
- 15.13.3. cuGraphAddMemsetNode
- 15.13.4. Pointer Attributes
- 15.13.5. CPU Virtual Memory
- 16. Graph Memory Nodes
- 16.1. Introduction
- 16.2. Support and Compatibility
- 16.3. API Fundamentals
- 16.3.1. Graph Node APIs
- 16.3.2. Stream Capture
- 16.3.3. Accessing and Freeing Graph Memory Outside of the Allocating Graph
- 16.3.4. cudaGraphInstantiateFlagAutoFreeOnLaunch
- 16.4. Optimized Memory Reuse
- 16.4.1. Address Reuse within a Graph
- 16.4.2. Physical Memory Management and Sharing
- 16.5. Performance Considerations
- 16.5.1. First Launch / cudaGraphUpload
- 16.6. Physical Memory Footprint
- 16.7. Peer Access
- 16.7.1. Peer Access with Graph Node APIs
- 16.7.2. Peer Access with Stream Capture
- 16.8. Memory Nodes in Child Graphs
- 17. Mathematical Functions
- 17.1. Standard Functions
- 17.2. Intrinsic Functions
- 18. C++ Language Support
- 18.1. C++11 Language Features
- 18.2. C++14 Language Features
- 18.3. C++17 Language Features
- 18.4. C++20 Language Features
- 18.5. Restrictions
- 18.5.1. Host Compiler Extensions
- 18.5.2. Preprocessor Symbols
- 18.5.2.1. __CUDA_ARCH__
- 18.5.3. Qualifiers
- 18.5.3.1. Device Memory Space Specifiers
- 18.5.3.2. __managed__ Memory Space Specifier
- 18.5.3.3. Volatile Qualifier
- 18.5.4. Pointers
- 18.5.5. Operators
- 18.5.5.1. Assignment Operator
- 18.5.5.2. Address Operator
- 18.5.6. Run Time Type Information (RTTI)
- 18.5.7. Exception Handling
- 18.5.8. Standard Library
- 18.5.9. Namespace Reservations
- 18.5.10. Functions
- 18.5.10.1. External Linkage
- 18.5.10.2. Implicitly-declared and non-virtual explicitly-defaulted functions
- 18.5.10.3. Function Parameters
- 18.5.10.3.1. __global__ Function Argument Processing
- 18.5.10.3.2. Toolkit and Driver Compatibility
- 18.5.10.3.3. Link Compatibility across Toolkit Revisions
- 18.5.10.4. Static Variables within Function
- 18.5.10.5. Function Pointers
- 18.5.10.6. Function Recursion
- 18.5.10.7. Friend Functions
- 18.5.10.8. Operator Function
- 18.5.10.9. Allocation and Deallocation Functions
- 18.5.11. Classes
- 18.5.11.1. Data Members
- 18.5.11.2. Function Members
- 18.5.11.3. Virtual Functions
- 18.5.11.4. Virtual Base Classes
- 18.5.11.5. Anonymous Unions
- 18.5.11.6. Windows-Specific
- 18.5.12. Templates
- 18.5.13. Trigraphs and Digraphs
- 18.5.14. Const-qualified variables
- 18.5.15. Long Double
- 18.5.16. Deprecation Annotation
- 18.5.17. Noreturn Annotation
- 18.5.18. [[likely]] / [[unlikely]] Standard Attributes
- 18.5.19. const and pure GNU Attributes
- 18.5.20. __nv_pure__ Attribute
- 18.5.21. Intel Host Compiler Specific
- 18.5.22. C++11 Features
- 18.5.22.1. Lambda Expressions
- 18.5.22.2. std::initializer_list
- 18.5.22.3. Rvalue references
- 18.5.22.4. Constexpr functions and function templates
- 18.5.22.5. Constexpr variables
- 18.5.22.6. Inline namespaces
- 18.5.22.6.1. Inline unnamed namespaces
- 18.5.22.7. thread_local
- 18.5.22.8. __global__ functions and function templates
- 18.5.22.9. __managed__ and __shared__ variables
- 18.5.22.10. Defaulted functions
- 18.5.23. C++14 Features
- 18.5.23.1. Functions with deduced return type
- 18.5.23.2. Variable templates
- 18.5.24. C++17 Features
- 18.5.24.1. Inline Variable
- 18.5.24.2. Structured Binding
- 18.5.25. C++20 Features
- 18.5.25.1. Module support
- 18.5.25.2. Coroutine support
- 18.5.25.3. Three-way comparison operator
- 18.5.25.4. Consteval functions
- 18.6. Polymorphic Function Wrappers
- 18.7. Extended Lambdas
- 18.7.1. Extended Lambda Type Traits
- 18.7.2. Extended Lambda Restrictions
- 18.7.3. Notes on __host__ __device__ lambdas
- 18.7.4. *this Capture By Value
- 18.7.5. Additional Notes
- 18.8. Relaxed Constexpr (--expt-relaxed-constexpr)
- 18.9. Code Samples
- 18.9.1. Data Aggregation Class
- 18.9.2. Derived Class
- 18.9.3. Class Template
- 18.9.4. Function Template
- 18.9.5. Functor Class
- 19. Texture Fetching
- 19.1. Nearest-Point Sampling
- 19.2. Linear Filtering
- 19.3. Table Lookup
- 20. Compute Capabilities
- 20.1. Feature Availability
- 20.1.1. Architecture-Specific Features
- 20.1.2. Family-Specific Features
- 20.1.3. Feature Set Compiler Targets
- 20.2. Features and Technical Specifications
- 20.3. Floating-Point Standard
- 20.4. Compute Capability 5.x
- 20.4.1. Architecture
- 20.4.2. Global Memory
- 20.4.3. Shared Memory
- 20.5. Compute Capability 6.x
- 20.5.1. Architecture
- 20.5.2. Global Memory
- 20.5.3. Shared Memory
- 20.6. Compute Capability 7.x
- 20.6.1. Architecture
- 20.6.2. Independent Thread Scheduling
- 20.6.3. Global Memory
- 20.6.4. Shared Memory
- 20.7. Compute Capability 8.x
- 20.7.1. Architecture
- 20.7.2. Global Memory
- 20.7.3. Shared Memory
- 20.8. Compute Capability 9.0
- 20.8.1. Architecture
- 20.8.2. Global Memory
- 20.8.3. Shared Memory
- 20.8.4. Features Accelerating Specialized Computations
- 20.9. Compute Capability 10.0
- 20.9.1. Architecture
- 20.9.2. Global Memory
- 20.9.3. Shared Memory
- 20.9.4. Features Accelerating Specialized Computations
- 20.10. Compute Capability 12.0
- 20.10.1. Architecture
- 20.10.2. Global Memory
- 20.10.3. Shared Memory
- 20.10.4. Features Accelerating Specialized Computations
- 21. Driver API
- 21.1. Context
- 21.2. Module
- 21.3. Kernel Execution
- 21.4. Interoperability between Runtime and Driver APIs
- 21.5. Driver Entry Point Access
- 21.5.1. Introduction
- 21.5.2. Driver Function Typedefs
- 21.5.3. Driver Function Retrieval
- 21.5.3.1. Using the Driver API
- 21.5.3.2. Using the Runtime API
- 21.5.3.3. Retrieve Per-thread Default Stream Versions
- 21.5.3.4. Access New CUDA features
- 21.5.4. Guidelines for cuGetProcAddress
- 21.5.4.1. Guidelines for Runtime API Usage
- 21.5.5. Determining cuGetProcAddress Failure Reasons
- 22. CUDA Environment Variables
- 23. Error Log Management
- 23.1. Background
- 23.2. Activation
- 23.3. Output
- 23.4. API Description
- 23.5. Limitations and Known Issues
- 24. Unified Memory Programming
- 24.1. Unified Memory Introduction
- 24.1.1. System Requirements for Unified Memory
- 24.1.2. Programming Model
- 24.1.2.1. Allocation APIs for System-Allocated Memory
- 24.1.2.2. Allocation API for CUDA Managed Memory: cudaMallocManaged()
- 24.1.2.3. Global-Scope Managed Variables Using __managed__
- 24.1.2.4. Difference between Unified Memory and Mapped Memory
- 24.1.2.5. Pointer Attributes
- 24.1.2.6. Runtime detection of Unified Memory Support Level
- 24.1.2.7. GPU Memory Oversubscription
- 24.1.2.8. Performance Hints
- 24.1.2.8.1. Data Prefetching
- 24.1.2.8.2. Data Usage Hints
- 24.1.2.8.3. Querying Data Usage Attributes on Managed Memory
- 24.2. Unified memory on devices with full CUDA Unified Memory support
- 24.2.1. System-Allocated Memory: in-depth examples
- 24.2.1.1. File-backed Unified Memory
- 24.2.1.2. Inter-Process Communication (IPC) with Unified Memory
- 24.2.2. Performance Tuning
- 24.2.2.1. Memory Paging and Page Sizes
- 24.2.2.1.1. Choosing the right page size
- 24.2.2.1.2. CPU and GPU page tables: hardware coherency vs. software coherency
- 24.2.2.2. Direct Unified Memory Access from host
- 24.2.2.3. Host Native Atomics
- 24.2.2.4. Atomic accesses & synchronization primitives
- 24.2.2.5. Memcpy()/Memset() Behavior With Unified Memory
- 24.3. Unified memory on devices without full CUDA Unified Memory support
- 24.3.1. Unified memory on devices with only CUDA Managed Memory support
- 24.3.2. Unified memory on Windows or devices with compute capability 5.x
- 24.3.2.1. Data Migration and Coherency
- 24.3.2.2. GPU Memory Oversubscription
- 24.3.2.3. Multi-GPU
- 24.3.2.4. Coherency and Concurrency
- 24.3.2.4.1. GPU Exclusive Access To Managed Memory
- 24.3.2.4.2. Explicit Synchronization and Logical GPU Activity
- 24.3.2.4.3. Managing Data Visibility and Concurrent CPU + GPU Access with Streams
- 24.3.2.4.4. Stream Association Examples
- 24.3.2.4.5. Stream Attach With Multithreaded Host Programs
- 24.3.2.4.6. Advanced Topic: Modular Programs and Data Access Constraints
- 24.3.2.4.7. Memcpy()/Memset() Behavior With Stream-associated Unified Memory
- 25. Lazy Loading
- 25.1. What is Lazy Loading?
- 25.2. Lazy Loading version support
- 25.2.1. Driver
- 25.2.2. Toolkit
- 25.2.3. Compiler
- 25.3. Triggering loading of kernels in lazy mode
- 25.3.1. CUDA Driver API
- 25.3.2. CUDA Runtime API
- 25.4. Querying whether Lazy Loading is Turned On
- 25.5. Possible Issues when Adopting Lazy Loading
- 25.5.1. Concurrent Execution
- 25.5.2. Allocators
- 25.5.3. Autotuning
- 26. Extended GPU Memory
- 26.1. Preliminaries
- 26.1.1. EGM Platforms: System topology
- 26.1.2. Socket Identifiers: What are they? How to access them?
- 26.1.3. Allocators and EGM support
- 26.1.4. Memory management extensions to current APIs
- 26.2. Using the EGM Interface
- 26.2.1. Single-Node, Single-GPU
- 26.2.2. Single-Node, Multi-GPU
- 26.2.2.1. Using VMM APIs
- 26.2.2.2. Using CUDA Memory Pool
- 26.2.3. Multi-Node, Single-GPU
- 27. Notices
- 27.1. Notice
- 27.2. OpenCL
- 27.3. Trademarks