Using the NVIDIA CUDA Stream-Ordered Memory Allocator, Part 1
Most CUDA developers are familiar with the cudaMalloc and cudaFree API functions to allocate GPU accessible memory. However, these API functions have long had a shortcoming: they aren't stream ordered. In this post, we introduce new API functions, cudaMallocAsync and cudaFreeAsync, that enable memory allocation and deallocation to be stream-ordered operations. In part 2 of this series, we highlight the benefits of this new capability by sharing some big data benchmark results, and we provide a code migration guide for modifying your existing applications. We also cover advanced topics to take advantage of stream-ordered memory allocation in the context of multi-GPU access and the use of IPC. This all helps you improve performance within your existing applications.

The first pattern in the following code example is inefficient because the first cudaFree call has to wait for kernelA to finish, so it synchronizes the device before freeing the memory. To make this run more efficiently, the memory can be allocated up front and sized to the larger of the two sizes, as shown in the second pattern.
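A minimal sketch of the two patterns the text describes; the kernels, sizes, and launch configuration are placeholders, not the article's original listing:

```cuda
#include <cuda_runtime.h>

__global__ void kernelA(void* p);  // placeholder kernels
__global__ void kernelB(void* p);

// Inefficient: each cudaFree must wait for the preceding kernel, so it
// synchronizes the whole device before the memory can be released.
void inefficient(cudaStream_t stream, size_t sizeA, size_t sizeB) {
    void *ptrA, *ptrB;
    cudaMalloc(&ptrA, sizeA);
    kernelA<<<1, 256, 0, stream>>>(ptrA);
    cudaFree(ptrA);                       // forces device synchronization
    cudaMalloc(&ptrB, sizeB);
    kernelB<<<1, 256, 0, stream>>>(ptrB);
    cudaFree(ptrB);                       // forces device synchronization
}

// Workaround: allocate once, sized to the larger of the two, and reuse it.
void workaround(cudaStream_t stream, size_t sizeA, size_t sizeB) {
    void* ptr;
    cudaMalloc(&ptr, sizeA > sizeB ? sizeA : sizeB);
    kernelA<<<1, 256, 0, stream>>>(ptr);
    kernelB<<<1, 256, 0, stream>>>(ptr);
    cudaFree(ptr);                        // one synchronization at the end
}
```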
This increases code complexity in the application because the memory management code is separated from the business logic. The problem is exacerbated when other libraries are involved. This is even harder for the application to make efficient because it may not have complete visibility or control over what the library is doing. To work around this problem, the library would have to allocate memory the first time one of its functions is invoked and never free it until the library is deinitialized. This not only increases code complexity, but it also causes the library to hold on to the memory longer than it needs to, potentially denying another part of the application the use of that memory. Some applications take the idea of allocating memory up front even further by implementing their own custom allocator. This adds a significant amount of complexity to application development. CUDA aims to provide a low-effort, high-performance alternative.
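A minimal sketch of that first-invocation caching workaround; the names libraryWork and kernel are hypothetical, and thread safety is ignored for brevity:

```cuda
#include <cuda_runtime.h>

__global__ void kernel(void* buf);  // placeholder for the library's work

// The library allocates on first use and never frees until deinitialization,
// holding on to the memory even while no call is in flight.
void libraryWork(cudaStream_t stream, size_t size) {
    static void* cached = nullptr;
    static size_t cachedSize = 0;
    if (size > cachedSize) {
        if (cached) cudaFree(cached);   // synchronizes the whole device
        cudaMalloc(&cached, size);
        cachedSize = size;
    }
    kernel<<<1, 256, 0, stream>>>(cached);
    // No cudaFree here: the allocation outlives this call.
}
```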
CUDA 11.2 introduced a stream-ordered memory allocator to solve these types of problems, with the addition of cudaMallocAsync and cudaFreeAsync. These new API functions shift memory allocation from global-scope operations that synchronize the entire device to stream-ordered operations that enable you to compose memory management with GPU work submission. This eliminates the need to synchronize outstanding GPU work and helps restrict the lifetime of the allocation to the GPU work that accesses it. All the usual stream-ordering rules apply to cudaMallocAsync and cudaFreeAsync. The memory returned from cudaMallocAsync can be accessed by any kernel or memcpy operation as long as the kernel or memcpy is ordered to execute after the allocation operation and before the deallocation operation, in stream order. Deallocation can be performed in any stream, as long as it is ordered to execute after the allocation operation and after all accesses, on all streams, to that memory on the GPU. It is now possible to manage memory at function scope, as in the following example of a library function launching kernelA.
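A sketch of such a function, assuming a hypothetical name, size argument, and launch configuration (only cudaMallocAsync, cudaFreeAsync, and kernelA come from the text above):

```cuda
#include <cuda_runtime.h>

__global__ void kernelA(void* p);  // placeholder kernel

// Function-scoped, stream-ordered memory management: the allocation,
// the kernel launch, and the free are all ordered on the same stream,
// so no device-wide synchronization is needed and the allocation's
// lifetime is tied to the work that accesses it.
void libraryFunc(cudaStream_t stream, size_t size) {
    void* ptr = nullptr;
    cudaMallocAsync(&ptr, size, stream);  // ordered on the stream
    kernelA<<<1, 256, 0, stream>>>(ptr);  // runs after the allocation
    cudaFreeAsync(ptr, stream);           // ordered after kernelA
}
```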
In effect, stream-ordered allocation behaves as if allocation and free were kernels. If kernelA produces a valid buffer on a stream and kernelB invalidates it in the same stream, then an application is free to access the buffer after kernelA and before kernelB in the appropriate stream order.

Memory allocation and deallocation cannot fail asynchronously. Memory errors that occur because of a call to cudaMallocAsync or cudaFreeAsync (for example, out of memory) are reported immediately through an error code returned from the call. If cudaMallocAsync completes successfully, the returned pointer is guaranteed to be a valid pointer to memory that is safe to access in the appropriate stream order. The CUDA driver uses memory pools to achieve this behavior of returning a pointer immediately. The following example shows various valid usages; as Figure 1 illustrates for this example, all kernels are ordered to execute after the allocation operation and complete before the deallocation operation.
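A sketch of those usages, assuming streamA through streamD and event have already been created, and kernelA, kernelB, and kernelC are placeholder kernels:

```cuda
cudaMallocAsync(&ptr, size, streamA);

// Work launched in the allocating stream can access the memory.
kernelA<<<1, 256, 0, streamA>>>(ptr);

// Work in another stream can access it once a dependency on the
// allocating stream is established.
cudaEventRecord(event, streamA);
cudaStreamWaitEvent(streamB, event, 0);
kernelB<<<1, 256, 0, streamB>>>(ptr);

// Synchronizing the host with a point ordered after the allocation
// also makes the memory safe to access from any stream.
cudaEventSynchronize(event);
kernelC<<<1, 256, 0, streamC>>>(ptr);

// The deallocation must be ordered after all accesses on all streams:
// an event dependency orders streamD after streamB's access...
cudaEventRecord(event, streamB);
cudaStreamWaitEvent(streamD, event, 0);
// ...and synchronizing streamC ensures its access has completed.
cudaStreamSynchronize(streamC);
cudaFreeAsync(ptr, streamD);
```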