CUDA Pro Tip: The Fast Way to Query Device Properties

This post was updated in April 2025 to reflect performance on current hardware and software.

CUDA applications often need to know the maximum available shared memory per block or the number of multiprocessors in the active GPU. One way to get this information is to call cudaGetDeviceProperties(). Unfortunately, calling this function inside a performance-critical section of your code can slow it down dramatically. We found out the hard way when cudaGetDeviceProperties() caused a 20x slowdown in the Random Forests algorithm in cuML.

Here is a very simple CUDA “pro tip”: cudaDeviceGetAttribute() is a much faster way to query device properties.

Just the Facts You Need

Typically you don’t need to know all the properties of the GPU you are running on. Often you just need one or two, like the maximum block size, the number of multiprocessors, or the maximum shared memory per block. But cudaGetDeviceProperties() gives you everything, whether you need it or not. Calling it is usually overkill, and you pay for it: some device properties require PCIe reads to query, which are expensive.

In contrast, cudaDeviceGetAttribute() gives you one attribute per call, just the one you ask for. That makes it much faster for most attributes. We are talking orders of magnitude faster: nanoseconds vs. close to a millisecond. Let’s get some numbers.

Benchmarking Device Attribute Queries

We wrote a simple benchmark to compare the performance of cudaGetDeviceProperties() and cudaDeviceGetAttribute(). The timings were captured using a single NVIDIA GH200 with driver v570.140 and CUDA Toolkit 12.8. The benchmark compares getting a full cudaDeviceProp struct using cudaGetDeviceProperties() to just querying the maximum shared memory per block and number of multiprocessors using two calls to cudaDeviceGetAttribute(). It averages the run-time over 100 iterations. Here’s the test code for cudaGetDeviceProperties().

#include <iostream>
#include <chrono>

using namespace std;

int main() {

    cudaDeviceProp prop;
    int devId;
    cudaGetDevice(&devId);
    auto start = chrono::high_resolution_clock::now();

    for(int i = 0; i < 100; ++i) {
        cudaError_t err = cudaGetDeviceProperties(&prop, devId);
    }

    auto end = chrono::high_resolution_clock::now();
    cout << "cudaGetDeviceProperties -> "
         << chrono::duration_cast<chrono::microseconds>(end - start).count() / 100.0
         << "us" << endl;

    return 0;
} /* end main */

Output:

cudaGetDeviceProperties -> 864.39us

Here’s the test code for cudaDeviceGetAttribute().

#include <iostream>
#include <chrono>

using namespace std;

int main() {

    int smemSize, numProcs, devId;
    cudaGetDevice(&devId);
    auto start = chrono::high_resolution_clock::now();
    for (int i = 0; i < 100; ++i) {
        cudaDeviceGetAttribute(&smemSize,
            cudaDevAttrMaxSharedMemoryPerBlock, devId);
        cudaDeviceGetAttribute(&numProcs,
            cudaDevAttrMultiProcessorCount, devId);
    }

    auto end = chrono::high_resolution_clock::now();
    cout << "cudaDeviceGetAttribute -> "
        << chrono::duration_cast<chrono::microseconds>(end - start).count() / 100.0
        << "us" << endl;

    return 0;
} /* end main */

Output:

cudaDeviceGetAttribute -> 0.03us

As you can see, cudaDeviceGetAttribute() is four orders of magnitude faster than cudaGetDeviceProperties() for these attributes: 30 nanoseconds vs. 0.864 milliseconds.
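Both benchmarks time a loop and divide by the iteration count. If you want to repeat the measurement on your own system, that pattern can be factored into a small helper. The sketch below is not the code we used for the numbers above: averageMicroseconds is a hypothetical name, and it swaps in std::chrono::steady_clock with a floating-point duration, on the assumption that you want to keep the sub-microsecond part of very short runs rather than truncate it.

```cpp
#include <chrono>

// Hypothetical helper (not part of the original benchmark): calls `fn`
// `iterations` times and returns the mean wall-clock time per call in
// microseconds.
template <typename Fn>
double averageMicroseconds(Fn&& fn, int iterations) {
    if (iterations <= 0) {
        return 0.0;  // nothing to measure
    }

    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        fn();
    }
    const auto end = Clock::now();

    // A floating-point duration keeps fractions of a microsecond instead of
    // truncating the total to whole microseconds before dividing.
    const std::chrono::duration<double, std::micro> elapsed = end - start;
    return elapsed.count() / iterations;
}
```

In a CUDA source file you would pass a lambda that wraps the query you want to time, for example one that calls cudaDeviceGetAttribute() for a single attribute.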

Caution: Some Attributes are Expensive

As we mentioned, some device properties require expensive PCIe reads, which is why cudaGetDeviceProperties() is slow. For the same reason, the following properties are much slower than others to query using cudaDeviceGetAttribute(): cudaDevAttrClockRate, cudaDevAttrKernelExecTimeout, cudaDevAttrMemoryClockRate, and cudaDevAttrSingleToDoublePrecisionPerfRatio.
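If you do need one of these slower attributes in a hot path, one option is to query it once and reuse the value. The following is a minimal sketch of that idea, assuming the attribute does not change while your application runs. AttributeCache is a hypothetical helper, not part of the CUDA runtime. The query function is passed in as a plain callable that mirrors the signature of cudaDeviceGetAttribute() with int arguments, so the sketch compiles without the CUDA headers; the comment at the end shows how it might be wired to the real call.

```cpp
#include <functional>
#include <map>
#include <utility>

// Hypothetical helper: remembers integer device attributes so that each
// (device, attribute) pair is queried at most once. Not thread-safe.
class AttributeCache {
public:
    // Mirrors cudaDeviceGetAttribute(int*, cudaDeviceAttr, int), with the
    // enum and the error code passed as plain ints. A return value of 0
    // means success (cudaSuccess is 0).
    using QueryFn = std::function<int(int* value, int attr, int device)>;

    explicit AttributeCache(QueryFn query) : query_(std::move(query)) {}

    // Writes the attribute to *out and returns true on success.
    // Failed queries are not cached, so they are retried on the next call.
    bool get(int attr, int device, int* out) {
        const auto key = std::make_pair(device, attr);
        const auto it = cache_.find(key);
        if (it != cache_.end()) {
            *out = it->second;  // cached: no call into the runtime
            return true;
        }

        int value = 0;
        if (query_(&value, attr, device) != 0) {
            return false;
        }
        cache_.emplace(key, value);
        *out = value;
        return true;
    }

private:
    QueryFn query_;
    std::map<std::pair<int, int>, int> cache_;
};

// Wiring it to the CUDA runtime (requires <cuda_runtime.h>) might look like:
//   AttributeCache cache([](int* value, int attr, int device) {
//       return static_cast<int>(cudaDeviceGetAttribute(
//           value, static_cast<cudaDeviceAttr>(attr), device));
//   });
//   int clockKHz = 0;
//   cache.get(cudaDevAttrClockRate, 0, &clockKHz);
```

Only the first lookup of each attribute pays the query cost; later lookups are a map access. If several threads share the cache, you would need to add your own locking.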