
    CUDA Pro Tip: How to Call Batched cuBLAS routines from CUDA Fortran

    CUDA Fortran for Scientists and Engineers shows how high-performance application developers can leverage the power of GPUs using Fortran.

    When dealing with small arrays and matrices, one method of exposing parallelism on the GPU is to execute the same cuBLAS call on multiple independent systems simultaneously. While you can do this manually by calling multiple cuBLAS kernels across multiple CUDA streams, batched cuBLAS routines enable such parallelism automatically for certain operations (GEMM, GETRF, GETRI, and TRSM). In this post I’ll show you how to leverage these batched routines from CUDA Fortran.
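
    To make the contrast concrete, here is a minimal sketch (not part of the original post) of the manual multi-stream approach for small GEMMs. The array names a_d, b_d, and c_d are illustrative, and data initialization, error checking, and stream destruction are omitted; the batched routines discussed below achieve the same concurrency with a single call.

    program manualStreams
      use cudafor
      use cublas
      implicit none
      integer, parameter :: n = 2, nbatch = 3
      real, device :: a_d(n,n,nbatch), b_d(n,n,nbatch), c_d(n,n,nbatch)
      integer(kind=cuda_stream_kind) :: stream(nbatch)
      type(cublasHandle) :: h
      integer :: k, istat

      istat = cublasCreate(h)
      do k = 1, nbatch
        istat = cudaStreamCreate(stream(k))
        ! associate the handle with stream k so the k-th GEMM can overlap with the others
        istat = cublasSetStream(h, stream(k))
        istat = cublasSgemm_v2(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, 1.0, &
                               a_d(:,:,k), n, b_d(:,:,k), n, 0.0, c_d(:,:,k), n)
      end do
      istat = cudaDeviceSynchronize()
      istat = cublasDestroy(h)
    end program manualStreams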

    The C interface of the batched cuBLAS functions uses an array of pointers as one of its arguments, where each pointer in the array points to an independent matrix. This poses a problem for Fortran, which does not allow arrays of pointers. To accommodate this argument, we can make use of CUDA Fortran’s C interoperability data types, in particular the c_devptr type (the device counterpart of the c_ptr type from the ISO_C_BINDING module). Let’s illustrate this with a code that calls the batched SGETRF cuBLAS routine.

    Writing Interfaces to Batched cuBLAS Routines

    At the time of writing this post, the batched cuBLAS routines are not in the CUDA Fortran cublas module, so we first need to define the interface to the cublasSgetrfBatched() call:

    interface
      integer(c_int) function &
          cublasSgetrfBatched(h, n, Aarray, lda, ipvt, info, batchSize) &
          bind(c,name='cublasSgetrfBatched')
        use iso_c_binding
        use cublas
        type(cublasHandle), value :: h
        integer(c_int), value :: n
        type(c_devptr), device :: Aarray(*)
        integer(c_int), value :: lda
        integer(c_int), device :: ipvt(*)
        integer(c_int), device :: info(*)
        integer(c_int), value :: batchSize
      end function cublasSgetrfBatched
    end interface

    The arguments of cublasSgetrfBatched() are:

    • h, the cuBLAS handle obtained from cublasCreate();
    • n, the size of the n×n matrices;
    • Aarray, the array of pointers to the matrices;
    • lda, the leading dimension of the matrices;
    • ipvt, an output array containing pivot information;
    • info, another output array containing factorization status information;
    • and batchSize, the number of independent n×n matrices on which to perform factorization.

    Of particular importance in this interface is the use of the variable attributes device and value. The cuBLAS handle h, the matrix size n, the leading dimension lda, and the number of matrices batchSize are all scalar parameters declared in host memory and are therefore passed by value. The integer arrays ipvt and info are output arrays allocated on the device, hence the device attribute. Of special note in this interface is the declaration of the array of pointers Aarray:

      type(c_devptr), device :: Aarray(*)

    which is a device array of device pointers. Contrast this to a declaration such as:

      type(c_devptr) :: Aarray(*)

    which is a host array of device pointers. The batched cuBLAS calls need the former declaration because cuBLAS distributes the matrices to kernel threads on the device.

    Calling Batched cuBLAS from Host Code

    Having defined our interface to cublasSgetrfBatched(), we can now turn to the host code that calls this function. The complete code follows.

    program testgetrfBatched
      use cudafor
      use cublas
      use iso_c_binding
      implicit none
      integer, parameter :: n=2, nbatch=3, lda=n
      real :: a(n,n,nbatch)
      real, device :: a_d(n,n,nbatch)
      type(c_devptr) :: devPtrA(nbatch)
      type(c_devptr), device :: devPtrA_d(nbatch)
      type(cublasHandle) :: h1
      integer :: ipvt(n*nbatch), info(nbatch)
      integer, device :: ipvt_d(n*nbatch), info_d(nbatch)
      integer :: i, k, istat

      interface
        integer(c_int) function cublasSgetrfBatched(h, n, Aarray, lda, ipvt, info, batchSize) &
          bind(c,name='cublasSgetrfBatched')
          use iso_c_binding
          use cublas
          type(cublasHandle), value :: h
          integer(c_int), value :: n
          type(c_devptr), device :: Aarray(*)
          integer(c_int), value :: lda
          integer(c_int), device :: ipvt(*)
          integer(c_int), device :: info(*)
          integer(c_int), value :: batchSize
        end function cublasSgetrfBatched
      end interface
    
      ! initialize arrays and transfer to device
      do k = 1, nbatch
        a(1,1,k) = 6.0*k
        a(2,1,k) = 4.0*k
        a(1,2,k) = 3.0*k
        a(2,2,k) = 3.0*k
      end do
      a_d = a
    
      write(*,"(/,'Input:')")
      do k = 1, nbatch
        write(*,"(2x,'Matrix: ', i0)") k
        do i=1, n
          write(*,*) a(i,:,k)
        enddo
      enddo
    
      ! build an array of pointers
    
      do k = 1, nbatch
        devPtrA(k) = c_devloc(a_d(1,1,k))
      end do
      devPtrA_d = devPtrA
    
      ! create handle, call cublasSgetrfBatched, and destroy handle
    
      istat = cublasCreate(h1)
      if (istat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasCreate failed'
      istat = cublasSgetrfBatched(h1, n, devPtrA_d, lda, ipvt_d, info_d, nbatch)
      if (istat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasSgetrfBatched failed: ', istat
      istat = cublasDestroy(h1)
      if (istat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasDestroy failed'
    
      a = a_d
      write(*,"(/, 'LU Factorization:')")
      do k = 1, nbatch
        write(*,"(2x,'Matrix: ', i0)") k
        do i = 1, n
          write(*,*) a(i,:,k)
        enddo
      enddo
    end program testgetrfBatched

    While most of this code is straightforward, the code that assembles the device array of device pointers, devPtrA_d, merits some discussion.

      do k = 1, nbatch
        devPtrA(k) = c_devloc(a_d(1,1,k))
      end do
      devPtrA_d = devPtrA

    The host function c_devloc() determines the addresses of the nbatch matrices stored in the three-dimensional array a_d. In general the matrices can be distributed across multiple variables, but we use a single three-dimensional array for convenience. The results from c_devloc() are first stored in devPtrA, a host array of device pointers, and then transferred to the device array devPtrA_d, which is used a few lines later in the call to cublasSgetrfBatched():

    istat = cublasSgetrfBatched(h1, n, devPtrA_d, lda, ipvt_d, info_d, nbatch)
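
    The info_d array returned by the batched call reports the status of each factorization. As a small addition (not part of the original program), you could copy it back to the host array info and inspect each entry; a value of 0 indicates that the corresponding matrix was factored successfully, and the meaning of nonzero values is described in the cuBLAS documentation.

      info = info_d
      do k = 1, nbatch
        ! info(k) == 0 means matrix k was factored successfully
        if (info(k) /= 0) write(*,*) 'matrix ', k, ': getrf returned info = ', info(k)
      end do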

    Other Batched Routines

    We can use a similar approach for the other batched cuBLAS routines: cublas*getriBatched(), cublas*gemmBatched(), and cublas*trsmBatched(). Note that in cublas*gemmBatched() and cublas*trsmBatched(), the parameters alpha and beta are scalar values passed by reference, which can reside either on the host or the device depending on the cuBLAS pointer mode. In such cases it is helpful to write a generic interface such as the following.

    interface cublasSgemmBatched
      integer(c_int) function &
          cublasSgemmBatched_hpm(h, transa, transb, &
            m, n, k, alpha, Aarray, lda, Barray, ldb, &
            beta, Carray, ldc, batchCount) &
            bind(c,name='cublasSgemmBatched')
        use iso_c_binding
        use cublas
        type(cublasHandle), value :: h
        integer, value :: transa
        integer, value :: transb
        integer(c_int), value :: m, n, k
        real :: alpha
        type(c_devptr), device :: Aarray(*)
        integer(c_int), value :: lda
        type(c_devptr), device :: Barray(*)
        integer(c_int), value :: ldb
        real :: beta
        type(c_devptr), device :: Carray(*)
        integer(c_int), value :: ldc
        integer(c_int), value :: batchCount
      end function cublasSgemmBatched_hpm

      integer(c_int) function &
          cublasSgemmBatched_dpm(h, transa, transb, &
            m, n, k, alpha, Aarray, lda, Barray, ldb, &
            beta, Carray, ldc, batchCount) &
            bind(c,name='cublasSgemmBatched')
        use iso_c_binding
        use cublas
        type(cublasHandle), value :: h
        integer, value :: transa
        integer, value :: transb
        integer(c_int), value :: m, n, k
        real, device :: alpha
        type(c_devptr), device :: Aarray(*)
        integer(c_int), value :: lda
        type(c_devptr), device :: Barray(*)
        integer(c_int), value :: ldb
        real, device :: beta
        type(c_devptr), device :: Carray(*)
        integer(c_int), value :: ldc
        integer(c_int), value :: batchCount
      end function cublasSgemmBatched_dpm
    end interface cublasSgemmBatched

    In this interface the *_hpm and *_dpm specific functions are for host and device pointer modes, respectively; their interfaces differ only by the absence or presence of the device variable attribute for the alpha and beta arguments. You could further write a wrapper function that calls cublasSetPointerMode() appropriately before executing the batched cuBLAS routine, as in the sketch below.
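
    The following is a minimal sketch of such a wrapper for the case where alpha and beta reside in device memory. The wrapper name sgemmBatched_dev is hypothetical, the generic interface above is assumed to be visible (for example, from a module), and error checking is omitted.

    integer function sgemmBatched_dev(h, transa, transb, m, n, k, &
        alpha, Aarray, lda, Barray, ldb, beta, Carray, ldc, batchCount)
      use iso_c_binding
      use cublas
      implicit none
      type(cublasHandle) :: h
      integer :: transa, transb
      integer(c_int) :: m, n, k, lda, ldb, ldc, batchCount
      real, device :: alpha, beta                 ! device-resident scalars
      type(c_devptr), device :: Aarray(*), Barray(*), Carray(*)
      integer :: istat

      ! alpha and beta live in device memory, so select the device pointer mode
      istat = cublasSetPointerMode(h, CUBLAS_POINTER_MODE_DEVICE)
      sgemmBatched_dev = cublasSgemmBatched(h, transa, transb, m, n, k, &
          alpha, Aarray, lda, Barray, ldb, beta, Carray, ldc, batchCount)
      ! restore the default host pointer mode before returning
      istat = cublasSetPointerMode(h, CUBLAS_POINTER_MODE_HOST)
    end function sgemmBatched_dev

    Call sites can then invoke the wrapper directly without having to manage the pointer mode themselves.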

    Check out more on CUDA Fortran as well as more CUDA Pro Tips at the NVIDIA Developer Blog.
