• <xmp id="om0om">
  • Simulation / Modeling / Design

    Customize CUDA Fortran Profiling with NVTX

    The NVIDIA Tools Extension (NVTX) library lets developers annotate custom events and ranges within the profiling timelines generated by tools such as the NVIDIA Visual Profiler (NVVP) and Nsight. In my own optimization work, I rely heavily on NVTX to better understand internal as well as customer codes and to spot opportunities for better interaction between the CPU and the GPU.

    Two previous Pro Tip posts on Parallel Forall showed how to use NVTX in CUDA C++ and MPI codes. In this post, I’ll show how to use NVTX to annotate the profiles of Fortran codes (with either CUDA Fortran or OpenACC).

    NVTX has a lot of features, but here I’ll focus on using it to annotate the profiler output with timeline markers using nvtxRangePush() and nvtxRangePop(). I’ll show you how to insert markers with custom labels and colors.

    To make it easy, I’ve written a Fortran module to instrument CUDA/OpenACC Fortran codes. It works like the macro that Jiri Kraus wrote about in his post on NVTX. The nvtx module is simple to use: after importing it with use nvtx, call nvtxStartRange() and nvtxEndRange() to open and close a range in the timeline. Calls to nvtxStartRange() with a single argument generate green markers; an optional second integer argument selects one of seven predefined colors.

    The following test code generates a green labeled range encompassing the whole run, and 14 custom ranges labeled with the iteration number that cycle through the predefined set of seven colors.

    program main
      use nvtx
      implicit none
      character(len=4) :: itcount
      integer :: n
    
      ! First range with standard color
      call nvtxStartRange("First label")
    
      do n=1,14
        ! Create custom label for each marker
        write(itcount,'(i4)') n
    
        ! Range with custom  color
        call nvtxStartRange("Label "//itcount,n)
    
        ! Add sleep to make markers big 
        call sleep(1)
    
        call nvtxEndRange
      end do
    
      call nvtxEndRange
    end program main
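
    To see which color each iteration ends up with, the index expression used by the module listed later in this post, col(mod(id,7)+1), can be tabulated on its own. The following is a minimal sketch in plain Fortran that needs neither NVTX nor a GPU. The color names are my own reading of the ARGB values in the col array and are not part of the module, so treat them as assumptions.

    ```fortran
    program color_cycle
      implicit none
      ! Assumed names for the seven ARGB entries of col(1:7), in order
      character(len=7), parameter :: names(7) = [ character(len=7) :: &
          "green", "blue", "yellow", "magenta", "cyan", "red", "white" ]
      integer :: n, idx

      do n = 1, 14
        ! Same index expression as in nvtxStartRange
        idx = mod(n,7) + 1
        print '(a,i2,a,i1,2a)', "id ", n, " -> col(", idx, ") ", trim(names(idx))
      end do
    end program color_cycle
    ```

    Because of the +1 offset, an id of 1 selects the second entry of the array, and ids that are multiples of 7 select the first entry, which is the same green used for ranges without an id.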

    To compile the code and generate an executable, pass the compiler the directory that contains the libnvToolsExt library (usually /usr/local/cuda/lib on 32-bit systems or /usr/local/cuda/lib64 on 64-bit systems) and the library name, as follows.

    $ pgf90 nvtx.cuf -L/usr/local/cuda/lib -lnvToolsExt

    You can quickly generate profiler output by running nvprof and saving it to a file with the “-o” flag:

    $ nvprof -o profiler.output ./a.out
     ==10653== NVPROF is profiling process 10653, command: ./a.out
     ==10653== Generated result file: /Users/mfatica/profiler.output

    In versions before CUDA 7.5, you had to import the file by hand: select “File->Import”, then select “Nvprof”, click “Next”, select “Single process”, and then browse to the output of nvprof. With CUDA 7.5, you can open the output in NVVP straight from the command line:

    $ nvvp -o profiler.output

    NVVP generates the timeline in Figure 1.

     

    Figure 1: The NVIDIA Visual Profiler (NVVP) profile timeline showing custom ranges and labels inserted using NVTX.

    The example had no GPU kernels, so the timeline isn’t very interesting. But you can use the same methodology to generate more complex traces with CPU and GPU markers, as Figure 2 shows.

    Figure 2: Profiler timeline for a more complex application showing CUDA Fortran kernels, API calls, and custom ranges inserted using NVTX.

    NVTX Fortran Module Code

    Following is the code for the nvtx module. The code uses the Fortran ISO C Binding module to create an interface to the NVTX C functions. It also uses the “optional” keyword to handle the custom color parameter.

    module nvtx
    
    use iso_c_binding
    implicit none
    
    integer,private :: col(7) = [ Z'0000ff00', Z'000000ff', Z'00ffff00', Z'00ff00ff', Z'0000ffff', Z'00ff0000', Z'00ffffff']
    character(len=256),private,target :: tempName  ! target attribute is required by c_loc
    
    type, bind(C):: nvtxEventAttributes
      integer(C_INT16_T):: version=1
      integer(C_INT16_T):: size=48 ! size of this structure in bytes
      integer(C_INT):: category=0
      integer(C_INT):: colorType=1 ! NVTX_COLOR_ARGB = 1
      integer(C_INT):: color
      integer(C_INT):: payloadType=0 ! NVTX_PAYLOAD_UNKNOWN = 0
      integer(C_INT):: reserved0
      integer(C_INT64_T):: payload   ! union uint,int,double
      integer(C_INT):: messageType=1  ! NVTX_MESSAGE_TYPE_ASCII     = 1 
      type(C_PTR):: message  ! ascii char
    end type
    
    interface nvtxRangePush
      ! push range with custom label and standard color
      subroutine nvtxRangePushA(name) bind(C, name='nvtxRangePushA')
      use iso_c_binding
      character(kind=C_CHAR,len=*) :: name
      end subroutine
    
      ! push range with custom label and custom color
      subroutine nvtxRangePushEx(event) bind(C, name='nvtxRangePushEx')
      use iso_c_binding
      import:: nvtxEventAttributes
      type(nvtxEventAttributes):: event
      end subroutine
    end interface
    
    interface nvtxRangePop
      subroutine nvtxRangePop() bind(C, name='nvtxRangePop')
      end subroutine
    end interface
    
    contains
    
    subroutine nvtxStartRange(name,id)
      character(kind=c_char,len=*) :: name
      integer, optional:: id
      type(nvtxEventAttributes):: event
    
      tempName=trim(name)//c_null_char
    
      if ( .not. present(id)) then
        call nvtxRangePush(tempName)
      else
        event%color=col(mod(id,7)+1)
        event%message=c_loc(tempName)
        call nvtxRangePushEx(event)
      end if
    end subroutine
    
    subroutine nvtxEndRange
      call nvtxRangePop
    end subroutine
    
    end module nvtx
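
    The size=48 value in the derived type has to match the size of the structure as the C side sees it. As a quick sanity check, here is a small sketch that compares the hard-coded value with what the Fortran compiler computes using c_sizeof from iso_c_binding (Fortran 2008). It uses a local copy of the type so that it builds without NVTX. The program name is hypothetical, and I am assuming a compiler that supports c_sizeof.

    ```fortran
    program check_nvtx_struct_size
      use iso_c_binding
      implicit none

      ! Local copy of the derived type from the nvtx module (illustration only)
      type, bind(C) :: nvtxEventAttributes
        integer(C_INT16_T) :: version = 1
        integer(C_INT16_T) :: size = 48
        integer(C_INT) :: category = 0
        integer(C_INT) :: colorType = 1
        integer(C_INT) :: color
        integer(C_INT) :: payloadType = 0
        integer(C_INT) :: reserved0
        integer(C_INT64_T) :: payload
        integer(C_INT) :: messageType = 1
        type(C_PTR) :: message
      end type

      type(nvtxEventAttributes) :: event

      ! Compare the hard-coded size field with the compiler's view of the type
      if (c_sizeof(event) == int(event%size, c_size_t)) then
        print *, "size field matches c_sizeof:", c_sizeof(event)
      else
        print *, "size mismatch: c_sizeof =", c_sizeof(event), " size field =", event%size
      end if
    end program check_nvtx_struct_size
    ```

    On a typical 64-bit system the two values should agree, since the 8-byte payload and message members force padding after messageType. On a platform with 4-byte pointers the layout may differ, and the size field would need adjusting.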

    You can customize the interface to your needs. For example, you could add more colors, add a variant that includes a call to cudaDeviceSynchronize() to better mark GPU kernels, or add more functions from NVTX. You now have the power to use NVTX in Fortran code to better understand the hot spots in your applications.
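
    As one illustration of the synchronizing variant mentioned above, the following sketch wraps nvtxEndRange in a helper that first waits for outstanding GPU work. The names nvtx_sync and nvtxEndRangeSync are hypothetical and not part of the module in this post. The sketch assumes the nvtx module above is available and that the code is built with a CUDA Fortran compiler, which provides the cudafor module.

    ```fortran
    module nvtx_sync
      use nvtx      ! the nvtx module from this post
      use cudafor   ! CUDA Fortran runtime API
      implicit none
    contains
      ! Hypothetical helper: wait for pending GPU work, then close the range,
      ! so the range covers kernels launched asynchronously inside it
      subroutine nvtxEndRangeSync
        integer :: istat
        istat = cudaDeviceSynchronize()
        call nvtxEndRange
      end subroutine nvtxEndRangeSync
    end module nvtx_sync
    ```

    Keep in mind that the extra synchronization changes the timing of the application, so a variant like this is better suited to investigation than to production runs.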
