
    An OpenACC Example (Part 2)

You may want to read the more recent post Getting Started with OpenACC by Jeff Larkin.

In my previous post I added 3 lines of OpenACC directives to a Jacobi iteration code, achieving more than 2x speedup by running it on a GPU. In this post I’ll continue where I left off and demonstrate how we can use OpenACC directive clauses to take more explicit control over how the compiler parallelizes our code. This will provide us with significantly higher speedup.

    Example Source Code

You can browse and download all of the source code for the examples in this post from the Parallel Forall GitHub repository. The directory for this post has subdirectories for each “step” in this post (and the next one) with their own source and Makefiles, so you can see and try out the changes in each step (step 1, step 2, step 3). The examples use OpenACC 1.0 syntax, so you’ll need a compiler that supports it. I’m using the PGI compiler version 12.2 with preliminary support for OpenACC.

    Let’s pick up where we left off last week.

    STEP 3: Tuning Parallelization Configuration

To refresh, this is what our C code looks like after step 2 from the last post.

#pragma acc data copy(A, Anew)
while ( error > tol && iter < iter_max )
{
    error = 0.f;

#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels
    for( int j = 1; j < n-1; j++)
    {
        for( int i = 1; i < m-1; i++ )
        {
            Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                                 + A[j-1][i] + A[j+1][i]);
            error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));
        }
    }

#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels
    for( int j = 1; j < n-1; j++)
    {
        for( int i = 1; i < m-1; i++ )
        {
            A[j][i] = Anew[j][i];
        }
    }

    if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);

    iter++;
}

double runtime = GetTimer();

printf(" total: %f s\n", runtime / 1000.f);
}

    Here’s the Fortran code.

!$acc data copy(A, Anew)
do while ( error .gt. tol .and. iter .lt. iter_max )
    error = 0.0_fp_kind

!$omp parallel do shared(m, n, Anew, A) reduction( max:error )
!$acc kernels
    do j=1,m-2
        do i=1,n-2
            Anew(i,j) = 0.25_fp_kind * ( A(i+1,j  ) + A(i-1,j  ) + &
                                         A(i  ,j-1) + A(i  ,j+1) )
            error = max( error, abs(Anew(i,j)-A(i,j)) )
        end do
    end do
!$acc end kernels
!$omp end parallel do

    if(mod(iter,100).eq.0) write(*,'(i5,f10.6)') iter, error
    iter = iter + 1

!$omp parallel do shared(m, n, Anew, A)
!$acc kernels
    do j=1,m-2
        do i=1,n-2
            A(i,j) = Anew(i,j)
        end do
    end do
!$acc end kernels
!$omp end parallel do

end do
!$acc end data

To begin, let’s run the C code from step 2 on the GPU again, this time enabling -ta=nvidia,time. We see output like the following from the PGI compiler.

    total: 8.940413 s
    
    Accelerator Kernel Timing data
    /home/mharris/src/parallel-forall/code-samples/posts/002-openacc-example/step2/laplace2d.c
      main
        92: region entered 1000 times
      time(us): total=2336562 init=298 region=2336264
    kernels=2295838 data=0
      w/o init: total=2336264 max=2441 min=2318 avg=2336
      95: kernel launched 1000 times
    grid: [256x256]  block: [16x16]
    time(us): total=2295838 max=2301 min=2293 avg=2295
    /home/mharris/src/parallel-forall/code-samples/posts/002-openacc-example/step2/laplace2d.c
      main
        80: region entered 1000 times
      time(us): total=6489988 init=190 region=6489798
    kernels=6184945 data=0
      w/o init: total=6489798 max=6960 min=6397 avg=6489
      83: kernel launched 1000 times
    grid: [256x256]  block: [16x16]
    time(us): total=6072138 max=6186 min=6012 avg=6072
      87: kernel launched 1000 times
    grid: [1]  block: [256]
    time(us): total=112807 max=221 min=111 avg=112
    /home/mharris/src/parallel-forall/code-samples/posts/002-openacc-example/step2/laplace2d.c
      main
        74: region entered 1 time
      time(us): total=8936004
    data=96980
    acc_init.c
      acc_init
        29: region entered 1 time
      time(us): init=82801

    We can see 3 regions. The first one is the “memcpy” loop nest starting on line 92, which takes about 2.3s out of 8.9s total. The second one is the main computation loop region starting on line 80, which takes about 6.5s. The third region mentioned is the enclosing while loop data region starting on line 74. We can see that this region takes 8.9s, which is good—that means there is no other part of the program that takes significant time. If we look at the main loop nests, we can see these lines:

grid: [256x256]  block: [16x16]

The terms grid and block come from the CUDA programming model. The GPU executes groups of threads called thread blocks. To execute a kernel, the application launches a grid of these thread blocks. Each block runs on one of the GPU’s multiprocessors and is assigned a certain range of IDs that it uses to address a unique data range. In this case our thread blocks have 256 threads each, arranged in 16×16 two-dimensional blocks. The grid is also 2D, 256 blocks wide and 256 blocks tall. This is just enough to cover our 4096×4096 grid. But we don’t really need that many blocks—if we tell the compiler to launch fewer, it will automatically generate a sequential loop over data blocks within the kernel code run by each thread.
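As a quick sanity check of those numbers, here is a small standalone sketch (not part of the example code) that reproduces the launch-configuration arithmetic: covering the 4096×4096 problem with 16×16 thread blocks requires a 256×256 grid.

#include <stdio.h>

int main(void)
{
    /* Problem size used throughout this post and the default block
       shape the compiler chose for these kernels. */
    const int n = 4096, m = 4096;
    const int block_x = 16, block_y = 16;

    /* Ceiling division: enough blocks to cover every element. */
    int grid_x = (m + block_x - 1) / block_x;
    int grid_y = (n + block_y - 1) / block_y;

    /* Prints "grid: [256x256]  block: [16x16]", matching the profiler output above. */
    printf("grid: [%dx%d]  block: [%dx%d]\n", grid_x, grid_y, block_x, block_y);
    return 0;
}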

I have tried this and found that it is more efficient for this kernel. Also, blocks 32 threads wide execute this code faster than 16-wide blocks, because the GPU processes instructions 32 threads at a time (a group of 32 threads is called a warp in the CUDA architecture). To specify these dimensions, let’s modify the code for the main computational loop nest as follows.

#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels loop gang(32), vector(16)
    for( int j = 1; j < n-1; j++) {
#pragma acc loop gang(16), vector(32)
        for( int i = 1; i < m-1; i++ )

The gang(32) clause on the outer loop tells the compiler to launch 32 blocks in the Y (row) direction. The gang(16) clause on the inner loop tells it to launch 16 blocks in the X (column) direction. The vector(16) clause on the outer loop tells the compiler to use blocks that are 16 threads tall, thus processing the loop iterations in SIMD groups of 16. Finally, the vector(32) clause on the inner loop tells the compiler to use blocks that are 32 threads wide (one warp wide). I also found that making a similar change to the copy loop nest benefits performance a small amount. In this case I found it best to only apply the gang and vector clauses to the inner loop, to ensure the blocks used are as wide as a warp. Here are the changes to the C code.

#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels loop
    for( int j = 1; j < n-1; j++) {
#pragma acc loop gang(16), vector(32)
        for( int i = 1; i < m-1; i++ )
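For reference, this is roughly what the complete copy loop nest looks like once these clauses are folded back into the step 2 code above (assembled from the snippets in this post; the version in the step 3 directory of the repository should look essentially the same).

#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels loop
    for( int j = 1; j < n-1; j++) {
#pragma acc loop gang(16), vector(32)
        for( int i = 1; i < m-1; i++ )
        {
            A[j][i] = Anew[j][i];
        }
    }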

    The changes to the Fortran code are very similar. Here is the first loop nest.

!$omp parallel do shared(m, n, Anew, A) reduction( max:error )
!$acc kernels loop gang(32), vector(16)
    do j=1,m-2
!$acc loop gang(16), vector(32)
        do i=1,n-2
            Anew(i,j) = 0.25_fp_kind * ( A(i+1,j  ) + A(i-1,j  ) + &
                                         A(i  ,j-1) + A(i  ,j+1) )
            error = max( error, abs(Anew(i,j)-A(i,j)) )
        end do
!$acc end loop
    end do
!$acc end kernels
!$omp end parallel do

    And here is the second Fortran loop nest.

!$omp parallel do shared(m, n, Anew, A)
!$acc kernels loop
    do j=1,m-2
!$acc loop gang(16), vector(32)
        do i=1,n-2
            A(i,j) = Anew(i,j)
        end do
!$acc end loop
    end do
!$acc end kernels
!$omp end parallel do

While we’re at it, let’s make one other small change. We don’t need to copy Anew to and from the GPU, since it is only accessed on the GPU. We can modify the data region directive to specify that Anew is allocated only on the GPU, using the create clause on the data directive.

    #pragma acc data copy(A), create(Anew)
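The Fortran data directive changes in the same way; a sketch of the corresponding line (not shown in the original listings above) would be:

!$acc data copy(A) create(Anew)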

    Here is the performance after these changes.

                  Execution Time (s)   Speedup vs. 1 CPU thread   Speedup vs. 4 CPU threads
    CPU 1 thread        34.14                   --                          --
    CPU 4 threads       21.16                   1.61x                       1.0x
    GPU                  5.32                   6.42x                       3.98x

    We’ve increased performance of this application by 4x (comparing CPU “socket” to GPU “socket”) by adding just a few lines of compiler directives!

    Wrapping Up

OpenACC is a promising new standard that I think will be a powerful tool for high-productivity acceleration of existing C and Fortran code. It will also be very valuable for prototyping; programmers can use it to determine if it is worthwhile to spend more time writing lower-level CUDA code for GPUs, for example. I hope you have found this overview interesting—please let me know what you think in the comments! I’ll revisit OpenACC in a future post, possibly to discuss some of the challenges you might face in applying OpenACC to more complicated codes, or to provide some guidelines on getting the best results.

If you are interested in trying OpenACC, you can download a free trial of PGI Accelerator. The OpenACC version 1.0 specification is available here. A reference card and guides are available here.
