12.1 AMD Radeon (GCN)
On the hardware side, the hierarchy is as follows (fine to coarse):
- work item (thread)
- wavefront
- work group
- compute unit (CU)
All OpenMP and OpenACC levels are used, i.e.
- OpenMP’s simd and OpenACC’s vector map to work items (threads)
- OpenMP’s threads (“parallel”) and OpenACC’s workers map to wavefronts
- OpenMP’s teams and OpenACC’s gangs use a thread pool with the size of the number of teams or gangs, respectively.
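As a minimal sketch of how the three levels can appear together in source code, the following C function uses a single combined OpenMP construct. The function name sum_array is hypothetical and chosen only for illustration; under the mapping above, teams would correspond to the thread pool, parallel for to wavefronts, and simd to work items. If the code is compiled without OpenMP support, the pragma is ignored and the loop runs serially on the host with the same result.

```c
#include <stddef.h>

/* Hypothetical example: sums a[0..n-1].
   teams        -> thread pool sized by the number of teams
   parallel for -> wavefronts
   simd         -> work items */
double sum_array(const double *a, size_t n)
{
    double sum = 0.0;

    #pragma omp target teams distribute parallel for simd \
            map(to: a[0:n]) map(tofrom: sum) reduction(+: sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i];

    return sum;
}
```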
The sizes used are:
- Number of teams is the specified num_teams (OpenMP) or num_gangs (OpenACC) or otherwise the number of CU. It is limited by two times the number of CU.
- Number of wavefronts is 4 for gfx900 and 16 otherwise; num_threads (OpenMP) and num_workers (OpenACC) override this if smaller.
- The wavefront has 102 scalars and 64 vectors
- Number of work items is always 64
- The hardware permits at most 40 workgroups/CU and 16 wavefronts/workgroup, up to a limit of 40 wavefronts in total per CU.
- 80 scalar registers and 24 vector registers in non-kernel functions (the chosen procedure-calling API).
- For the kernel itself: as many as register pressure demands (number of
teams and number of threads, scaled down if registers are exhausted)
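The team and wavefront rules above can be restated as a small calculation. The helpers below are hypothetical and are not part of libgomp; they only mirror the stated defaults and limits, under the assumption that a requested value of 0 or less means that no num_teams, num_gangs, num_threads or num_workers clause was given. Register-pressure scaling of the kernel is not modeled.

```c
/* Hypothetical helpers, not libgomp functions: they restate the
   sizing rules from the list above. */

/* Teams: the requested count, else the number of CUs,
   limited to two times the number of CUs. */
int gcn_team_count(int requested, int num_cu)
{
    int teams = requested > 0 ? requested : num_cu;
    int limit = 2 * num_cu;
    return teams > limit ? limit : teams;
}

/* Wavefronts: 4 on gfx900 and 16 otherwise; a requested count
   only takes effect if it is smaller than that default. */
int gcn_wavefront_count(int requested, int is_gfx900)
{
    int dflt = is_gfx900 ? 4 : 16;
    return (requested > 0 && requested < dflt) ? requested : dflt;
}

/* Work items per workgroup: always 64 per wavefront. */
int gcn_work_items(int wavefronts)
{
    return wavefronts * 64;
}
```

For example, on a device with 60 CUs, a request for 200 teams would be limited to 120, and a request for 32 wavefronts on a non-gfx900 device would leave the default of 16 in place.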
Implementation remarks:
- I/O within OpenMP target regions and OpenACC parallel/kernels is supported using the C library printf functions and the Fortran print/write statements.
- Reverse offload regions (i.e. target regions with device(ancestor:1)) are processed serially per target region, such that the next reverse offload region is only executed after the previous one has returned.
- OpenMP code that has a requires directive with unified_address or unified_shared_memory will remove any GCN device from the list of available devices (“host fallback”).
- The available stack size can be changed using the GCN_STACK_SIZE environment variable; the default is 32 kiB per thread.
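To illustrate the I/O remark, here is a minimal sketch of a printf call inside an OpenMP target region. The function name report_from_target is hypothetical. If no GCN device is available, or the code is built without offloading, the region runs on the host and the output and return value are the same.

```c
#include <stdio.h>

/* Hypothetical example: doubles n inside a target region and
   prints the result from within that region. */
int report_from_target(int n)
{
    int result = 0;

    #pragma omp target map(tofrom: result)
    {
        result = 2 * n;
        printf("inside target region: result = %d\n", result);
    }

    return result;
}
```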