On the hardware side, there is the hierarchy (fine to coarse):
- thread
- warp
- thread block
- streaming multiprocessor
All OpenMP and OpenACC parallelism levels are used and mapped onto this hierarchy, i.e.
- OpenMP simd lanes and OpenACC vectors map to threads,
- OpenMP threads ("parallel") and OpenACC workers map to warps,
- OpenMP teams and OpenACC gangs map to CUDA thread blocks.
The used sizes are:
- warp_size is always 32
- the CUDA kernel is launched with dim={#teams,1,1}, blocks={#threads,warp_size,1}.
Additional information can be obtained by setting the environment variable GOMP_DEBUG=1 (very verbose; grep for "kernel.*launch" to see the launch parameters).
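As an illustration of this mapping, the following sketch (hypothetical file name and sizes, not taken from the manual) requests a fixed number of teams and threads; under the mapping above one would expect a launch of roughly dim={8,1,1}, blocks={#threads,32,1} with at most 16 threads, which can be checked in the GOMP_DEBUG=1 output as described above.

    /* teams_map.c - hypothetical example; compile with an nvptx-enabled GCC,
       e.g.:  gcc -fopenmp -foffload=nvptx-none teams_map.c
       Run with:  GOMP_DEBUG=1 ./a.out 2>&1 | grep -i 'kernel.*launch'
       to inspect the actual launch parameters.  */
    #include <stdio.h>

    int main (void)
    {
      int sum = 0;
      /* 8 teams with at most 16 threads each; per the mapping above this should
         correspond to dim={8,1,1} and blocks={#threads,warp_size,1}.  */
      #pragma omp target teams distribute parallel for \
                  num_teams(8) thread_limit(16) reduction(+:sum)
      for (int i = 0; i < 1024; i++)
        sum += i;
      printf ("sum = %d\n", sum);
      return 0;
    }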
GCC generates generic PTX ISA code, which is just-in-time compiled by CUDA; CUDA caches the JIT-compiled code in the user's directory (see the CUDA documentation; this can be tuned via the environment variables CUDA_CACHE_{DISABLE,MAXSIZE,PATH}).
Note: While PTX ISA is generic, the -mptx= and -march= command-line options still affect the generated PTX ISA code and, thus, the requirements on CUDA version and hardware.
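For example, assuming a GCC with nvptx offloading configured (and GCC 12 or later for -foffload-options=), the ISA baseline and the CUDA JIT cache could be influenced roughly as sketched below; the file name and values are illustrative assumptions, not the manual's example.

    /* isa_cache.c - hypothetical example.
       Possible compilation, selecting an sm_35 baseline for the offload compiler:
         gcc -fopenmp -foffload=nvptx-none \
             -foffload-options=nvptx-none=-march=sm_35 isa_cache.c
       The generated PTX is JIT-compiled by CUDA at run time; the JIT cache can be
       tuned with the environment variables mentioned above, e.g.:
         CUDA_CACHE_DISABLE=1 ./a.out      (bypass the on-disk cache)
         CUDA_CACHE_PATH=/tmp/jit ./a.out  (use a different cache directory)  */
    #include <stdio.h>
    #include <omp.h>

    int main (void)
    {
      int on_device = 0;
      /* Trivial target region, just to force PTX generation and JIT compilation.  */
      #pragma omp target map(from: on_device)
      on_device = !omp_is_initial_device ();
      printf ("ran on offload device: %d\n", on_device);
      return 0;
    }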
Implementation remarks:
- I/O within offloaded regions is supported using the C library printf functions. Note that the Fortran print/write statements are not supported, yet. (A small sketch follows this list.)
- Code using requires reverse_offload requires at least -march=sm_35; compiling for -march=sm_30 is not supported.
- For code containing reverse offload (i.e. target regions with device(ancestor:1)), there is a slight performance penalty for all target regions, consisting mostly of shutdown delay. (A sketch of a reverse-offload region follows this list.)
- Per device, reverse offload regions are processed serially, such that the next reverse offload region is only executed after the previous one has returned.
- A requires directive with unified_address or unified_shared_memory will remove any nvptx device from the list of available devices ("host fallback"). (See the last sketch after this list.)
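As an illustration of the printf remark, a minimal sketch (hypothetical file name) of C I/O inside a target region:

    /* printf_target.c - hypothetical example of device-side printf.  */
    #include <stdio.h>

    int main (void)
    {
      #pragma omp target
      {
        /* C library printf is supported inside offloaded regions;
           Fortran print/write would not work here yet, as noted above.  */
        printf ("hello from the offload device\n");
      }
      return 0;
    }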
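For the reverse-offload remarks, a sketch under the stated constraint of at least -march=sm_35 (file name and flags below are illustrative assumptions, not the manual's example):

    /* reverse_offload.c - hypothetical sketch of a reverse-offload region.
       Possible compilation:
         gcc -fopenmp -foffload=nvptx-none \
             -foffload-options=nvptx-none=-march=sm_35 reverse_offload.c  */
    #include <stdio.h>

    #pragma omp requires reverse_offload

    int main (void)
    {
      #pragma omp target  /* outer region runs on the nvptx device, if available */
      {
        for (int i = 0; i < 3; i++)
          {
            /* Reverse offload: this inner region is executed back on the host.
               Per the remark above, reverse offload regions are processed
               serially per device; the next one only runs after the previous
               one has returned.  */
            #pragma omp target device (ancestor: 1) firstprivate (i)
            printf ("host-side work for iteration %d\n", i);
          }
      }
      return 0;
    }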
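Finally, for the last remark, a sketch of how the host fallback could be observed (hypothetical file name; the reported count depends on the system and GCC version):

    /* usm_fallback.c - hypothetical illustration of the host-fallback behaviour.  */
    #include <stdio.h>
    #include <omp.h>

    #pragma omp requires unified_shared_memory

    int main (void)
    {
      /* Per the remark above, the requires directive removes nvptx devices from
         the list of available devices, so 0 is expected here on such systems;
         target regions then execute on the host instead.  */
      printf ("available non-host devices: %d\n", omp_get_num_devices ());
      return 0;
    }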