

3.3 OpenMP Threading

Availability: ncbo, ncea, ncecat, ncflint, ncpdq, ncra, ncrcat, ncwa
Short options: ‘-t’
Long options: ‘--thr_nbr’, ‘--threads’, ‘--omp_num_threads’
NCO supports shared memory parallelism (SMP) when compiled with an OpenMP-enabled compiler. Thread requests and allocations occur in two stages. First, users may request a specific number of threads thr_nbr with the ‘-t’ switch (or its long option equivalents, ‘--thr_nbr’, ‘--threads’, and ‘--omp_num_threads’). If the user does not specify thr_nbr, OpenMP obtains it from the OMP_NUM_THREADS environment variable, if present, or from the OS, if not.
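The precedence described above can be sketched in a few lines of Python. This is an illustration only, not NCO's own C code: the function name resolve_thr_nbr and its parameters are hypothetical, and os.cpu_count() merely stands in for whatever default the OpenMP runtime obtains from the OS.

```python
import os

def resolve_thr_nbr(user_thr_nbr=None, environ=None):
    """Return the thread count requested before any NCO-internal limits apply.

    Hypothetical helper: mirrors the documented precedence, not NCO's source.
    """
    environ = os.environ if environ is None else environ

    # 1. An explicit '-t' (or long-option) request takes precedence.
    if user_thr_nbr is not None:
        return user_thr_nbr

    # 2. Otherwise use OMP_NUM_THREADS if it is set. OpenMP allows a
    #    comma-separated list for nested parallelism; take the first entry.
    env_value = environ.get("OMP_NUM_THREADS")
    if env_value:
        return int(env_value.split(",")[0].strip())

    # 3. Otherwise fall back to the OS (assumption: the CPU count stands in
    #    for the OpenMP runtime default).
    return os.cpu_count() or 1
```

For instance, resolve_thr_nbr(3, {"OMP_NUM_THREADS": "8"}) returns the explicit request, while resolve_thr_nbr(None, {"OMP_NUM_THREADS": "8"}) falls back to the environment variable.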

NCO may modify thr_nbr according to its own internal settings before it requests any threads from the system. Certain operators contain hard-coded limits on the number of threads they request. We base these limits on experience and common sense, and intend them to reduce potentially wasteful system usage by inexperienced users. For example, ncrcat is extremely I/O-intensive, so we restrict thr_nbr <= 2 for ncrcat. The reasoning is that the best performance one can expect from an operator that does no arithmetic is one thread reading while another thread writes. In the future (perhaps with netCDF4), we hope to demonstrate significant threading improvements with operators like ncrcat by performing multiple simultaneous writes.

Compute-intensive operators (ncwa and ncpdq) are expected to benefit the most from threading. The greatest increases in throughput due to threading will occur on large datasets where each thread performs millions or more floating point operations. Otherwise, the system overhead of setting up threads may outweigh the theoretical speed enhancements due to SMP parallelism. However, we have not yet demonstrated that the SMP parallelism scales well beyond four threads for these operators. Hence we restrict thr_nbr <= 4 for all operators. We encourage users to experiment with these limits (edit file nco_omp.c) and send us their feedback.
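The limits quoted above (thr_nbr <= 2 for ncrcat, thr_nbr <= 4 for all operators) amount to a simple clamp on the requested value. The following Python sketch shows that arithmetic under stated assumptions: the names GLOBAL_THR_MAX, OPERATOR_THR_MAX, and clamp_thr_nbr are hypothetical and do not correspond to identifiers in nco_omp.c, and the lower bound of one thread is an assumption of the sketch.

```python
# Limits taken from the text; the table and function names are hypothetical.
GLOBAL_THR_MAX = 4                   # thr_nbr <= 4 for all operators
OPERATOR_THR_MAX = {"ncrcat": 2}     # thr_nbr <= 2 for the I/O-bound ncrcat

def clamp_thr_nbr(operator, thr_nbr):
    """Apply the operator-specific and global limits to a requested thread count."""
    limit = min(GLOBAL_THR_MAX, OPERATOR_THR_MAX.get(operator, GLOBAL_THR_MAX))
    # Assumption: never request fewer than one thread.
    return max(1, min(thr_nbr, limit))
```

Under these assumptions a request for eight threads becomes two for ncrcat and four for ncwa, while a request for three threads to ncwa passes through unchanged.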

Once the initial thr_nbr has been modified for any operator-specific limits, NCO asks the system to allocate a team of thr_nbr threads for the body of the code. The operating system then decides how many threads to allocate based on this request. Users may keep track of this information by running the operator with dbg_lvl > 0.

By default, threaded operators attach one global attribute to any file they create or modify. The nco_openmp_thread_number global attribute contains the number of threads the operator used to process the input files. This information helps to verify that the answers from threaded and non-threaded operators are equal to within machine precision. This information is also useful for benchmarking.
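One reason threaded and non-threaded answers are compared "to within machine precision" rather than bit-for-bit is that floating point addition is not associative, so combining per-thread partial sums can round differently from a single serial sum. The Python sketch below imitates that effect without real threads. The helper chunked_sum, the sample data, and the tolerance 1e-9 are all illustrative assumptions, not part of NCO.

```python
import math

def chunked_sum(values, thr_nbr):
    """Sum values as thr_nbr partial sums (one per imagined thread), then combine."""
    chunk = max(1, math.ceil(len(values) / thr_nbr))
    partials = [sum(values[i:i + chunk]) for i in range(0, len(values), chunk)]
    return sum(partials)

# Illustrative data only.
values = [0.1 * i for i in range(1, 100001)]

serial = sum(values)                 # single-threaded order of operations
threaded = chunked_sum(values, 4)    # order imposed by four partial sums

# The two results may differ in their last few bits, so compare with a
# relative tolerance instead of testing for exact equality.
agree = math.isclose(serial, threaded, rel_tol=1e-9)
```

A comparison of this kind, made on files whose nco_openmp_thread_number attributes differ, is the sort of check the attribute is meant to support.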