OpenMP Memo

Features

  • shared-memory MIMD (Multiple Instruction, Multiple Data)
  • needs compiler support
  • simple to code (small changes to a serial program)

Terminology

  • Directive-based shared-memory programming: multiple processing units access a common memory space, coordinated through compiler directives. OpenMP uses pragmas (compiler hints): if pragmas are supported, the code is parallelized; otherwise it still runs serially.

Header APIs

# pragma omp parallel num_threads(thread_count)
  Hello();
// You do NOT have to specify the number of threads here;
// if num_threads is omitted, the runtime typically creates one thread per core.

// These functions come from the omp.h header file
int my_rank = omp_get_thread_num();
int thread_count = omp_get_num_threads();
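
Putting these pieces together, a minimal complete program (the Hello body and the command-line handling are my sketch; compile with gcc -fopenmp):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void Hello(void) {
   int my_rank = omp_get_thread_num();
   int thread_count = omp_get_num_threads();
   printf("Hello from thread %d of %d\n", my_rank, thread_count);
}

int main(int argc, char* argv[]) {
   int thread_count = strtol(argv[1], NULL, 10);  // thread count from the command line
#  pragma omp parallel num_threads(thread_count)
   Hello();
   return 0;
}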

Critical

To avoid a race condition when aggregating values, we need the critical directive:

# pragma omp critical
  global_result += my_result;
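
A self-contained sketch of the pattern (the per-thread "work" is a placeholder of mine):

#include <stdio.h>
#include <omp.h>

int main(void) {
	double global_result = 0.0;
#   pragma omp parallel num_threads(4)
	{
		double my_result = omp_get_thread_num() + 1.0;  // placeholder per-thread work
#       pragma omp critical
		global_result += my_result;  // one thread at a time
	}
	printf("global_result = %f\n", global_result);  // 1+2+3+4 = 10
	return 0;
}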

Reduction clause

Add a reduction clause to the parallel directive. Syntax of the reduction clause: reduction(<operator>: <variable list>)

	global_result = 0.0;
#   pragma omp parallel num_threads(thread_count) \
		reduction(+: global_result)
	global_result += local_trap(a, b, n);

This is the improved version of the following two approaches:

// wrong one: the critical section covers all of the work, so the runtime is effectively serialized.
	global_result = 0.0;
#   pragma omp parallel num_threads(thread_count)
	{
#      pragma omp critical
		global_result += local_trap(a, b, n);
	}

// trivial one: accumulate into a per-thread variable, then combine in a short critical section.

	global_result = 0.0;
#   pragma omp parallel num_threads(thread_count)
	{
		double my_result = 0.0;
		my_result += local_trap(a, b, n);
#       pragma omp critical
		global_result += my_result;
	}

Subtraction

Subtraction is not associative, so OpenMP treats reduction(-: var) as a special case: each thread subtracts into a private copy initialized to 0, and the partial results are then combined so the result matches the serial loop.
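
A small sketch of what that means in practice (the values are my own):

	long total = 100;
#   pragma omp parallel for num_threads(4) reduction(-: total)
	for (int i = 1; i <= 10; i++)
		total -= i;
	// each thread subtracts into a private copy initialized to 0;
	// the partials are folded back in additively, so total == 100 - 55 == 45,
	// the same as the serial loop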

Floating point

Floating-point results can differ from run to run (and from the serial result) under reduction, because the reduction reorders the additions and floating-point addition is not associative.
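
A quick demonstration of the underlying issue (the values are my own):

#include <stdio.h>

int main(void) {
	float a = 1.0e20f, b = -1.0e20f, c = 1.0f;
	printf("(a + b) + c = %f\n", (a + b) + c);  // 1.000000
	printf("a + (b + c) = %f\n", a + (b + c));  // 0.000000: c is absorbed by b
	return 0;
}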

private clause

Makes a variable private: each thread gets its own copy.

#   pragma omp parallel for num_threads(thread_count) \
        reduction(+:sum) private(factor)

The value of a private variable is unspecified on entry to the block and after the block ends.

#   pragma omp parallel num_threads(thread_count) private(x)
	{
		int my_rank = omp_get_thread_num();
		// unspecified: x has not been initialized in this thread
		printf("Thread %d > before initialization, x = %d\n", my_rank, x);
		x = 2 * my_rank + 2;
		// ok
		printf("Thread %d > after initialization, x = %d\n", my_rank, x);
	}
	// unspecified: x is private, and we print it after the parallel block
	printf("After parallel block, x = %d\n", x);

For directive

Features

  • must be nested inside a parallel directive (or use the combined parallel for)
  • the threads created by the parallel directive split the loop iterations among themselves
  • the loop variable i is not shared; each thread gets its own copy.
    #pragma omp parallel num_threads(4)  // Create 4 threads
    {
        int thread_id = omp_get_thread_num();

        #pragma omp for  // Split loop iterations among the 4 threads
        for (int i = 0; i < 8; i++) {
            printf("Thread %d processes iteration %d\n", thread_id, i);
        }
    }
	>>> one possible output (the actual interleaving may vary):
	Thread 0 processes iteration 0  
	Thread 0 processes iteration 1  
	Thread 1 processes iteration 2  
	Thread 1 processes iteration 3  
	Thread 2 processes iteration 4  
	Thread 2 processes iteration 5  
	Thread 3 processes iteration 6  
	Thread 3 processes iteration 7  


// example 2
double Trap(double a, double b, int n, int thread_count) {
   double  h, approx;
   int  i;
   // i is declared outside the scope of the parallel directive,
   // but the parallel for construct makes the loop variable private
   // automatically, which is required for correctness.

   h = (b-a)/n; 
   approx = (f(a) + f(b))/2.0; 
#  pragma omp parallel for num_threads(thread_count) \
      reduction(+: approx)
   for (i = 1; i <= n-1; i++)
     approx += f(a + i*h);
   approx = h*approx; 

   return approx;
}  /* Trap */
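
A minimal driver for Trap (the integrand f and the arguments are my assumptions):

#include <stdio.h>

double f(double x) { return x * x; }   // assumed integrand; declare before Trap

double Trap(double a, double b, int n, int thread_count);   // defined above

int main(void) {
   printf("approx = %f\n", Trap(0.0, 3.0, 1024, 4));  // integral of x^2 on [0,3] is 9
   return 0;
}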

A short summary of parallel and for

parallel

  • When placed before a block, it creates a team of threads.
  • Every thread executes the entire block (redundantly, unless work is split explicitly).
#pragma omp parallel  // All threads run this block
{
	printf("Hello from thread %d\n", omp_get_thread_num());
}
// or
#  pragma omp parallel num_threads(thread_count) 
   Trap(a, b, n, &global_result);

for

  • Must be nested inside a parallel block (or combined with parallel for).
  • Splits loop iterations across threads (work-sharing).
#pragma omp parallel  // Team of threads
{
	#pragma omp for  // Loop iterations divided
	for (int i = 0; i < 10; i++) {
		printf("%d by thread %d\n", i, omp_get_thread_num());
	}
}

#pragma omp parallel for  // Team of threads + split loop
for (int i = 0; i < 10; i++) {
	printf("%d by thread %d\n", i, omp_get_thread_num());

Alert

  • while and do-while loops will not be parallelized
  • only for loops whose number of iterations can be determined in advance are handled
  • the loop body must have no exit points such as return or break
  • only loops in canonical form are handled: for (index = start; index <, <=, >=, or > end; index++, ++index, index--, index += incr, ...)
  • the index may only be modified by the increment expression in the for statement (these restrictions allow the run-time system to determine the number of iterations prior to execution of the loop)
  • iterations must not depend on one another, as in fibo[i] = fibo[i-1] + fibo[i-2]; (see the sketch after this list)
  • OpenMP compilers do not check for dependences among iterations of a loop
  • a loop in which the results of one or more iterations depend on other iterations cannot, in general, be correctly parallelized by OpenMP without features like the tasking API
  • a dependence like the one in fibo is called a loop-carried dependence
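
A sketch of the fibo caveat (my own demo): this compiles and runs without complaint, but because of the loop-carried dependence the parallel result is almost certainly wrong.

#include <stdio.h>

int main(void) {
	long fibo[20] = {0};
	fibo[0] = fibo[1] = 1;
#   pragma omp parallel for num_threads(4)
	for (int i = 2; i < 20; i++)
		fibo[i] = fibo[i-1] + fibo[i-2];  // loop-carried dependence: not checked by OpenMP
	printf("fibo[19] = %ld (correct serial value: 6765)\n", fibo[19]);
	return 0;
}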

Default(none) clause

If you add this clause, you must explicitly specify the scope of every variable used in the block, as in the sketch below.
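
A sketch of what the compiler then forces you to write (the Leibniz-series loop is my own example; the loop variable k is automatically private):

#include <stdio.h>

int main(void) {
	int n = 1000, k;
	double factor, sum = 0.0;
#   pragma omp parallel for num_threads(4) default(none) \
		reduction(+: sum) private(factor) shared(n)
	for (k = 0; k < n; k++) {
		factor = (k % 2 == 0) ? 1.0 : -1.0;
		sum += factor / (2*k + 1);  // Leibniz series for pi/4
	}
	printf("pi ~ %f\n", 4.0 * sum);
	return 0;
}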

Odd-even transposition sort

We can reuse the threads created by a single parallel directive across all phases:

#   pragma omp parallel num_threads(thread_count) \
		default(none) shared(a, n) private(i, tmp, phase)
	for (phase = 0; phase < n; phase++) {
		if (phase % 2 == 0) {
#			pragma omp for
			for (i = 1; i < n; i += 2) {
				if (a[i-1] > a[i]) {
					// swap
					tmp = a[i-1]; a[i-1] = a[i]; a[i] = tmp;
				}
			}
			// implicit barrier here, at the end of omp for
		} else {
#			pragma omp for
			for (i = 1; i < n - 1; i += 2) {
				if (a[i] > a[i+1]) {
					// swap
					tmp = a[i]; a[i] = a[i+1]; a[i+1] = tmp;
				}
			}
		}
	}

Schedule clause

schedule(<type>[, <chunksize>])

When the workload varies with the loop index, we can balance it with a cyclic schedule, i.e. schedule(static, 1):

sum = 0.0;
#  pragma omp parallel for num_threads(thread_count) \
       reduction(+:sum) schedule(static, 1) 
	for(i = 0; i <= n; i++)
		sum += f(i);
  • static. iterations are assigned to threads before the loop is executed
  • dynamic or guided. iterations are assigned while the loop executes, so after a thread completes its current set of iterations it can request more
  • auto. the compiler or run-time system determines the schedule
  • runtime. the schedule is determined at run time based on an environment variable

Comparison

  • static is good when each iteration takes roughly the same amount of time; it can also improve the speed of memory accesses (better locality)
  • in a guided schedule, the size of new chunks decreases as chunks are completed; good when later iterations are more compute-intensive

runtime

Takes the schedule type (static, dynamic, or guided) from the OMP_SCHEDULE environment variable.
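
A sketch of the same loop as above with the schedule deferred to run time:

	// set the environment variable before running, e.g.
	//   export OMP_SCHEDULE="dynamic,2"
#   pragma omp parallel for num_threads(thread_count) \
		reduction(+:sum) schedule(runtime)
	for (i = 0; i <= n; i++)
		sum += f(i);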

Atomic Directive

#   pragma omp atomic
	x += expression;

The update can use many operators: +, -, *, /, &, |, ^, <<, >>, as well as ++ and --. Note that only the update of x itself is protected; the evaluation of expression is not.
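
A self-contained sketch (the counter example is mine):

#include <stdio.h>

int main(void) {
	int hits = 0;
#   pragma omp parallel num_threads(4)
	{
		for (int i = 0; i < 100000; i++) {
#           pragma omp atomic
			hits++;  // only this single update is protected
		}
	}
	printf("hits = %d\n", hits);  // always 400000
	return 0;
}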

Comparison between critical, lock and atomic

  • atomic may be implemented like a single unnamed critical section, s.t. unrelated updates can block each other at the same time.
  • all unnamed critical directives share one mutual exclusion, so watch the dependencies between the different protected parts: code that could sometimes safely run in parallel is serialized anyway.
  • named critical sections can proceed independently, but the names must be written explicitly at compile time on every critical directive; when the protected object is only known at run time, use locks instead. Example code:
#pragma omp critical(phase1)
{/*code A*/}

#pragma omp critical(phase2)
{/*code B*/}


omp_set_lock(&q_p->lock);      // acquire the lock; other threads block here
Enqueue(q_p, my_rank, mesg);   // critical section
omp_unset_lock(&q_p->lock);    // release the lock
// also needed: omp_init_lock(&q_p->lock) before first use,
// and omp_destroy_lock(&q_p->lock) when done
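
A self-contained sketch of the lock API (q_p and Enqueue above belong to the textbook's message-passing example; this counter version is mine):

#include <stdio.h>
#include <omp.h>

int main(void) {
	omp_lock_t lock;
	int count = 0;
	omp_init_lock(&lock);             // create the lock
#   pragma omp parallel num_threads(4)
	{
		omp_set_lock(&lock);          // acquire: other threads block here
		count++;                      // critical section
		omp_unset_lock(&lock);        // release
	}
	omp_destroy_lock(&lock);          // free the lock
	printf("count = %d\n", count);    // 4
	return 0;
}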

Alert

  • Mixing different types of mutual exclusion for a single critical section (e.g., atomic in one place and critical in another) is bad: they are not guaranteed to exclude each other, so the protection can silently fail.
  • No fairness is guaranteed in mutual exclusion: a thread can wait a very long time before it gets access.
  • Nesting one mutual-exclusion construct inside another can lead to deadlock.

Trapezoidal rule

The approximation of the integral of f from a to b: $h[f(x_0)/2 + f(x_1) + \cdots + f(x_{n-1}) + f(x_n)/2]$, where $h = (b-a)/n$ and $x_i = a + ih$.

h = (b-a)/n;
approx = (f(a) + f(b))/2.0;
for (i = 1; i <= n - 1; i++) {
	x_i = a + i * h;
	approx += f(x_i);
}
approx = h * approx;
// OpenMP version: inside the parallel block, each thread computes its own subinterval
h = (b-a)/n;
local_n = n / thread_count;
local_a = a + my_rank * local_n * h;
local_b = local_a + local_n * h;
approx = (f(local_a) + f(local_b)) / 2.0;
for (i = 1; i <= local_n - 1; i++) {
	x = local_a + i * h;
	approx += f(x);
}
approx = approx * h;
#   pragma omp critical
	global_approx += approx;

Details

  • There’s an implicit barrier at the end of a parallel block: threads in a team wait for the other members of the team to complete before they terminate.
  • To check whether OpenMP is supported, guard the include:
    #ifdef _OPENMP
    #    include <omp.h>
    #endif
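
    The same guard can also supply serial fallbacks so the file compiles without OpenMP (the macro trick is a common idiom; my sketch):

    #ifdef _OPENMP
    #    include <omp.h>
    #else
    #    define omp_get_thread_num()  0   /* serial fallback */
    #    define omp_get_num_threads() 1
    #endif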
    

Some formula

Speedup $S = \frac{T_{serial}}{T_{parallel}}$; Efficiency $E = \frac{S}{p}$, where $p$ is the number of threads/cores.
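
Worked example (the numbers are mine): with $T_{serial} = 20$ s and $T_{parallel} = 5$ s on $p = 8$ threads, $S = 20/5 = 4$ and $E = 4/8 = 0.5$, i.e., each thread does useful work only half the time.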

TODO

  • default clause
  • bubble sort
  • odd-even transposition sort
  • scheduling loops