TP Parallel Architectures, PRCD, 2015-2016

Latency of instructions and number of pipelines

  1. Download the mbench tool and compile it. The benchmarks are in the bin directory; the sources of the kernels are in src/kernel.
  2. Execute kernel add -s 1MB. The latency printed at the end is the latency of one add instruction.
  3. Modify the kernel add.c so that the add instructions are independent. You will need to use more registers.
  4. Execute kernel local_load -s xxKB, where xx is the size of the array to load. Write a script that executes the kernel for different sizes, stores the results in a file for gnuplot, and plots how performance changes with the size. Explain the behavior you observe.
  5. Modify local_load.c so that every other instruction is a load and every other instruction is an add. What do you observe? What can you deduce?


Vectorization

For this experiment, you may need to consult the Intel C Compiler (ICC) manual for intrinsics, or the GCC manual for vector extensions.
  1. Download the TSC benchmark and compile it (without optimization flags).
  2. Vectorize the code of the function s000 with GCC vector extensions. What is the performance?
  3. Unroll the loop and measure the performance.
  4. Can you vectorize loop s221? Why? Find a transformation that makes the code vectorizable.
  5. Find a transformation to vectorize function s251.
  6. Find a transformation to vectorize function s254.
  7. Consider the original version of TSC. Now compile it with vectorization flags (-O2 -ftree-vectorize -mavx for GCC, -O2 -axAVX for ICC) and generate a vectorization report (-ftree-vectorizer-verbose=n for GCC, -vec-report=n for ICC). Analyze it.

Data layout optimization of Stencil

Download the Jacobi code. Compile it and execute it. The image produced is in jacobi.ppm.
  1. In the jacobi function, count the loads/stores and the floating-point operations, and compute their ratio. Is the code memory-bound or compute-bound?
  2. Now count the total size of the data accessed and compare it with the total number of computations, assuming each datum is loaded/stored only once. Answer the same question.
  3. Optimize the memory layout of the code to improve data locality (temporal and/or spatial). You can use loop interchange, change the data allocation, or perform tiling and unrolling.
  4. Measure with perf the number of cache misses of each version of your code.
  5. Vectorize the jacobi function.
  6. Parallelize it with OpenMP threads. Parallelize the other loops inside the while loop in the same way.

Optimization of the PLOSOne code

Download the PLOSOne archive. Decompress it, compile one version of the code, and execute it.
  1. Profile the code.
  2. Determine how to optimize it: what would be the most profitable optimization?
  3. Perform this optimization.
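A sketch of the profiling workflow for step 1. The binary name "plosone" is hypothetical; substitute the executable produced by the archive's build. The idea is to start with hardware counters, then use a sampled, per-function profile to locate the hot spot before touching any code.

```shell
# Hypothetical binary name; override with BIN=./your_binary
BIN=${BIN:-./plosone}

profile() {
    # Counter overview: instruction mix and cache behaviour.
    perf stat -e cycles,instructions,cache-misses "$BIN"
    # Sampled profile with call graphs, then a per-function report.
    perf record -g "$BIN"
    perf report --sort=symbol
}

# Only profile if the binary actually exists.
[ -x "$BIN" ] && profile || echo "build the code first, then rerun with BIN set"
```

Always profile the unmodified version first, so you have a baseline to compare each optimization against.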