Determine max. memory consumption of a process

In a recent project, we needed to measure the increase in memory consumption of an application process. How to obtain “the right values” for this depends on the actual scenario and, apparently, is not straightforward in all cases.

Let me first describe the scenario a little more: we want to obtain measurements for both fully serial and MPI-parallel applications. These applications are run in (1) an unchanged (vanilla) version, (2) an instrumented version (version 1), and (3) a version that uses LD_PRELOAD to sneak in another library that intercepts MPI functions to do additional work (version 2).

More precisely what we want:

  • A way to obtain reliable measurements for the different configurations, as we are interested in the additional amount of memory we need in version 1 and version 2, when compared to vanilla.
  • The maximum memory consumption at runtime, disregarding memory that is potentially swapped out.
  • We are only interested in running on a Linux operating system.

Eventually, we used getrusage. The returned struct rusage offers several fields related to memory. We found that for our use case, the right value to use was the maximum resident set size (max RSS). It proved to be reliable and matched reasonably well with our manual estimates of the memory we expected to require. Example code is given below.

#include <sys/resource.h>
#include <stdio.h>
#include "mpi.h"

/* Needs to be called at the end of the process */
int MPI_Finalize() {
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  // We assume only MPI root should output memory consumption
  if (rank == 0) {
    struct rusage r;
    getrusage(RUSAGE_SELF, &r);
    // On Linux, ru_maxrss is reported in kilobytes
    printf("MAX RSS: %ld\n", r.ru_maxrss);
  }
  // Hand over to PMPI for the actual MPI_Finalize
  return PMPI_Finalize();
}
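To obtain version 2 from the scenario above, a wrapper like this is typically compiled into a shared library and sneaked into the unmodified binary via LD_PRELOAD at launch time. For the fully serial case there is no MPI_Finalize to intercept; one option is to register an exit handler instead. Below is a minimal sketch using atexit() and getrusage() (the handler name print_max_rss is just for illustration):

#include <stdlib.h>
#include <stdio.h>
#include <sys/resource.h>

/* Print the maximum resident set size when the process exits */
static void print_max_rss(void) {
  struct rusage r;
  getrusage(RUSAGE_SELF, &r);
  printf("MAX RSS: %ld\n", r.ru_maxrss); /* kilobytes on Linux */
}

int main(void) {
  atexit(print_max_rss); /* register the handler as early as possible */
  /* ... actual application work ... */
  return 0;
}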

papi-wrap now public

I took some time on my last day of vacation to finish the refactoring I wanted to do on the PAPI wrapper that I mentioned in a previous post. Although I am sure that there are lots of things to clean up in this rather small code base, I made it publicly available!

It was used to generate the measurement results in my paper about the influence of measurement infrastructures, available in the ACM Digital Library.

The library was intended as an easy-to-use PAPI interface for C++ codes. It can either be integrated into your code as a library or be used as an external measurement routine via libmonitor. I may continue to work on it in my free time, as I have some more ideas and want to add two features: first, a more structured way to output the measurement results; second, having it not only count PAPI events, but also provide simple timer mechanisms.
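papi-wrap's own interface is not shown here, but for context, the raw PAPI boilerplate that such a wrapper is meant to hide looks roughly like the following minimal sketch, counting total instructions (the event choice is only an example):

#include <stdio.h>
#include <papi.h>

int main(void) {
  int event_set = PAPI_NULL;
  long long counts[1];

  /* Initialize PAPI and set up an event set with one preset event */
  if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
    return 1;
  PAPI_create_eventset(&event_set);
  PAPI_add_event(event_set, PAPI_TOT_INS);

  PAPI_start(event_set);
  /* ... code region to be measured ... */
  PAPI_stop(event_set, counts);

  printf("PAPI_TOT_INS: %lld\n", counts[0]);
  return 0;
}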

If you are interested in this project, you can go to papi-wrap on my GitHub, download the source, build it, and play around with it.

Interactive shell with SLURM

I just discovered a half-broken blueprint script that was supposed to open an interactive bash session within a newly allocated SLURM job. I typically allocate interactive sessions when I want to test a specific benchmark configuration on a particular machine or type of machine.

I always forget the exact command, so here is a fixed (i.e., working for me) line:

srun -n 1 --mem-per-cpu=100 -t 10:00 --pty bash -i

The line will have SLURM allocate a new resource with 1 task (-n 1) and 100 MB of memory per CPU (--mem-per-cpu=100). The job will live for at most 10 minutes (-t 10:00) and start a bash within it. I frequently also add the SLURM flag for exclusivity (--exclusive), as shown below.
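With exclusivity, the line above would, for example, become:

srun -n 1 --mem-per-cpu=100 -t 10:00 --exclusive --pty bash -i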

Please be aware that if your compute center operates with compute quotas, exclusivity will result in increased compute time being consumed. Since you are practically allocating the whole machine for yourself, you also occupy all of its CPUs. As a result, independent of the number of CPUs your job actually uses, the whole machine will be accounted for, i.e. #number_of_cores * runtime_of_job.
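For example, on a (hypothetical) 48-core node, a 2-hour exclusive job is accounted as 48 * 2 = 96 core-hours, even if it only ever uses a single core.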