Determine max. memory consumption of process

In a recent project, we needed to measure the increase in memory consumption for an application process. How to obtain “the right values” for this depends on the actual scenario and, apparently, is not straightforward in all cases.

Let me first describe the scenario a little more: We want to obtain measurements for both fully serial and MPI parallel applications. These applications are run in three configurations: (1) an unchanged (vanilla) version, (2) an instrumented version (version 1), and (3) a version that uses LD_PRELOAD to sneak in another library, which overloads MPI functions to do additional work (version 2).

More precisely, this is what we want:

  • A way to obtain reliable measurements for the different configurations, as we are interested in the additional amount of memory we need in version 1 and version 2, when compared to vanilla.
  • The maximum memory consumption at runtime, not counting memory that has been swapped out.
  • We are only interested in running on a Linux operating system.

Eventually, we used getrusage. The struct rusage it fills in offers several fields related to memory. We found that for our use case, the correct value to use was the maximum resident set size (max RSS, field ru_maxrss), which Linux reports in kilobytes. This proved to be reliable and reasonable compared to manual calculations of the memory we assumed we would require. Example code is given below.

#include <sys/resource.h>
#include <stdio.h>
#include "mpi.h"

/* Overloads MPI_Finalize, so the value is reported at the end of the process */
int MPI_Finalize(void) {
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  /* We assume only the MPI root rank should output its memory consumption */
  if (rank == 0) {
    struct rusage r;
    if (getrusage(RUSAGE_SELF, &r) == 0) {
      /* On Linux, ru_maxrss is reported in kilobytes */
      printf("MAX RSS: %ld KB\n", r.ru_maxrss);
    }
  }
  /* Forward to the PMPI entry point for the actual MPI_Finalize */
  return PMPI_Finalize();
}
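
The MPI_Finalize hook only fires for MPI applications. For the fully serial case, one option is a destructor function in the preloaded library. The following is a minimal sketch and not the code we actually used: the names max_rss_kb and report_max_rss are made up for illustration, and __attribute__((destructor)) is a GCC/Clang extension, so this assumes one of those compilers.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: peak resident set size of the calling process.
 * Returns the value in kilobytes (the unit Linux uses), or -1 on error. */
long max_rss_kb(void) {
  struct rusage r;
  if (getrusage(RUSAGE_SELF, &r) != 0) {
    return -1;
  }
  return r.ru_maxrss;
}

/* Runs when the process exits normally (or the library is unloaded).
 * Assumption: GCC or Clang, which support the destructor attribute. */
__attribute__((destructor))
static void report_max_rss(void) {
  fprintf(stderr, "MAX RSS: %ld KB\n", max_rss_kb());
}
```

Compiled into a shared library and loaded via LD_PRELOAD, this should print the value once at process exit without changes to the application. Since max RSS is a high-water mark, querying it at the very end is sufficient.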

PIRA – a framework for iterative instrumentation refinement

The main software project I have been working on over the last weeks and months is PIRA – the Performance Instrumentation Refinement Automation framework. It is available at https://github.com/jplehr/pira. It is the first software project for which I have set up and used continuous integration. However, for historical reasons, all components are split across several repositories, and the release “process” used for the initial release is a mess.
(Hint: the currently available version doesn’t work, because I missed something when I released it.)

The next release, using a better release process, is scheduled for August 1st.

Anyway – What is PIRA?

The framework helps performance analysts and computer scientists discover performance characteristics of their own, or someone else’s, C and C++ software using Score-P. PIRA uses a combination of static and dynamic analysis to iteratively adapt an instrumentation configuration, i.e., the set of functions that should be instrumented for measurement or analysis.

The main driver is written in Python 3. The analysis and instrumentation components are separated into an analysis tool and metric collectors built on top of Clang/LLVM. The final measurements are performed using the Score-P measurement infrastructure.

For those interested, there are two research papers available: (i) about the framework and (ii) a use case, in which we used PIRA to automatically reduce the number of functions passed to the empirical performance modeling tool Extra-P.

What is going to come?

In the next weeks I’ll write some notes about how to use PIRA for your own purposes and what I did when setting up my GitLab CI instances.

Next Release: August 1st

The next PIRA release is planned for August 1st. It includes new features, such as automatic MPI-function filtering, configurable rebuild intervals, and better-to-use configuration files.

papi-wrap now public

I took some time on my last day of vacation to finish the refactorings I wanted to do on the PAPI wrapper that I mentioned in a previous post. Although I am sure that there are lots of things to clean up in this rather small code base, I made it publicly available!

It was used to generate the measurement results in my paper about the influence of measurement infrastructures, available in the ACM digital library.

The library is intended as an easy-to-use PAPI interface for C++ codes. It can be integrated into your code as a library, or it can be used as an external measurement routine via libmonitor. I may continue to work on this library in my free time, as I have some more ideas and want to integrate two features: first, a more structured way to output the measurement results; second, simple timer mechanisms in addition to counting PAPI events.

If you are interested in this project, you can go to papi-wrap on my GitHub, download the source, build it, and play around with it.

Interactive shell with SLURM

I just discovered a half-broken blueprint script that was supposed to open an interactive bash session within a newly allocated SLURM job. I typically allocate interactive sessions when I want to test a specific benchmark configuration on a particular machine or type of machine.

I always forget the exact command, so here is a fixed, i.e. working for me, line:

srun -n 1 --mem-per-cpu=100 -t 10:00 --pty bash -i

The line has SLURM allocate a new job with 1 task (-n 1) and 100 MB of memory per CPU (--mem-per-cpu=100). The job lives for 10 minutes (-t 10:00) and starts an interactive bash within it (--pty bash -i). I frequently also add the SLURM flag for exclusivity (--exclusive).

Please be aware that if your compute center operates with compute quotas, exclusivity will increase the compute time you are charged. Since you are effectively allocating the whole machine for yourself, you also occupy all of its CPUs. As a result, independent of the number of CPUs your job actually uses, the whole machine is accounted, i.e. number_of_cores * runtime_of_job.
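
To make the accounting concrete, here is a small sketch of that formula. The function names and the 16-core node in the usage note are hypothetical, and the accounting policy at your site may differ, so treat this as an illustration of the arithmetic only.

```c
#include <stdio.h>

/* Core-minutes charged for a job: allocated cores times wall-clock runtime. */
long accounted_core_minutes(long allocated_cores, long runtime_minutes) {
  return allocated_cores * runtime_minutes;
}

/* Prints what the same job costs with and without exclusivity.
 * cores_per_node is an assumed node size, not something SLURM reports here. */
void print_exclusive_cost(long cores_per_node, long tasks, long runtime_minutes) {
  printf("shared:    %ld core-minutes\n",
         accounted_core_minutes(tasks, runtime_minutes));
  printf("exclusive: %ld core-minutes\n",
         accounted_core_minutes(cores_per_node, runtime_minutes));
}
```

For the 10-minute, single-task session above on a hypothetical 16-core node, print_exclusive_cost(16, 1, 10) reports 10 core-minutes without exclusivity and 160 core-minutes with it.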