MetaCG Development Details Mini Series

This is the first post of a mini series that I thought may be interesting to do. It goes into some of the details of our development model for the MetaCG library and the reasons behind it. What we do here is by no means “the” right thing for every project, but maybe our perspective on the different topics is relevant for you and the decisions you make in your own project. The series will consist of three parts.

  • Branching Model: In the first part we provide an overview of our branching model and why we chose it.
  • Development and Testing: The second part will go over our development work. This is probably the largest part of the mini series and may be split up into multiple parts itself.
  • Release Management: In the third part we provide an overview of our release management, i.e., which releases we do, how they are prepared and how they are finalized.

However, let’s first have some history on MetaCG as this will be the reason for some of the decisions that were and are taken.

MetaCG — History

MetaCG started (probably late 2013) as a nameless mock-tool for smart instrumentation tools and heuristics within our work on the InstRO tool. It was used to construct the call graph for a program given an unfiltered Score-P profile. An unfiltered Score-P profile is a runtime profile recorded with the Score-P tool that contains all call edges from the original source code executed at runtime. Given this call graph, the tool would then evaluate the heuristics that were published in the paper “Calltree-Controlled Instrumentation for Low-Overhead Survey Measurements”.

Then, during my PhD time, I evolved the code base of the nameless mock-tool starting in 2017 into what was then referred to as MetaCG. The code base was, however, still almost exclusively focused on the use within the PIRA tool and our work on iterative performance instrumentation refinement that we published in our paper on PIRA.

After our paper on “Automatic Instrumentation Refinement for Empirical Performance Modeling”, we decided to evolve MetaCG into its own tool that can be used more easily outside of the PIRA context. The idea behind this was to ease our research tools development should we want or need to pass information from the source level to the LLVM level. We wrote a paper about MetaCG and used the now-available whole-program call-graph information for an extended analysis in our paper “Towards compiler-aided correctness checking of adjoint MPI applications”.

Since then we have worked on the MetaCG code base to separate it more and more from the initial use case within PIRA. This means that it is now evolving into a set of libraries and tools that build on top of these libraries. One tool is the (also evolving) analyzer used within PIRA and a more recently created tool is the analyzer built as part of the CaPI project.

This evolution of the MetaCG code base has led to several significant changes and re-designs. Many of these changes were well motivated and necessary as we evolved from a special-purpose mock-up tool, to a prototype tool for one particular context, to a set of libraries for tool construction. What I learned in the process may also make it into an article on this website eventually.

As one important side note, coming from academia, the possibility (a) to work on paper ideas and (b) for students to work on parts of MetaCG for a thesis was and is very important. This is reflected in the branching model and in other areas of the tool and library that come up throughout the mini series.

I’m now on twitch!

Yes that’s right! I’m now on twitch!

I have lately started to stream some programming and building games live on twitch.tv/jplehr and found that quite relaxing and nice. So I want to continue it with a schedule. I’m happy if people join in to chat, learn more about the software that I work on, maybe about programming in general, or whatever we want to chat about.

Wednesday from 09 PM CET: Programming

Every Wednesday from 09.00 pm CET to about midnight, I am continuing on one of the software packages that I introduce on this website. Currently, I am mostly working on MetaCG or PIRA. Both can also be found on Github.

Sunday from 09 PM CET: Open Stream

I’ll probably also stream on Sundays, from 09.00 pm CET to about midnight. But on Sundays, it’ll not only be programming. Other things I want to stream, and can stream given the computers I currently have, are OpenTTD or Cities Skylines. So some relaxing building and development games. That may change obviously. 😉

So, if you want to know what’s coming up, follow me on twitter or twitch and receive updates and notifications on what I’ll be doing.

MetaCG – Annotated Whole-Program Call-Graphs

MetaCG is a tool suite for whole-program inspection and automatic performance instrumentation generation. It brings a call-graph extractor built on Clang Tooling and a whole-program call-graph library that is serializable to a JSON-based format and allows annotation with user-defined metadata, hence the name MetaCG. Finally, it includes a call-graph validation tool that uses a Score-P profile to determine which edges are missing.

Repository: https://github.com/tudasc/metacg

In our work on PIRA, we realized that we need a whole-program call-graph representation that we can analyze and annotate with user-defined information. There are obviously multiple ways to do that, and we decided (more or less well-informed) to implement it as a library together with a toolchain to extract the call graph from C/C++ code using Clang Tooling. To evaluate its completeness, we figured that it is easiest for us to use instrumentation-based profiling data from Score-P. Score-P is a dependency of PIRA anyway, and at that time the call-graph library was only used within PIRA. Later on, we realized that we want to use whole-program reachability information in other tools as well, and we think that the call-graph library of PIRA is a reasonable abstraction.

So we started MetaCG as a more general software package.

The software package is written in C++, uses the CMake build system and is licensed under a BSD 3-clause license. It comes with five software components.

MetaCG Library: The fundamental call-graph representation. A lightweight, bi-directional graph of which the function nodes can hold user-defined information in MetaContainers. The graph can be serialized into JSON, in which case the MetaContainers are output to every function node with a specified key such that they can be identified in the JSON file later on, e.g., by a subsequent analysis tool. Currently, it does not contain explicitly modeled edges, which limits its expressiveness to some extent. However, this is a feature that is planned and will be added when time permits.

CGCollector: The Clang-based call-graph extractor. It processes the abstract syntax tree and obtains information about the class hierarchy, call relations, and other source-level information that a user needs. The latter is done through the MetaCollector extension point, i.e., for every piece of source information that should be annotated, a new MetaCollector is derived that obtains the desired information and attaches it to the MetaCG for a specific tool.

CGMerge: CGCollector works on a single translation unit at a time, hence the partial call graphs need to be merged. This is done with CGMerge, similar to linking a binary. It takes all translation-unit-local files and merges them. Because the metadata fields, i.e., the data generated by a MetaCollector, may contain multiple entries for the same function, CGMerge needs a strategy to resolve them, and the user is required to provide it.

CGValidate: The tool gets a MetaCG and a Score-P profile in Cube format (please note, it needs to be a full profile, i.e., with all functions marked as inline etc.), and checks which edges of the profile are not present in the MetaCG. This allows a user to validate that all potential function calls are contained, and if not, CGValidate can patch the missing edges into the MetaCG.
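To illustrate the idea behind this validation step, here is a minimal sketch in C. It is not how CGValidate is implemented: the edge struct and both function names are hypothetical, edges are simply compared by function name, and reading the Cube profile or the MetaCG file is left out entirely.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical edge type: one caller -> callee relation by function name. */
struct edge {
  const char *caller;
  const char *callee;
};

/* Returns 1 if e is contained in the edge list g of length n, else 0. */
static int has_edge(const struct edge *g, size_t n, const struct edge *e) {
  for (size_t i = 0; i < n; ++i) {
    if (strcmp(g[i].caller, e->caller) == 0 &&
        strcmp(g[i].callee, e->callee) == 0) {
      return 1;
    }
  }
  return 0;
}

/* Prints every profiled edge that the static call graph lacks and
   returns how many edges were missing. */
size_t report_missing_edges(const struct edge *cg, size_t n_cg,
                            const struct edge *profile, size_t n_profile) {
  size_t missing = 0;
  for (size_t i = 0; i < n_profile; ++i) {
    if (!has_edge(cg, n_cg, &profile[i])) {
      printf("missing: %s -> %s\n", profile[i].caller, profile[i].callee);
      ++missing;
    }
  }
  return missing;
}
```

Every edge that is reported this way was observed at runtime but is unknown to the statically constructed graph, which makes it a candidate for patching.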

PGIS: The PIRA analyzer that performs call-graph analysis to generate low-overhead performance instrumentation for subsequent measurement with Score-P.

If you are curious, please check it out, and report issues and bugs in the issue tracker on Github. I also plan to write more articles here that explain some components or use-cases in more detail.

The development currently takes place in a university-hosted Gitlab instance, hence, not every feature that is being worked on is already public. Should you be interested in our progress or even in contributing to the project, please also open an issue on Github and we can figure out how you get access.

Determine max. memory consumption of process

In a recent project, we needed to measure the increase in memory consumption for an application process. How to obtain “the right values” for this depends on the actual scenario and, apparently, is not straightforward in all cases.

Let me first describe the scenario a little more: We want to obtain measurements for both fully serial and MPI parallel applications. These applications are run in three variants: (1) an unchanged version (vanilla), (2) an instrumented version (version 1), and (3) a version that uses LD_PRELOAD to sneak in another library, which overloads MPI functions to do additional work (version 2).

More precisely what we want:

  • A way to obtain reliable measurements for the different configurations, as we are interested in the additional amount of memory we need in version 1 and version 2, when compared to vanilla.
  • The max memory consumption at runtime, not regarding potential /swap memory.
  • We are only interested in running on a Linux operating system.

Eventually, we used the rusage feature, i.e., the getrusage function. The returned struct offers different fields related to memory. We found that for our use case, the correct value to use was the maximum resident set size (max RSS), which Linux reports in kilobytes. This proved to be reliable and reasonable compared to manual calculations of the memory we assumed we require. Example code is given below.

#include <sys/resource.h>
#include <stdio.h>
#include "mpi.h"

/* Needs to be called at the end of the process, hence we intercept MPI_Finalize */
int MPI_Finalize(void) {
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  // We assume only MPI root should output memory consumption
  if (rank == 0) {
    struct rusage r;
    getrusage(RUSAGE_SELF, &r);
    // On Linux, ru_maxrss is the peak resident set size in kilobytes
    printf("MAX RSS: %ld\n", r.ru_maxrss);
  }
  // Call to PMPI for actual MPI_Finalize
  return PMPI_Finalize();
}
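
For the fully serial runs there is no MPI_Finalize to hook into. A minimal sketch of how the same value could be reported there, assuming GCC or Clang (the destructor attribute is a compiler extension) and Linux: the names max_rss_kb, rss_overhead_kb and report_max_rss are made up for this illustration.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: peak resident set size of the calling process.
   On Linux, ru_maxrss is reported in kilobytes. Returns -1 on error. */
long max_rss_kb(void) {
  struct rusage r;
  if (getrusage(RUSAGE_SELF, &r) != 0) {
    return -1;
  }
  return r.ru_maxrss;
}

/* Hypothetical helper: additional memory of a variant compared to vanilla. */
long rss_overhead_kb(long vanilla_kb, long variant_kb) {
  return variant_kb - vanilla_kb;
}

/* GCC/Clang extension: runs after main() returns or exit() is called,
   but not after abort() or _exit(). Prints to stderr so that the
   application's regular output is left untouched. */
__attribute__((destructor)) static void report_max_rss(void) {
  fprintf(stderr, "MAX RSS: %ld kB\n", max_rss_kb());
}
```

Running the vanilla binary and one of the other versions with this in place yields two values, and their difference is the additional memory we are interested in.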

PIRA – a framework for iterative instrumentation refinement

The main software project I have been working on over the last weeks and months is PIRA – the Performance Instrumentation Refinement Automation framework. It is available at https://github.com/jplehr/pira. It is the first software I have set up and used continuous integration for. However, for some historic reason, all components are split up into several repositories and the release “process” used for the initial release is a mess.
(Hint: the currently available version doesn’t work, because I missed something when I released it.)

The next release, using a better release process, is scheduled for August 1st.

Anyway – What is PIRA?

The framework can assist performance analysts and computer scientists in discovering performance characteristics of their, or someone else’s, C and C++ software using Score-P. PIRA uses a combination of static and dynamic analysis to iteratively adapt an instrumentation configuration, i.e., which functions should be instrumented for measurement or analysis.

The main driver is written in Python 3. The analysis and instrumentation components are separated into an analysis tool and metric collectors built on top of Clang/LLVM. The final measurements are performed using the Score-P measurement infrastructure.

For those interested, there are two research papers available: (i) about the framework and (ii) a use case, in which we used PIRA to automatically reduce the number of functions passed to the empirical performance modeling tool Extra-P.

What is going to come?

In the next weeks I’ll write some notes about how to use PIRA for your own purposes and what I did when setting up my Gitlab CI instances.

Next Release: August 1st

The next PIRA release is planned for August 1st. It includes new features, such as automatic MPI-function filtering, configurable rebuild intervals, and better-to-use configuration files.

papi-wrap now public

I took some time on my last day of vacation to finish the refactorings I wanted to do on the PAPI wrapper that I mentioned in a previous post. Although I am sure that there are lots of things to clean up in this rather small code base, I made it publicly available!

It was used to generate the measurement results in my paper about the influence of measurement infrastructures, available in the ACM digital library.

The library was intended as an easy-to-use PAPI interface for C++ codes. It can be used as a library to be integrated in your code or it can be used as an external measurement routine using libmonitor. I may continue to work on this library in my free time as I do have some more ideas and want to integrate two features. One, implement a more structured way to output the measurement results. Two, have it not only count PAPI events, but also have it provide simple timer mechanisms.

If you are interested in this project, you can go to the papi-wrap on my github and download the source, build it and play around with it.

Interactive shell with SLURM

I just discovered a half-broken blueprint script that was supposed to open an interactive bash session within a newly allocated SLURM job. I typically allocate interactive sessions when I want to test a specific benchmark configuration on a particular machine or type of machine.

I always forget the exact command, so here is a fixed, i.e. working for me, line:

srun -n 1 --mem-per-cpu=100 -t 10:00 --pty bash -i

The line will have SLURM allocate a new resource with 1 task (-n 1) and 100 MB of memory per CPU (--mem-per-cpu=100). The job will live for 10 minutes (-t 10:00) and start an interactive bash within it (--pty bash -i). I frequently also add the SLURM flag for exclusivity (--exclusive).

Please be aware that if your compute center operates with compute quotas, the exclusivity will result in increased compute time consumed. Since you are effectively allocating the whole machine for yourself, you also occupy all of its CPUs. As a result, independent of the number of CPUs your job actually uses, the whole machine will be accounted for, i.e., number_of_cores * runtime_of_job.
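
The accounting rule can be written down as a small worked example. This is only a sketch of the formula above: the function name and its parameters are hypothetical, and how a center really charges jobs depends on its SLURM configuration.

```c
/* Hypothetical helper: core-minutes charged for a single-node job.
   With exclusive != 0 every core of the node is billed, otherwise
   only the cores that were actually requested. */
long charged_core_minutes(int cores_per_node, int cores_requested,
                          int runtime_minutes, int exclusive) {
  int billed_cores = exclusive ? cores_per_node : cores_requested;
  return (long)billed_cores * runtime_minutes;
}
```

Assuming a node with 48 cores, the 10-minute single-task job from above would be charged 10 core-minutes without --exclusive, but 48 * 10 = 480 core-minutes with it.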