Buildbots are the backbone of LLVM’s post-commit testing infrastructure. The community maintains a large fleet of build machines, ranging from very common setups, such as x86_64 on a Debian system, to more exotic ones, like different MIPS architectures. What they have in common is that a commit that breaks a production buildbot (I’ll explain in a second what that is) can be reverted to maintain “a green CI state” for the project. A revert should at least link to the broken build; even better is to describe the observed failure so that the original author can investigate the issue. Since buildbot is the name of the framework used, I will use the term “builder” for the rest of this post when referring to an actual instance that performs the building and testing of LLVM.*

Production vs Staging Builders: A production builder is a machine that has proven in the past to be reasonably reliable and to not produce intermittent failures, or at least to produce sufficiently few of them. A staging builder, on the other hand, may be less reliable or run a more experimental setup. The distinction matters because production builders automatically comment / send emails when they fail a build; to keep the distraction of contributors to a minimum, they should be sufficiently reliable.

Why Set Up A Builder?

There are a few reasons why you would want to set up a builder and add it to the LLVM post-commit test fleet. The primary one that I can think of: you have a product (and that includes hobbyist projects!) that relies on at least one component from the LLVM project in a certain configuration, and you want to make sure that this configuration continues to build and work as expected. This is likely one of the reasons why companies maintain one or more LLVM builders. Other reasons include being an open-source enthusiast who wants to support the project, or simply enjoying playing around with these things. Independent of your motivation, if you want to set up a builder, the instructions in the LLVM documentation are easy and straightforward.

The one thing that is not covered in the documentation (at least I did not see it) is that, for production builders, an additional file needs to be updated if they are supposed to send emails to their maintainer’s email address in case of a broken build. You can check out this PR to see what is needed to enable this for a production builder.

I’m not going to cover how to set up the actual builder instance on the machine but rather elaborate a bit on the lessons learned from maintaining a few builders for my day job.

Usable Builders With Actionable Feedback

When I started to maintain the builders for our team, I took care of a single builder. I had inherited it from someone, so I mostly kept it running as it was. I then inherited another one, and for that second one I took on more responsibility, working on its coverage and expanding what it actually did. Then a few more came along. At that point, I had to think about the sustainability of my work, the “actual goals” we want to achieve, and what is needed in order to get there. I had also already received some feedback from the community, and one of the most common questions was: How can I reproduce the error?

That is a good question, given that the builders are 1. bare metal, 2. custom installations, 3. inaccessible black boxes that give no feedback other than the stdout / stderr output of the bash script that performed the build. That led me to think about how to make the whole builder more reproducible, and I ended up with the two measures described below. While they may seem obvious, it was not completely clear to me at first how crucial they are for enabling contributors to reproduce build issues.

Containerize And Publish The Dockerfile: All newly created builders are containerized, and the Dockerfile from which the container is built is made available here. This is important, since quite a few problems are caught by builder configurations that run on older operating systems and therefore older compiler toolchains. An example of such a toolchain is SLES 15, which ships with a somewhat dated GCC 7.5 installation. Providing the Dockerfile used by the builder allows contributors to create and run the same environment locally.
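To give a flavor of what such a containerized environment looks like, here is a minimal Dockerfile sketch. The base image, package names, and layout are illustrative assumptions on my part and do not correspond to the actual published Dockerfile.

    # Illustrative sketch only: base image and package names are assumptions,
    # not the real builder Dockerfile.
    FROM registry.suse.com/suse/sle15:latest

    # Install the (intentionally old) system toolchain plus the tools the build steps expect.
    # Depending on the image, additional repositories or modules may have to be enabled first.
    RUN zypper --non-interactive install gcc gcc-c++ cmake ninja python3 git

    # Work in a dedicated directory, mirroring the layout a worker would use.
    WORKDIR /home/buildbot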

Use A CMake Cache For The Config: Buildbot offers different ways to create a build config for the project. These configs are stored in a separate repository, and it is not always obvious to the general contributor how to recreate the actual CMake invocation from them (let alone the whole system setup). This is why we are moving our build configs out of this secondary repository and into CMake cache files that live inside the LLVM repository. That makes it much easier for a contributor to recreate the exact CMake configuration the builder used when it detected a build error.
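As a rough sketch of what such a cache file can look like (the file name and the specific options below are made-up examples, not our actual builder configuration):

    # Hypothetical cache file, e.g. clang/cmake/caches/MyBuilder.cmake.
    # The options are examples only, not a real builder configuration.
    set(CMAKE_BUILD_TYPE Release CACHE STRING "")
    set(LLVM_ENABLE_PROJECTS "clang;lld" CACHE STRING "")
    set(LLVM_ENABLE_RUNTIMES "openmp" CACHE STRING "")
    set(LLVM_TARGETS_TO_BUILD "X86" CACHE STRING "")
    set(LLVM_ENABLE_ASSERTIONS ON CACHE BOOL "")

The builder then only has to pass -C <path-to-cache-file> to its CMake invocation, and a contributor can do exactly the same thing locally.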

Taking these two steps together, a contributor can set up the builder environment locally in a straightforward way and run the exact same commands within the same environment. They do not need multiple repositories or machines. This is a huge benefit and makes the “how can I reproduce the error” question much easier to answer. And while I did not talk about it at all in this post, it also enables the use of the exact same config in pre-commit testing and potentially even in downstream projects that depend on LLVM.
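Put together, the reproduction workflow then looks roughly like the following. The image name, the paths, and the cache file name are placeholders carried over from the sketches above.

    # Build and enter the builder's environment (Dockerfile path and image name are placeholders).
    docker build -t llvm-builder-env -f Dockerfile.builder .
    docker run --rm -it -v "$PWD/llvm-project:/work/llvm-project" llvm-builder-env

    # Inside the container: configure exactly as the builder did, via the cache file,
    # then build and run the failing check target.
    cmake -S /work/llvm-project/llvm -B /work/build -G Ninja \
          -C /work/llvm-project/clang/cmake/caches/MyBuilder.cmake
    ninja -C /work/build check-llvm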

I hope this was a somewhat interesting read and if you are curious, you can find me on the LLVM Discourse and/or Discord under the handle: jplehr.

*Technically, a builder runs on a worker. However, this can become quite confusing, and there is a good talk (Youtube, Slides) by David Spickett about buildbots that helped me tremendously when I inherited my first builder.
