In my talk at the FOSDEM LLVM devroom this year, I mentioned that for my dayjob I inherited the care-taking of a single LLVM buildbot, which then expanded to another buildbot, and another one. While handling the initial three bots manually was not much of a problem, it became more cumbersome when additional bots were brought online. At that point it became clear that some automation was needed, and I started to look at which technologies exist. In this post, I'll go a little into what I ended up doing for the automation. If you are more interested in the thoughts behind the actual Buildbot setup, you may want to check out this blog post instead.
The machine fleet runs several nodes equipped with two AMD CPUs (32 cores / 64 threads each), half a terabyte of RAM, and a single AMD Instinct MI210 GPU. These machines run a mix of bare-metal buildbots and containerized buildbots, with four containers sharing a single machine. The fleet also includes some older machines equipped with two smaller core-count Intel CPUs and AMD Instinct MI100 GPUs.
Technology Choice
Given that I very loosely follow what some of our DevOps teams do and use for their automation tasks, I relatively quickly settled on working through the documentation and getting-started guide for Ansible. This choice was mainly informed by (a) someone else already using it, and (b) it being actively developed with a large(r) community. This combination means that, should someone else inherit the maintenance work, they do not need to read arcane scripts but can rely on pretty widespread functionality. Another benefit of relying on something widespread is that the fundamental implementation is well-tested. Compared to some in-house-cooked solution which may or may not be tested, a heavily used tool spearheaded by Red Hat should be somewhat reliable. Finally, I have some friends who, while working for other employers, use Ansible in their dayjobs, which means I can nonchalantly bug them via my favorite instant-messaging service and ask questions.
Ansible Inventory And Initial Playbooks
To start with Ansible, I followed their getting-started documentation and made good initial progress. First, I created an example inventory that listed only one machine: one that was not yet fully provisioned, so that it would not really matter should I screw something up. I simply gave that machine a name and then used the inventory to run some ad-hoc commands, mostly to understand what tools I needed to install on the machine I was working from and how the whole workflow works.
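To give an idea of what such a starting point looks like, here is a minimal inventory sketch; the group name, host name, address, and user are placeholders rather than the actual fleet entries.

```yaml
# inventory.yml - minimal sketch of a one-machine inventory (all names are placeholders)
all:
  children:
    staging:
      hosts:
        gpu-node-01:
          ansible_host: 192.0.2.10
          ansible_user: ciadmin
```

With something like this in place, an ad-hoc command such as `ansible staging -i inventory.yml -m ping` is enough to confirm that Ansible can reach the node.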
Once I got a bit comfortable with the concepts of inventory files and running commands via Ansible, I looked for an easy task to automate first via a playbook. Playbooks are Ansible's "unit of abstraction", or whatever you want to call it, for automating a single task. So, my initial playbook would simply update the package manager database and then update the packages on the node. From this initial playbook, I then expanded to other playbooks with more complex tasks. The next playbooks I implemented covered everything required to provision a machine into a state from which the LLVM GPU offload tests can be run. This means: install the AMDGPU kernel fusion driver, install a minimal set of ROCm components, and install a few more OS-level packages, like CMake or ninja-build.
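For illustration, that first playbook could look roughly like the sketch below; it assumes apt-based nodes and a placeholder group name.

```yaml
# update-packages.yml - sketch of the initial "update everything" playbook
- name: Update package database and upgrade packages
  hosts: gpu_nodes        # placeholder group name
  become: true
  tasks:
    - name: Refresh the apt package cache
      ansible.builtin.apt:
        update_cache: true

    - name: Upgrade all installed packages
      ansible.builtin.apt:
        upgrade: dist
```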
Once these starter tasks were automated, the next item on the list was to install the Docker runtime and automate the creation of the Docker containers which host the final testing setup, i.e., the Buildbot builders. Since the Dockerfiles are publicly available, the actual setup of the CI runners, i.e., the buildbots, is handled via Ansible so that no credentials end up in public. Even though at this point the deployment of the software stack was fully automated, one critical piece was still missing: adding some required kernel boot parameters to make the GPU available to the operating system via an IOMMU flag setting. I left this to the very end on purpose, since I wanted to be comfortable with Ansible before automating changes to kernel boot parameters.
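A sketch of what that last piece could look like is below; the exact IOMMU flags and the grub regeneration command differ between platforms and distributions, so treat the values here as assumptions rather than the settings actually used on the fleet.

```yaml
# kernel-params.yml - hedged sketch for setting IOMMU boot parameters (flag values are assumptions)
- name: Ensure IOMMU kernel boot parameters are set
  hosts: gpu_nodes        # placeholder group name
  become: true
  tasks:
    - name: Add IOMMU flags to the default kernel command line
      ansible.builtin.lineinfile:
        path: /etc/default/grub
        regexp: '^GRUB_CMDLINE_LINUX_DEFAULT='
        line: 'GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=on iommu=pt"'
      notify: Regenerate grub configuration

  handlers:
    - name: Regenerate grub configuration
      ansible.builtin.command: update-grub
```

A change like this only takes effect after a reboot, which is another thing the playbook has to be explicit about.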
Questions And Struggles
While many things went well, not everything is perfect. The way I set up the inventory is probably best described as a disorganized mess. Even though I just recently refactored the inventory, I still think that its organization is far from ideal. The typical examples show some web servers and some database servers, which lends itself to a fairly clean separation. For the buildbot setup, the separation is less clear: machines could be grouped by OS kernel, ROCm version, dockerized OS version, and many other factors.
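As an illustration of the problem, a grouped inventory could slice the fleet along any of these axes; the group and host names below are made up, and none of the possible groupings feels obviously "right".

```yaml
# One possible (made-up) way to slice the fleet by ROCm version and container hosting
all:
  children:
    rocm_6:
      hosts:
        gpu-node-01:
        gpu-node-02:
    docker_hosts:
      hosts:
        gpu-node-02:
        gpu-node-03:
```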
The other topic that I still struggle with is finding the best way to avoid accidentally deploying something to the wrong set of nodes. On that topic I'm quite sure that I'm doing something not quite right: I have accidentally reinstalled ROCm on the wrong node probably a dozen times by now. While I have since changed the playbooks to not run on any default host (and to expect a command-line option instead), I'm not sure this is the best solution, mostly because it makes it non-obvious what was deployed where from just looking at the playbooks.
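The safeguard I ended up with looks roughly like the sketch below; the variable name is a placeholder, and the debug task stands in for the real provisioning tasks.

```yaml
# Sketch of the "no default hosts" safeguard; "target_hosts" is a placeholder name
- name: Install ROCm components
  hosts: "{{ target_hosts | default('none') }}"   # matches nothing unless set on the command line
  become: true
  tasks:
    - name: Show which host this run targets
      ansible.builtin.debug:
        msg: "Deploying to {{ inventory_hostname }}"
```

A run then has to name its targets explicitly, e.g. `ansible-playbook rocm.yml -i inventory.yml -e target_hosts=staging`, which keeps a bare `ansible-playbook rocm.yml` from touching anything.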
Final Thoughts
Working with Ansible to automate the provisioning of the Buildbot nodes has been pretty positive so far. I think it's great that the inventory and the playbooks are managed in a git repository, which makes changes traceable. Automating the standard tasks to provision these nodes was certainly helpful. I found it particularly valuable for the nodes that host dockerized builders: since we run multiple containers on a single node, each container is bound to a specific set of CPU cores. Previously this was documented, but the actual deployment was done manually. Now it is documented through the playbook, which is also the thing that deploys the containers.
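For those containerized builders, a task along the following lines captures both the deployment and the core pinning in one place; the container name, image, and core range are placeholders, and it assumes the community.docker collection is installed.

```yaml
# Sketch of deploying a builder container pinned to a CPU core set (names and ranges are placeholders)
- name: Deploy a containerized buildbot builder
  hosts: docker_hosts
  become: true
  tasks:
    - name: Start the builder container on cores 0-15
      community.docker.docker_container:
        name: buildbot-worker-0
        image: example/buildbot-worker:latest
        state: started
        restart_policy: unless-stopped
        cpuset_cpus: "0-15"
```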