In our MetaCG GitLab CI setup, we had to fiddle around with, and eventually disable, the CI job that builds the container and runs the tests inside it. This is mainly due to the infrastructure on which we run our internal CI, which is, let’s say, not focused on providing a container-first environment. We ended up disabling the job and moving on. Then, some time later, we re-enabled the job, and this is where the story starts.
As part of re-enabling the job that builds the container and runs the tests inside the container, let’s call it “the container job”, we changed from the stage-organized CI setup to a job-organized CI setup. This means that the jobs and their dependencies form a directed acyclic graph (DAG). This allows for more fine-grained modeling and for accommodating starkly differing runtimes. This was the motivation for us, since the container job takes significantly longer than the other jobs. In the stage-organized CI setup, this led to considerable delays in CI turnaround time.
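To illustrate the difference, here is a minimal sketch of a job-organized pipeline using GitLab's needs keyword. The job names and scripts are placeholders I made up for this example, not the actual MetaCG configuration.

```yaml
# Sketch only: job names and script paths are hypothetical.
stages:
  - build
  - test

build-native:
  stage: build
  script:
    - ./build.sh                      # placeholder script

container-job:
  stage: build
  needs: []                           # no dependencies, starts right away
  script:
    - ./build-container-and-test.sh   # placeholder, the slow job

test-native:
  stage: test
  needs: ["build-native"]             # depends only on build-native
  script:
    - ./run-tests.sh                  # placeholder script
```

Without the needs entries, test-native would wait for every job in the build stage, including the slow container job. With them, it starts as soon as build-native has finished.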
When the changes were committed, we observed intermittent CI failures. Unfortunately, we observed these only some time later and not immediately. Moreover, the file system on the machine that runs the CI is sometimes a bit finicky and so we blamed the intermittent failures on the file system. At some point, however, it became apparent that these errors were not due to the file system. All errors were of a specific form.
# The error seen in the failing CI jobs. Pointing to a missing directory
shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
After investigating further and inspecting the GitLab CI configuration file more thoroughly, I eventually identified the issue: two git strategies across jobs. As apparent as this seems in hindsight, it was not easy to spot initially. So, if you find this error in your CI, you likely want to check whether multiple jobs define a GIT_STRATEGY of clone.
If your pipeline runs a job with GIT_STRATEGY set to clone while another job is running, the clone strategy may (and I bet it will at some point) interfere with the other job and happily overwrite or delete existing directories and files. That matches the getcwd error above: the shell of the running job loses its working directory. We finally fixed this issue in MetaCG with this commit.
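As a rough sketch of what to look for, and one possible way to avoid it, consider the fragment below. It assumes that concurrent jobs end up in the same checkout directory on the runner. The job names are hypothetical, and this is not necessarily what the MetaCG commit changed.

```yaml
# Sketch only: job names and script paths are hypothetical.
variables:
  GIT_STRATEGY: fetch        # one pipeline-wide strategy that reuses the working copy

container-job:
  needs: []
  # A per-job override like the following is the pattern to look for.
  # A fresh clone starts from an empty project directory, which can pull
  # the directory out from under a job running there at the same time:
  # variables:
  #   GIT_STRATEGY: clone
  script:
    - ./build-container-and-test.sh   # placeholder script

test-native:
  needs: []
  script:
    - ./run-tests.sh                  # placeholder script
```

Whether fetch is the right choice depends on your runners. The point is to make sure no job wipes a directory that another concurrently running job is still using.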
Hope this helps someone at some point to find their CI bug faster, so they spend less time on it than I did.