LXC-container-based CI with integration tests
Following up on @tkrizek's work on LXC-based CI runners capable of running nested containers, I reconfigured the CI to run our integration tests. However, I didn't manage to configure caching well, so it rebuilds the containers every single time it runs, which takes a whopping 30 minutes.
The CI is set up so that it mirrors exactly what happens on dev machines. There is nothing special about the commands it runs; they work the same way locally.
Another change connected with this MR is a rewrite of the container management script, now fully in Python. All container-related operations are now managed centrally, with a better CLI and better maintainability.
Known issues
Race conditions
- the CI builds container images every time it runs
- the images are used as the base image in the following run of the CI
=> all in all, this allows for potential race conditions due to the shared global state in the container registry. As of now, I don't see it as a big problem, because any issue can be trivially fixed by rerunning the CI pipeline manually.
Possible mitigations could be:
- naming the container images based on the branch they were created from
- running the container build automatically only on the master branch
- replacing the container registry with a shared file cache of serialized container images (`podman save`, `podman load`); see the sketch after this list
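A rough sketch of how the last two mitigations could be combined in the CI job script. The cache directory and image name are placeholders I made up for illustration; only `$CI_COMMIT_REF_SLUG` is GitLab's own predefined variable for per-branch naming:

```sh
# Sketch only: cache directory and image name are placeholders, not our actual setup.
CACHE_DIR=/var/cache/ci-images          # shared file cache on the runner host (assumed to exist)
IMAGE=ci-base                           # hypothetical image name
TAG="${CI_COMMIT_REF_SLUG:-master}"     # per-branch tag avoids clashes between branches

# Reuse the previously serialized image without touching the shared registry.
if [ -f "$CACHE_DIR/$TAG.tar" ]; then
    podman load -i "$CACHE_DIR/$TAG.tar"
fi

# Rebuild and serialize the result back into the cache for the next run.
podman build -t "$IMAGE:$TAG" .
podman save -o "$CACHE_DIR/$TAG.tar" "$IMAGE:$TAG"
```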
Long runtime
I tried but failed to make it run faster; it still takes about 30 minutes. Where is the problem?
- Container images consist of multiple layers, typically one layer for each `Containerfile` command.
- When rebuilding a previously built image, the build tool (`buildah` in `podman`'s case) finds the lowest layer that would be changed by the source changes (either a change in a command or a change of data in `COPY`/`ADD` commands) and reuses everything below it from its cache (see the illustration after this list).
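For illustration, this is roughly how the layer cache behaves on a dev machine; the image name and the touched file path are hypothetical:

```sh
# Rough illustration on a dev machine (hypothetical image name "ci-base"):
podman build --layers -t ci-base .   # first build: a layer is created for each Containerfile command
touch data/some-copied-file          # modify a file that a later COPY command picks up
podman build --layers -t ci-base .   # only that COPY and the commands after it are rebuilt;
                                     # the layers below come from the local build cache
```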
The problem is with the second bullet point. The information needed to find the highest common layer is not stored in the image itself; it is stored next to it in a separate build cache. Because this cache is wiped every time the CI runs, the build cannot look anything up and falls back to building from scratch.
Docker-in-Docker sidesteps this issue, because it has a daemon with a single cache and single storage (on one host machine), regardless of the nesting level. Moreover, it supports the `--cache-from` option, which should let the build reuse layers from a previously pulled image in order to speed it up. I don't know, however, how it is able to find the highest common layer.
(This is my current understanding, not sure if it's exactly correct...)
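As far as I understand, the usual `--cache-from` pattern looks roughly like this; the image path is a placeholder, not our registry:

```sh
# Common --cache-from pattern with Docker (placeholder image path):
docker pull registry.example.org/ci-base:latest || true
docker build \
    --cache-from registry.example.org/ci-base:latest \
    -t registry.example.org/ci-base:latest .
```

If BuildKit's inline cache is used when pushing, the cache metadata is embedded in the image itself, which would explain how the highest common layer can be found without a local build cache; I haven't verified this against our setup.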
Is there a solution to this? The only one I am sure would work is using a single dedicated runner; that would ensure the cache stays in place between runs.