Why Google and Facebook don't use Docker

I wrote this article because I wanted to make a modified distributed PyTorch program start faster on Facebook's cluster. The exploration turned out to be interesting, and it also illustrates the body of knowledge that industrial machine learning requires.

I worked at Google for three years right after I graduated in 2007. At that time, I thought the distributed operating system Borg was really easy to use.

After leaving Google in 2010, I kept hoping Borg would be open-sourced, until Kubernetes finally appeared.

The unit of computation scheduled by Kubernetes is the container. (The word is meant in the sense of a shipping container, not a generic receptacle; one look at the whale carrying shipping containers in Docker's logo makes the intent clear.)

A container executes an image just like a process executes a program.

I suspect that Googlers and ex-Googlers alike never ran into the concepts of container and image while using Borg. Why are these two concepts absent from Borg yet introduced in Kubernetes?

The question flashed through my mind and I let it go. After all, the projects I was responsible for later, such as Baidu Paddle and Ant's SQLFlow and ElasticDL, were open source, and Docker was easy enough to use, so I did not think much about it.

Earlier this year (2021), I joined Facebook. Coincidentally, Facebook published a paper [1] introducing its distributed cluster management system Tupperware.

However, Tupperware is a trademark registered in 1946 (https://en.wikipedia.org/wiki/Tupperware_Brands), so the paper had to use another name, Twine.

Because many people in the industry know the system by the name Tupperware, I will stick with that name rather than Twine in this article.

In short, the publication of this paper prompted me to revisit the old question: there is no Docker at Facebook!

After detailed chats with several current and former colleagues from Facebook's Tupperware team and Google's Borg team, it finally dawned on me. Since there is no write-up of this topic in the industry, I am recording it in this article.

In a nutshell

Simply put, if you use a monolithic repository to manage your code, you don't need "packages" such as Docker images (or ZIP, tarball, RPM, deb).

A monolithic repo is one in which all the code for all of a company's projects is concentrated in a single repo (or a very small number of repos).

A monolithic repository has to come with a unified build system, because otherwise it would be impossible to build such a large body of code.

Since there is a unified build system, once a module that a program running on a cluster node depends on is found to have changed, that module alone can be synchronized to the node. There is no need to package and re-ship anything.

By contrast, if each project lives in its own git/svn repo and uses its own build system, as open source projects in separate GitHub repos do, then each project's build results need to be packaged.

The Docker image is a layered package format, so we only need to transfer the top layers that contain the changes, and try to reuse the lower layers already cached on the node.

Both Google and Facebook use monolithic repositories and have their own build systems (my old article Finding Google Blaze[2] explains Google’s build system), so there is no need for “packages” and of course no need for Docker images.

However, both Borg and Tupperware do have containers, built on mechanisms provided by the Linux kernel such as cgroups, which the Google Borg team contributed to the kernel more than ten years ago, to isolate jobs from one another.

It is just that when people never need to build Docker images, the existence of containers is easy to overlook.

If that summary is not enough and you want to dig into the question in detail, bear with me as I peel back Google's and Facebook's development and computing technology stacks layer by layer.

Packaging

When we submit a distributed job to a cluster for execution, we need to transfer the program to be executed (including an executable file and related files, such as *.so, *.py) to some machines (nodes) assigned to this job by the scheduling system.

Where do the files to be packaged come from? They are produced by a build at that point: with Blaze at Google, and with Buck at Facebook.

Those who are interested can take a look at the open source version of Google Blaze, Bazel[3], and the open source version of Facebook Buck[4].

One reminder: the internal versions of Blaze and Buck are designed for monolithic repos, while the open source versions are meant for everyone's non-mono repos, so the concepts and implementations differ somewhat; still, you can get a feel for the basic usage.

Suppose we have the following module dependencies, described in Buck or Bazel syntax (the syntax is almost the same):

python_binary(name="A", srcs=["A.py"], deps=["B", "C"], ...)
python_library(name="B", srcs=["B.py"], deps=["D"], ...)
python_library(name="C", srcs=["C.py"], deps=["E"], ...)
cxx_library(name="D", srcs=["D.cxx", "D.hpp"], deps="F", ...)
cxx_library(name="E", srcs=["E.cxx", "E.hpp"], deps="F", ...)

Then the module (build result) dependencies are as follows:

A.py --> B.py --> D.so -\
     \-> C.py --> E.so --> F.so
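To get a feel for the tooling, here is a hypothetical invocation of the open source build tools against such targets (the internal command lines at Google and Facebook differ, and the Bazel line assumes equivalently named rules in a top-level BUILD file):

buck build //:A     # Buck builds target A and, transitively, B, C, D, E, and F
bazel build //:A    # the Bazel equivalent, assuming analogous rules exist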

If it is an open source project, please use your imagination to replace the above modules with projects such as GPT-3, PyTorch, cuDNN, libc++, etc.

Of course, each project contains multiple modules and depends on other projects, just as each module has multiple sub-modules.

Tarball

The simplest way to package is to bundle the files {A,B,C}.py and {D,E,F}.so into a single archive, A.zip or A.tar.gz.

More precisely, the file name should include the version number. For example, A-953bc.zip, where version number 953bc is the git/Mercurial commit ID.

Introducing the version number can help cache the file locally on the node, so that the next time you run the same tarball, you don't need to download the file.

Note that this introduces the concept of package caching; keep it in mind for the discussion of Docker below.
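A minimal sketch of building such a versioned tarball, assuming the build outputs sit in the current directory and Mercurial is the version control system:

VERSION=$(hg id -i)                 # or: git rev-parse --short HEAD
tar czf A-$VERSION.tar.gz A.py B.py C.py D.so E.so F.so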

XAR

After the ZIP or tarball file is copied to the cluster node, it needs to be unzipped to somewhere on the local file system, for example: /var/packages/A-953bc/{A,B,C}.py,{D,E,F}.so.

A slightly slicker way is to skip the tarball and put the above files into a filesystem image on a loopback device. That way, "unzip" becomes "mount".

Note that this introduces the concept of a loopback device image; keep it in mind for the discussion of Docker below.

What is a loopback device image? In Unix, a directory tree of files is called a filesystem.

Typically a filesystem is stored on a block device. What is a block device?

Simply put, any storage space that can be viewed as a byte array is a block device.

For example, a hard disk is a block device. The process of creating an empty directory tree structure in a newly purchased hard disk is called formatting.

Since a block device is just a byte array, isn't a file also a byte array?

Yes! In the Unix world, we can create an empty file of a fixed size (using the truncate command), and then "format" the file to create an empty file system in it. Then put the above files {A,B,C}.py,{D,E,F}.so into it.
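A minimal sketch of this trick, assuming ext4 just for illustration:

truncate -s 64M image.ext4               # create an empty 64 MB file
mkfs.ext4 -F image.ext4                  # "format" the file: build an empty filesystem inside it
mkdir -p /mnt/pkg
sudo mount -o loop image.ext4 /mnt/pkg   # mount the file as if it were a block device
sudo cp A.py B.py C.py D.so E.so F.so /mnt/pkg/
sudo umount /mnt/pkg                     # the files now live inside image.ext4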

Facebook, for example, open-sourced the XAR file format [5], which is designed to be used with Buck.

If we run buck build A, we get A.xar. This file consists of a header followed by a squashfs loopback device image, a squashfs image for short.

Here squashfs is an open source compressed, read-only filesystem. Interested readers can follow this tutorial [6]: pack a directory of files into a squashfs image, then mount the image at a directory (mount point) in the local filesystem to browse its contents.

The files live inside the image file itself, so we can copy the image and distribute it to other people, and everyone who mounts it sees the files we put in.
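A minimal sketch with the squashfs tools, under the same assumption that the build outputs are at hand:

mkdir staging
cp A.py B.py C.py D.so E.so F.so staging/
mksquashfs staging A.squashfs                        # pack the directory into a compressed, read-only image
mkdir -p /mnt/A
sudo mount -t squashfs -o loop A.squashfs /mnt/A     # mount it to browse the contents
ls /mnt/A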

Because XAR adds a header in front of the squashfs image, you cannot mount it with a plain mount -t squashfs command; you have to use mount -t xar or xarexec -m.

For example, if a node has /packages/A-953bc.xar, we can use the following command to view its contents without first spending CPU to unpack it:

xarexec -m A-953bc.xar

This command will print out a temporary directory, which is the mount point of the XAR file.

Layering

If we modify A.py now, the entire package needs to be updated regardless of whether it is built into a tarball or XAR.

Of course, as long as the build system supports caching, we do not need to regenerate every *.so file.

But this does not solve the problem of having to redistribute the .tar.gz and .xar files to each node in the cluster.

An older package, say A-953bc.{tar.gz,xar} from a previous version, may already be cached on the node, but none of it can be reused. Reuse requires layering.

For the above situation, we can construct multiple XAR files based on the module dependency graph.

A-953bc.xar --> B-953bc.xar --> D-953bc.xar -\
            \-> C-953bc.xar --> E-953bc.xar --> F-953bc.xar

Each XAR file contains only the files generated by the corresponding build rule. For example, F-953bc.xar contains only F.so.

In this way, if we only modify A.py, then only A.xar needs to be rebuilt and transferred to the cluster nodes. This node can reuse the previously cached {B,C,D,E,F}-953bc.xar file.
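Sketched in commands, with a hypothetical node name and placeholder version:

buck build //:A                            # rebuilds only the changed layer, yielding a new A-<version>.xar
scp A-<version>.xar node017:/packages/     # ship just that one XAR; the cached {B,C,D,E,F}-953bc.xar layers are reused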

Assuming that a node already has /packages/{A,B,C,D,E,F}-953bc.xar, can we run the xarexec -m command in the order of module dependencies to mount these XAR files to the same mount point directory in turn to obtain all the contents in them?

Unfortunately, no: each subsequent xarexec/mount command fails because the mount point is already occupied by the previous one.

Here's why filesystem images are better than tarballs.

Taking a step back: instead of XAR, why not just use ZIP or tar.gz? We could, but it would be slow. We can extract all the .tar.gz files into the same directory.

But when A.py is updated, we cannot easily pick out the stale files from the old package and replace only them; we have to re-extract all the .tar.gz files into a fresh directory, and re-extracting {B,C,D,E,F}.tar.gz is slow.

Overlay Filesystem

There is an open source tool fuse-overlayfs. It can "overlay" several directories.

For example, the following command "overlays" the contents of the directories /tmp/{A,B,C,D,E,F}-953bc onto the directory /packages/A-953bc.

fuse-overlayfs -o \
  lowerdir="/tmp/A-953bc:/tmp/B-953bc:..." \
  /packages/A-953bc

The directories /tmp/{A,B,C,D,E,F}-953bc come from xarexec -m /packages/{A,B,C,D,E,F}-953bc.xar.
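Putting the two steps together, here is a hypothetical end-to-end sketch; the loop and the mount-point handling are assumptions for illustration, since xarexec prints the actual mount point of each layer:

lower=""
for m in F E D C B A; do
  dir=$(xarexec -m /packages/$m-953bc.xar)   # xarexec prints the mount point of this layer
  lower="$dir${lower:+:$lower}"              # build lowerdir as A:B:C:D:E:F, top layer first
done
mkdir -p /packages/A-953bc
fuse-overlayfs -o lowerdir="$lower" /packages/A-953bc
ls /packages/A-953bc                         # A.py B.py C.py D.so E.so F.so, merged from all layers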

Note that this introduces the concept of an overlay filesystem; keep it in mind for the discussion of Docker below. So how does fuse-overlayfs do this?

When we access any file system directory, such as /packages/A, the command-line tools we use (such as ls) invoke system calls (such as open/close/read/write) to access the files in it.

These system calls deal with the file system driver - they will ask the driver: Is there a file called A.py in the directory /packages/A?

If we use Linux, the filesystem on the hard disk is usually ext4 or btrfs. The kernel's generic filesystem layer checks which filesystem each partition uses and forwards the system call to the corresponding ext4 or btrfs driver.

General filesystem drivers run in kernel mode like other device drivers.

This is why we usually need sudo when running commands such as mount and umount that operate filesystems. FUSE is a library for developing filesystem drivers in userland.

The fuse-overlayfs command uses the FUSE library to develop a fuse-overlayfs driver that runs in userland.

When the ls command asks the overlay driver what is in the directory /packages/A-953bc, the fuse-overlayfs driver remembers that the user previously ran the fuse-overlayfs command to overlay the directories /tmp/{A,B,C,D,E,F}-953bc, so it returns the files in those directories.

At this point, because the directories /tmp/{A,B,C,D,E,F}-953bc are in fact mount points of /packages/{A,B,C,D,E,F}-953bc.xar, each XAR is effectively one layer.

A filesystem driver that "overlays" multiple directories, like the fuse-overlayfs driver, is called an overlay filesystem driver, sometimes referred to as overlay filesystems.

Docker Image and Layer

As mentioned above, overlay filesystem is used to achieve layering. Anyone who has used Docker will be familiar with the fact that a Docker image consists of multiple layers.

When we run the docker pull <image-name> command, if the local machine has already cached some layers of this image, we will skip downloading these layers. This is actually achieved using the overlay filesystem.

The storage drivers most commonly used by Docker are built on a kernel filesystem called overlayfs; here overlayfs is the name of a specific filesystem.

As the name implies, overlayfs provides exactly this "overlay" capability, which is why every Docker image can have multiple layers.

Docker's overlay storage driver and its successor overlay2 both rely on overlayfs running in kernel mode.

This is one of the reasons why Docker requires root privileges on the machine, which is in turn why Docker is criticized for easily causing security loopholes.

There is a filesystem called btrfs, which has developed rapidly in the Linux world in recent years and is very effective in managing hard disks.

This filesystem supports snapshots, which can also be used to implement layering, so Docker can be configured to use btrfs instead of overlayfs.

However, Docker can only use the btrfs driver if the local filesystem on the Docker host is itself btrfs.

So, if you are using macOS or Windows, you definitely cannot use btrfs with Docker.

fuse-overlayfs, on the other hand, works just about everywhere. A filesystem running in userland via FUSE has mediocre performance, but the use case discussed in this article does not need much performance.

In fact, Docker can also be configured to use fuse-overlayfs. The list of layered storage drivers supported by Docker is available at Docker storage drivers [7].
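As a quick check, and as a hedged sketch of switching drivers (the daemon.json edit assumes a Linux host, would overwrite any existing daemon configuration, and requires restarting the Docker daemon):

docker info --format '{{.Driver}}'                                 # prints e.g. overlay2, btrfs, or fuse-overlayfs
echo '{ "storage-driver": "fuse-overlayfs" }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker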

Why do we need Docker images?

To summarize the above, from programming to running on the cluster, we need to do several steps:

  1. Compile: compile the source code into executable form.
  2. Package: put the compilation results into a "package" for deployment and distribution.
  3. Transfer: the cluster management system (Borg, Kubernetes, Tupperware) that wants to start a container on a cluster node must transfer the "package" to that node, unless the node has run this program before and already has the package cached.
  4. Unpack: if the package is a tarball or a ZIP file, it has to be extracted after it arrives on the node; if it is a filesystem image, it has to be mounted.

Dividing the source code into modules lets the compile step exploit the fact that each change touches only a small part of the code: only the modified modules are recompiled, which saves time.

To save time in steps 2, 3 and 4, we want the "package" to be layered, and each layer should ideally contain only one or a few code modules. That way, the dependencies between modules can be exploited to reuse the layers containing the lower-level modules as much as possible.

In the open source world, we use Docker images to support layered features. A base layer may only include userland programs of a certain Linux distribution (such as CentOS), such as ls, cat, grep, etc.

On top of that, there can be a layer including CUDA. Then install Python and PyTorch on it. The next layer above is the training procedure for the GPT-3 model.

In this way, if we only modify the GPT-3 training program, there is no need to repackage and re-transfer the three layers beneath it.
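The effect is easy to observe; the image names and tag below are examples, not a prescription:

docker pull nvidia/cuda:11.0-base        # pulls the OS and CUDA layers (example tag)
docker history nvidia/cuda:11.0-base     # lists the image's layers, newest first
docker pull my-registry/gpt3-train:v2    # hypothetical training image; lower layers already in the
                                         # local cache are reported as "Already exists" and skipped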

The core of this logic is the concept of a "project". Each project can have its own repo, its own build system (GNU make, CMake, Buck, Bazel, etc.), and its own releases.

Each project's release then goes into one layer of a Docker image; together with the layers beneath it, it forms an image.

Why Google and Facebook don't need Docker

After all the knowledge preparation mentioned above, we can finally get to the point.

Google and Facebook use a monolithic repository and a unified build system (Blaze at Google, Buck at Facebook).

One could still carve the code into "projects" and load each project's build result into a layer of a Docker image, but there is no real need to.

Using the modules defined by Blaze and Buck build rules, and the dependencies between those modules, we can do away with the notions of packing and unpacking altogether.

Without packages, there is no need for zip, tarball, Docker image and layers.

Just treat each module as a layer. If D.so is recompiled because we modified D.cxx, we only need to re-transfer D.so, rather than a whole layer that happens to include D.so.
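A hypothetical sketch of what the transfer step shrinks to; the paths and node name are made up for illustration, and no package of any kind is involved:

buck build //:D                              # only D.so is rebuilt after editing D.cxx
rsync D.so node017:/var/modules/A-953bc/     # ship the single module; B, C, E, F stay cached on the node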

So, Google and Facebook benefit from monolithic repositories and unified build tools.

We shortened the above four steps to two:

  1. Compile: compile source code into executable form.
  2. Transfer: If a module is recompiled, transfer this module.

Google and Facebook are not using Docker

The previous section explained that the monolithic repo means Google and Facebook do not need Docker images.

The reality is that Google and Facebook are simply not using Docker. "Not needing" and "not using" are two different things.

Let's start with "not using". Google and Facebook were running hyperscale clusters long before Docker and Kubernetes appeared. Back then, to keep packaging simple, there was not even a tarball.

C/C++ programs were fully statically linked, with no *.so at all, so a single executable binary file was the "package".

To this day, anyone using the open source Bazel or Buck can see that the default linking mode is fully static.
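A fully statically linked binary is easy to recognize, for instance:

file ./A      # reports "... statically linked ..." instead of "dynamically linked"
ldd ./A       # prints "not a dynamic executable"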

Although Java is a "fully dynamically linked" language, its birth and evolution rode the historical wave of the Internet, and its developers invented the jar format, which plays the role of full static linking by bundling everything into a single file.

Python itself has no equivalent of the jar, so Blaze and Bazel invented the PAR file format, essentially a jar designed for Python. An open source implementation, subpar, is available here [8].

Similarly, Buck invented the XAR format mentioned above: a header placed in front of a squashfs image. Its open source implementation is available here [9].

Go is fully statically linked by default. In some of Rob Pike's early retrospectives, he mentions that much of Go's design, including full static linking, was essentially a way to sidestep the pitfalls Google had run into in its C/C++ practice.

Readers familiar with the Google C++ style guide may notice that Go's syntax roughly covers the parts of C++ the guide says you should use, while leaving out the parts the guide says you should not.

Simply put, historically Google and Facebook have not used Docker images. One important reason is that their build systems can fully statically link programs in various common languages, so the executable file is the "package".

But this is not an ideal solution, because there is no layering at all. Even if I change just one line in the main function, recompiling and redeploying can take ten minutes or even tens of minutes; a fully statically linked executable is often measured in gigabytes.

So although full static linking is one of the reasons why Google and Facebook are not using Docker, it is not a good choice.

That is why other companies did not follow suit; people still prefer Docker images, which support layered caching.

Technical Challenges of a Perfect Solution

A perfect solution would support layered caching (or, more precisely, chunked caching), while still exploiting the monolithic repo and unified build system described above.

But there is a technical challenge: the build system describes code in terms of modules, and modules are usually much finer-grained than "projects".

Taking C/C++ as an example, if every module produced a .so file to serve as a "layer" or "chunk" acting as the cache unit, the number of .so files an application needs would be far too large.

When you start the application, it may take dozens of minutes to resolve symbols and complete the link.

So although the monolithic repo has many benefits, it also has a drawback: unlike the open source world, nobody manually carves the code into "projects".

In the open source world, each project is usually a GitHub repo that may contain many modules, yet all of a project's modules are built into a single *.so that serves as the cache unit.

Since an application never depends on very many projects, the total number of layers stays under control.

Fortunately, this problem is not unsolvable. Since an application's dependencies on its modules form a DAG, we can always apply graph partitioning to decompose the DAG into a small number of subgraphs.

Taking C/C++ as an example again, we can compile each module within a subgraph into a .a, then link all the .a files of that subgraph into one *.so that serves as the cache unit.

Therefore, how to design the graph partitioning algorithm becomes the most important issue at hand.

Related links:

[1] https://engineering.fb.com/2019/06/06/data-center-engineering/twine/
[2] https://zhuanlan.zhihu.com/p/55452964
[3] https://bazel.build/
[4] https://buck.build/
[5] https://github.com/facebookincubator/xar
[6] https://tldp.org/HOWTO/SquashFS-HOWTO/creatingandusing.html
[7] https://docs.docker.com/storage/storagedriver/select-storage-driver/
[8] https://github.com/google/subpar
[9] https://github.com/facebookincubator/xar
