The 15-minute wonder tool: a GNU Parallel getting started guide

GNU Parallel is a shell tool for executing computational tasks in parallel on one or more computers. This article briefly introduces the use of GNU Parallel.

Modern CPUs are multi-core.

[The original post had a series of joke images here showing how two cores, four cores, and sixteen cores "work".]

Okay, enough of the jokes; if I keep poking fun at Intel like this I'll get myself beaten up.

One boring weekend morning I spent half a day working through the GNU Parallel man page and tutorial. I have to say that half day was well worth spending, because I expect it will save me far more than half a day in the future.

This article does not attempt to translate the GNU Parallel man page or tutorial; ready-made translations already exist, and you can find them here or here.

I admit that the first few times I saw parallel's weird ::: syntax and the strange {}, {#}, {.}, {/} placeholders, I backed off; syntax that ugly is unattractive. Fortunately, a few examples calmed me down, and once I tried it myself I found it really is a magical tool.

The main purpose of this article is to lure you into using this tool and tell you why and how to use it.

why

There is only one reason to use GNU Parallel: to be fast!

Fast installation

(wget -O - pi.dk/3 || curl pi.dk/3/) | bash

The author says installation takes 10 seconds. From inside China it may take a little longer, but not much. Parallel is actually a single-file Perl script of more than 10,000 lines (yes, you read that right: all modules live in this one file; this is a feature~). Afterwards I wrote a fabric script that copies it directly to each node machine, then chmods it executable.

Next comes fast execution: parallel runs your program across the system's cores in parallel.

As a benchmark: grep a 1 GB log, once with parallel and once with plain grep. [The original post showed timing screenshots here.] The result is obvious: a 20x difference, far more effective than optimizing with ack or ag.

Note: this was measured on a 48-core server.

how

The easiest way to parallelize is xargs, whose -P option spreads work across multiple processes and hence multiple cores.

For example:

$ time echo {1..5} |xargs -n 1 sleep

real 0m15.005s
user 0m0.000s
sys 0m0.000s

This xargs invocation passes each number from echo to sleep as a single argument, one process at a time, so the total time is 1+2+3+4+5 = 15 seconds.

With -P 5, five processes run concurrently, sleeping 1, 2, 3, 4, and 5 seconds at the same time, so the total wall-clock time is that of the longest sleep: 5 seconds.

$ time echo {1..5} |xargs -n 1 -P 5 sleep

real 0m5.003s
user 0m0.000s
sys 0m0.000s

So much for the warm-up. Parallel's first mode is essentially a drop-in replacement for xargs -P.

For example, compress all HTML files.

find . -name '*.html' | parallel gzip --best
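Before letting parallel loose on real files, its --dry-run flag prints the commands it would generate without executing them, which is a handy way to sanity-check a pipeline like the one above:

```shell
# Preview the commands parallel would run, without executing anything.
printf '%s\n' a.html b.html | parallel --dry-run gzip --best
# gzip --best a.html
# gzip --best b.html
```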

Parameter transfer mode

In the first mode, parallel passes arguments: each line arriving from the pipe becomes an argument to the command that follows, and the resulting commands run in parallel.

For example:

huang$ seq 5 | parallel echo pre_placeholder_{}
pre_placeholder_1
pre_placeholder_2
pre_placeholder_3
pre_placeholder_4
pre_placeholder_5

{} is a placeholder that holds the incoming argument.
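Besides {}, there are other placeholders: {.} is the input with its extension removed, {/} is its basename, {//} its directory, and {#} the job's sequence number. A small sketch (with -k to keep output in input order):

```shell
# Show the raw input, the input without its extension, and the job number.
printf '%s\n' a.txt b.log | parallel -k echo 'in={} noext={.} job={#}'
# in=a.txt noext=a job=1
# in=b.log noext=b job=2
```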

Cloud operations often involve batch work, such as creating 10 cloud disks:

seq 10 | parallel cinder create 10 --display-name test_{}

Creating 50 cloud hosts:

seq 50 | parallel nova boot --image image_id --flavor 1 --availability-zone az_id --nic vnetwork=private --vnc-password 000000 vm-test_{}

Deleting cloud hosts in batches

nova list | grep some_pattern | awk '{print $2}' | parallel nova delete

Rewrite the for loop

As you can see, I have replaced loops with parallel in many places and enjoyed the convenience that parallelism brings.
The reason this works is that the iterations of a for loop are usually context-independent, which makes them ideal candidates for parallelization.

As a general abstraction, a shell loop of the form:

 (for x in `cat list`; do
 do_something $x
 done) | process_output

can be rewritten directly as:

 cat list | parallel do_something | process_output

If the loop body is long:

 (for x in `cat list`; do
 do_something $x
 [... 100 lines that do something with $x ...]
 done) | process_output

it is better to wrap the body in a function:

 doit() {
 x=$1
 do_something $x
 [... 100 lines that do something with $x ...]
 }
 export -f doit
 cat list | parallel doit

A function also avoids a lot of troublesome escaping.
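A minimal runnable version of that pattern (the function name doit and its echo body are just stand-ins for real work; note that export -f is bash-specific):

```shell
#!/bin/bash
# Wrap the loop body in a function, export it, and let parallel invoke it.
doit() {
  x=$1
  echo "processed $x"   # stand-in for the real work on $x
}
export -f doit          # make the function visible to the shells parallel spawns
printf '%s\n' one two three | parallel -k doit
# processed one
# processed two
# processed three
```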

--pipe mode

The other mode is parallel --pipe.

Here the data coming down the pipeline is fed to the following command as standard input rather than as arguments.

For example:

cat my_large_log |parallel --pipe grep pattern 

Without --pipe, each line of my_large_log would be expanded into a command of the form grep pattern <line>. With --pipe, the result is no different from cat my_large_log | grep pattern, except that the input is split into chunks and the grep work is distributed across cores.
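By default --pipe hands each job a chunk of roughly 1 MB, split on newlines; the chunk size is tunable with --block. A self-contained sketch, using a tiny block size just to force several chunks:

```shell
# Split the input stream into ~1 KB chunks and run one grep per chunk.
seq 1000 | parallel --pipe --block 1k grep '^99'
# Emits the 11 matching lines (99 and 990-999); add -k to fix the chunk order.
```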

Okay, that covers the basic concepts! Everything else is the specific use of various flags: how many jobs to run, placeholder substitution, the various ways of passing arguments, running in parallel while keeping output in input order (-k), and the magical cross-node parallel execution. Read the man page to find out more.

bonus

Having a quick way to bolt parallel onto a pipeline not only speeds up your daily work; it is also handy for testing concurrency.

Many interfaces have bugs that only show up under concurrency. For example, a limit may be enforced in application code without a database lock: each concurrent request passes the check when it reaches the server, and once they are all written the limit is exceeded. A serial for loop never triggers such problems, but to really test concurrency you would normally have to write a script or wrap something with Python's multiprocessing. With parallel at hand, I just added the following two aliases to my bashrc:

alias p='parallel'
alias pp='parallel --pipe -k' 

This makes it very convenient to generate concurrency: I just add a p after a pipe, and I can observe the response under concurrent load at any time.

For example

seq 50 | p -n0 -q curl 'example.com'

This makes concurrent requests, as many at a time as you have cores. -n0 means the seq output is not passed as an argument to the command that follows.
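The number of simultaneous jobs is controlled by -j: -j 10 runs ten at a time regardless of core count, -j 200% runs two per core, and -j0 runs as many as possible. A sketch (the sleep is a stand-in for a real request):

```shell
# Four jobs, at most two running at a time; -k keeps output in input order.
seq 4 | parallel -k -j 2 'sleep 0.1; echo done_{}'
# done_1
# done_2
# done_3
# done_4
```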

Gossip time: the Xianglin Sao of GNU

As a lover of free-software gossip, every time I discover a new and interesting piece of software I google it with site:https://news.ycombinator.com and with the keyword plus site:http://www.reddit.com/, then read the comments; the discussions often turn up unexpected things.

That is how I saw a complaint on Hacker News, which basically said that every time parallel runs it prints a notice telling you that if you use the tool for academic work (many people in the life sciences use it) you must cite the author's paper, or else pay him 10,000 euros. I learned a new word from this: nagware, software that, like Tang Seng (the endlessly preachy monk of Journey to the West), nags you into paying; hence the section title, since Xianglin Sao is the Lu Xun character who endlessly repeats the same story. Although I do think the paper should be cited when the tool is really used in research, the nagging is another matter. As one commenter put it:

I agree it's a great tool, except for the nagware messages and their content. Imagine if the author of cd or ls had the same attitude...

In addition, the author really does like having his software cited, so much so that it even shows up in the NEWS file. [The original post quoted the relevant NEWS entry here.]

Principle time

To quote the author's answer on Stack Overflow directly:

GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to.

If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU. [The answer includes a diagram here.]

GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time. [Another diagram follows in the original answer.]

in conclusion

This article introduced a genuinely parallel tool, explained its two main modes, shared a tip, and gossiped about a lesser-known side of the GNU world. I hope it is useful to you.

