How to use multi-core CPU to speed up your Linux commands (GNU Parallel)

Have you ever needed to process a very large amount of data (hundreds of GB)? Or search through it, or run some other operation on it that does not parallelize by itself? Data experts, I’m talking to you. You may have a CPU with 4 or more cores, but our usual tools, such as grep, bzip2, wc, awk, and sed, are single-threaded and can only use one CPU core.

To paraphrase the cartoon character Cartman, “How can I use these cores?”

To make Linux commands use all CPU cores, we need the GNU Parallel command, which lets all CPU cores perform map-reduce style operations on a single machine. This relies on the rarely used --pipe option (also called --spreadstdin), which splits standard input into blocks and hands each block to a separate job, so the load is spread across the CPU cores.

BZIP2

bzip2 usually compresses better than gzip, but it is slow! Don't worry, there is a way around this.

Previous practice:

cat bigfile.bin | bzip2 --best > compressedfile.bz2

Now like this:

cat bigfile.bin | parallel --pipe --recend '' -k bzip2 --best > compressedfile.bz2

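bzip2 benefits especially from GNU parallel on multi-core CPUs. Here --recend '' tells parallel that the binary input has no record separator, so it can be split at any position, and -k keeps the compressed blocks in their original order, so the output decompresses back to the original file.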

GREP

If you have a very large text file, you might have previously done this:

grep pattern bigfile.txt

Now you can do:

cat bigfile.txt | parallel --pipe grep 'pattern'

Or like this:

cat bigfile.txt | parallel --block 10M --pipe grep 'pattern'

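The second form uses the --block 10M option, which sets the size of each block to about 10 MB of input (not 10 million rows). parallel cuts the blocks at line boundaries, so no line is split between two jobs. Use this option to adjust how much data each job processes; the default block size is 1 MB. Matching lines may come out in a different order than in the file; add -k if the order matters.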

AWK

Below is an example of using awk to sum the numbers in a very large data file.

General usage:

cat rands20M.txt | awk '{s+=$1} END {print s}'

Now like this:

cat rands20M.txt | parallel --pipe awk \'{s+=\$1} END {print s}\' | awk '{s+=$1} END {print s}'

This is a bit complicated: the --pipe option of the parallel command divides the cat output into multiple blocks and dispatches each one to its own awk call, producing many partial sums. These partial sums are then piped into a second awk command, which adds them up and prints the final result. The first awk contains three backslashes; they protect the quotes and the $ from the calling shell, so that the awk program arrives intact in the shell that GNU parallel starts.

WC

Want to count the number of lines in a file as quickly as possible?

Traditional approach:

wc -l bigfile.txt

Now you can do this:

cat bigfile.txt | parallel --pipe wc -l | awk '{s+=$1} END {print s}'

This is quite clever: the parallel command first 'maps' the input onto many wc -l calls, each of which produces a partial count, and the partial counts are then piped to awk, which adds them up.

SED

Want to use sed to perform a large number of replacements in a huge file?

Conventional practice:

sed s^old^new^g bigfile.txt

Now you can:

cat bigfile.txt | parallel --pipe sed s^old^new^g

…and then you can redirect the output into a file of your choice. Add -k to parallel if the lines must stay in their original order.
