M0UNTAIN 0F C0DE

Parallel Processing + Progress

parallel

There have been a number of times (eg) when I've wanted to run a bunch of processes as fast as possible. I needed:

I'm sure I could download/compile something purpose-built, but I felt like I should be able to solve this with standard command line tools. I also thought it would be a fun challenge 😛

This is what I came up with:

ppp() {
    mkdir -p results;
    local worker="
        echo;
        hash=\$(echo '%' | sha256sum | head -c 16)
        [ -f \"results/\$hash\" ] && exit 0;
        (echo '%'; ($2) 2>&1) > \"results/\$hash\"
    "

    xargs --arg-file="$1" -I % --max-procs="$(nproc)" -- bash -c "${worker}" | \
        pv --progress --rate --eta --line-mode --size $(wc -l < $1) \
        > /dev/null
}

The usage is ppp /path/to/file 'command %'. Each line in /path/to/file is substituted in for the %.

For example, to curl a bunch of URLs: ppp urls 'curl %'

How It Works

It uses xargs with --max-procs to spawn processes, pv to monitor progress and a 4-line worker script to do the rest. The worker script is where most of the magic happens:

# Write a newline. This gets piped to pv which updates the progress bar
echo;

# Create a short hash of the line to process
hash=\$(echo '%' | sha256sum | head -c 16)

# Check if the line has already been processed
[ -f \"results/\$hash\" ] && exit 0;

# Execute the command and write the output to a unique file
(echo '%'; ($2) 2>&1) > \"results/\$hash\"

Result Files

The output of each process is written to a dedicated file: results/{HASH}. is the hash of the line in the source file. Both stdout and stderr are redirected.

Additionally, the first line of each result file is the line which was processed. This is to make it possible to find the result of a specific process via grep or similar.