Here's a little background. At the place where I am consulting (Hi5.com), we need to perform rsync on a huge directory tree. Since we want this operation to be as fast as possible, the first measure the guys there took was to use the rsync protocol directly instead of rsync-over-ssh; that's a great speed boost.
Next, they (actually, Kenny Gorman) devised three scripts to be run one after another: one to generate a list of all files in the directory we want to copy, a second to split that list into 4 equal pieces, and a third to run those 4 pieces (batches) in the background, in parallel.
The problem with this approach is that some batches finish quickly, because the files they are rsyncing are smaller than the files other batches are working on. The result: we start with 4 parallel rsync commands, but somewhere down the line only one or two of them are still running. We lose parallelism quite quickly, and end up waiting on the batch(es) containing the large files, each of which processes its files sequentially.
So, I got to work trying to parallelize a bunch of commands placed in a file. This script reads lines from its standard input (stdin) and executes those lines using the shell. At any time, it will run only a specified number of commands, and wait for them to finish. As soon as one of the running commands finishes, the script reads the next line from stdin and executes it.
I have also added the ability to change the degree of parallelism while this script is running. Just create a file named 'degree' in /tmp/parallel.$PID/ and put a number in there, denoting the new degree of parallelism. This is quite useful for tweaking the degree of parallelism depending on your system load.
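For instance, suppose the script logged "PARALLEL: pid: 12345" at startup (12345 is an illustrative pid; use the one your run printed). You can bump it to 8 workers mid-run like this:

```shell
# 12345 is an illustrative pid; use the pid the script printed at startup.
mkdir -p /tmp/parallel.12345            # the script normally creates this itself
echo 8 > /tmp/parallel.12345/degree     # picked up (and removed) on the next loop
```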
I have made no special efforts in redirecting the stdin/stdout/stderr of the commands that are read and executed by this script. So, if you wish to record the progress of this script, or wish to store away your commands' output, just redirect this script's streams and save them.
An example use of this script is to remove all the files under a directory, in parallel (though it is overkill for such a simple task):
find /home/gurjeet/dev/postgres -type f | sed -e 's/\(.*\)/rm \1/g' > tmp.txt
cat tmp.txt | parallel.sh
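Note that the sed pattern above breaks on filenames containing spaces. A variant that wraps each path in double quotes works for those (still assuming no embedded double quotes or newlines in the names):

```shell
# Scratch directory and names here are illustrative.
dir=$(mktemp -d)
touch "$dir/a file with spaces"
find "$dir" -type f | sed -e 's/.*/rm "&"/' > tmp.txt
cat tmp.txt        # each line is now a properly quoted rm command
sh tmp.txt         # run them; or feed them in parallel: cat tmp.txt | parallel.sh
```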
Here's the script:
#!/bin/bash
# This script is licensed under GPL 2.0 license.
# This script uses some special features (look for the 'wait' command)
# provided by the Bash shell.
# get my pid
mypid=$$;
# determine a dir/ where I will keep my running info
MYDIR=/tmp/parallel.$mypid;
# echo my pid for the logs
echo PARALLEL: pid: $mypid;
# remove the directory/file if it is left over from a previous run
if [ -e $MYDIR ] ; then
rm -r $MYDIR
fi
# make my dir/
mkdir $MYDIR
# determine the degree of parallelization
degree=$1;
# default degree of parallelism, if not specified on command line
if [ "X$degree" = "X" ] ; then
degree=2;
fi
# echo for logs
echo PARALLEL: Degree of parallelism: $degree;
# read each line from stdin and process it
while read line ;
do
    while [ true ]; do
        # re-adjust degree of parallelization communicated through this file
        if [ -f $MYDIR/degree ] ; then
            new_degree=`cat $MYDIR/degree`
            rm $MYDIR/degree
            # accept the new degree only if it is a positive number
            if [ "$new_degree" -gt 0 ] 2>/dev/null ; then
                degree=$new_degree;
            fi
        fi
        # Look for a free slot
        for (( i = 0 ; i < $degree ; ++i )) ; do
            if [ ! -e $MYDIR/parallel.$i ]; then
                break
            fi
        done
        if [ $i -lt $degree ]; then
            break
        fi
        # if we can't find any free slot, retry after a sleep of 1 sec
        sleep 1;
    done
    # occupy this slot
    # echo PARALLEL: touching $MYDIR/parallel.$i;
    touch $MYDIR/parallel.$i
    # perform the task in background, and free the slot when done
    ( echo PARALLEL: $degree $mypid;
      sh -c "$line";
      # echo PARALLEL: removing $MYDIR/parallel.$i;
      rm $MYDIR/parallel.$i ) &
done
# Wait for all child processes to finish
wait;
# echo PARALLEL: removing base dir;
rm -r $MYDIR;
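As an aside, on bash 4.3 or newer the slot files and the sleep-poll loop can be avoided entirely: `wait -n` blocks until any one background job exits. Here is a minimal sketch of the same idea (fixed degree, no runtime re-tuning; this is an illustration, not the script above):

```shell
# parallelize: run commands read from stdin, at most $1 at a time (default 2).
# Requires bash >= 4.3 for 'wait -n'.
parallelize() {
    local degree=${1:-2} running=0 line
    while IFS= read -r line ; do
        if [ "$running" -ge "$degree" ]; then
            wait -n                      # block until any one child exits
            running=$((running - 1))
        fi
        sh -c "$line" &
        running=$((running + 1))
    done
    wait                                 # drain the remaining jobs
}

# usage: printf 'sleep 1\nsleep 1\nsleep 1\n' | parallelize 2
```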
Hello,
There are some improvements to this very useful script.
When reading from a file containing paths/filenames with whitespace or other special characters, one first HAS to edit it and backslash-escape all those special characters.
Then, the script must be modified.
When escaped characters are passed from script to script, they are usually interpreted away.
The following code snippets should solve the script parsing problem:
# read each line from stdin and process it
# parse backslash escaped characters as is with -r
# http://bash-hackers.org/wiki/doku.php/mirroring/bashfaq/001
while read -r line ;
do
while [ true ]; do
*****
# perform the task in background, and free the slot when done
( echo PARALLEL: $degree $mypid;
# sh -c "$line";
# parse backslash escaped characters as is with eval
# http://lists.samba.org/archive/rsync/2002-January/001222.html
eval $line
echo PARALLEL: removing $MYDIR/parallel.$i;
rm $MYDIR/parallel.$i ) &
done
*******
Regards and thanks again for the useful script.
Andre Felipe Machado
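Rather than hand-escaping the file list as Andre describes, the escaping can also be generated mechanically: bash's `printf '%q'` emits a backslash-escaped form of a string that survives the `read -r` / `eval` round trip. A small illustrative sketch (the filename is made up):

```shell
# Build a pre-escaped command line for a name with spaces (illustrative):
f='a file with spaces'
cmd=$(printf 'rm %q' "$f")
echo "$cmd"        # rm a\ file\ with\ spaces
```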
If you are going to install a single script to get parallelism, you might as well choose GNU Parallel.
It supports the same syntax:
cat file.sh | parallel
But it also supports an xargs-like mode:
cat args | parallel my_command
Watch the intro video to learn more: http://www.youtube.com/watch?v=OpaiGYxkSuQ