October 09, 2005

Inserting Random Delay In Cron Jobs

There may come a time when you need to insert a random delay into a shell script. The following script achieves this in a handful of lines of code.

Imagine that you have 100 servers -- all configured similarly with some flavor of Linux -- and there is some operation that you want to perform every five minutes. It could be a wget pull of spam filtering rules, or an rsync of a mirrored web site. It could be anything, really. Using cron, the simplest way to do this is to put this line in /etc/crontab:

*/5 * * * * root /some/operation

Note: These notes assume that the cron job will be run as root; however, it could just as easily be run as another user.

In our case, we're running rsync to synchronize some definitions from a master host. But there's a problem: our rsync server, in its default configuration running from xinetd, can only handle 30 concurrent requests. Because all 100 servers' clocks are synchronized via NTP, every five minutes 100 concurrent requests hit the master host at once. This results in xinetd shutting the listener down for a little while to prevent denial of service. So roughly 30 of the hosts get their update, and the other 70 die gracefully.

There are a few different ways we could fix this:

  1. Alter the crontab of each server to run the script at a slightly different time.
  2. Alter the script to insert a random delay before running the operation.
  3. Increase the maximum concurrent processes of the master host in xinetd.

The first item would take quite a bit of work on our part, since we'd have to change each system's crontab by hand. The last item doesn't scale well: imagine if we grow to 200 servers, or 500, or 2000. For our needs, we decided to insert a random amount of sleep time before running our synchronization. The thinking was that if the load of the synchronization were spread out over the five-minute period, the master server would rarely reach its concurrency limit.

The sleep command takes a single argument: the number of seconds to sleep. While there are fancier ways to get random numbers, we decided to use a built-in shell variable. The following is from the bash manual, describing a predefined variable that is available for use:

RANDOM - Each time this parameter is referenced, a random integer between 0 and 32767 is generated. The sequence of random numbers may be initialized by assigning a value to RANDOM. If RANDOM is unset, it loses its special properties, even if it is subsequently reset.

We use a while loop to generate a random number (using $RANDOM) within a set range. We then sleep for that random duration before executing our command. Why a set range? So that your cron job finishes before cron calls it again on the next cycle.

The range will depend on your cron job interval, how long the operation will take, etc. In our case, our rsync operation is short; we aren't synchronizing a lot of data. Taking that, and our known network speed, into consideration, we assume that it will never take more than 60 seconds to complete a sync. Therefore, if we're to sync every five minutes, our sleep period can be no more than four minutes (or 240 seconds). You can easily tweak these numbers to fit your needs. Many operations don't need to be performed so frequently, while some might need to be more frequent. It just depends on your needs.
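The arithmetic above can be written out explicitly. The variable names here are just for illustration, and the 60-second worst case is the assumption stated above:

```shell
# Hypothetical timing budget: the sleep ceiling is the cron interval
# minus the assumed worst-case sync duration.
interval=300    # cron fires every 5 minutes
worst_sync=60   # assumed worst-case rsync runtime
max_sleep=$((interval - worst_sync))
echo "$max_sleep"   # prints 240
```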

So, our sync script looks like this:

#!/bin/bash

# Grab a random value between 0 and 240. $RANDOM is a bash/ksh
# feature, so we use bash rather than plain sh.
value=$RANDOM
while [ $value -gt 240 ] ; do
  value=$RANDOM
done

# Sleep for that many seconds.
sleep $value

# Synchronize.
/usr/bin/rsync -aqzC --delete --delete-after masterhost::master /some/dir/
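Since the while loop can spin many times before $RANDOM lands on a small value, a shorter bash-specific alternative is modulo arithmetic. This is a sketch rather than the script we deployed; the modulo introduces a slight bias toward low values (32768 isn't a multiple of 241), which is harmless for spreading load:

```shell
#!/bin/bash

# Bound $RANDOM to 0-240 inclusive in a single step.
value=$((RANDOM % 241))
echo "would sleep for $value seconds"
# sleep $value
```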

One perk of this setup is that the sync script itself is updated along with the content (both live on the master host, e.g. masterhost). This makes redeployment of the script painless and fast.

Note that our /etc/crontab doesn't change much. It just calls the script instead of rsync directly.

*/5 * * * * root /some/dir/sync

For our implementation, the results are pretty good. Over the course of four minutes, sync requests are issued every few seconds; if we tail the rsync log file, the server appears generally busy (but not flooded) for the four-minute period.

If you're planning on a large scale deployment, here are some other things to consider:

  • Checking that rsync isn't already running before calling it.
    A ps -ef piped to grep -c rsync, checked before generating the random number, would do the trick. There are other approaches, but they're beyond the scope of this article.
  • Managing the load on the master server.
    If the load on the master server ever got to the point where it was hitting our concurrent session limit, we'd likely deploy a secondary master, or even a third. We could use either round robin DNS to balance the load across the systems, or another sort of load balancing (either in the script or via hardware). The script would be changed to reference the round robin DNS record (or virtual) IP for the balanced pool.

As usual, the explanation of what's going on here is quite a bit longer than the actual script. This might not be exactly what you're looking for, but hopefully it will give you an idea of where to go from here.

Posted by alexm at October 9, 2005 10:29 AM.
Send comments/suggestions to contact@moundalexis.com.