Dec 28 2009

Block Record Processing in Perl

datchley

In some work I’m doing at the moment, I need to take a rather large data file (about 1.5 million records) and create a smaller sample data file to work with.  The larger, original file had records, one per line, and the file was already sorted in ASCII order. The file in question was the movies.list file from the open IMDB database.  It contains all the movies in IMDB’s database, sorted on the movie title.  What I wanted was a block of records from each alphabetical block (a-z) giving me a smaller sampling of the file for testing purposes.

Now, not only are there movies starting with the letters A through Z, but also titles starting with symbols, such as ‘#’ or ‘$’, as well as numbers 1-9 and so forth.  Quite a smattering of possibilities.  What I wanted was a block of 20 records from each set, where a set is defined solely by the first character of the movie title.  Here’s the quick script I came up with to generate the sample data file:

#! /usr/bin/perl
use warnings;
use strict;

my $BLOCKSIZE = 20;
my $n = 0;

FILE:
while ($_ = <>)  {
	# get character that defines this block
	my $block = quotemeta substr($_, 0, 1);

	while (m/^$block/) {
		print if $n++ < $BLOCKSIZE;
		last FILE unless defined($_ = <>);
		# records beyond $BLOCKSIZE are read but not printed
	}
	print;			# first record of next block, so print it now
	$n = 1;			# reset our block counter to 1, not 0 (we got that one already)
}

The original file was rather large and I didn’t want to read the entire thing into memory, so I needed to process it in one pass. The script is small and fairly quick, but let’s look at some of the primary points.

       my $block = quotemeta substr($_, 0, 1);

This grabs the first character of the current input line so that we know which block we’re starting. We’ll use this character in our regular expression as we go through the rest of the matching records in this block. The quotemeta call is there because of the non-alphanumeric characters I mentioned occurring in the titles; it escapes them so they can be used safely in the regular expression that follows.

	while (m/^$block/) {
		print if $n++ < $BLOCKSIZE;
		last FILE unless defined($_ = <>);
		# records beyond $BLOCKSIZE are read but not printed
	}

Here, we use that character in our expression to test the current line (which matches on the first pass, since the character came from it) and keep cycling through lines until we hit one that doesn’t start with that character. Each new line is read by the last FILE unless defined($_ = <>) statement, which also ends the script at end of input. Since we only want 20 records from each block for our sample, we only print a matching line while the record count is below our block size, using $n to keep track of the count.  Any further records that match the block are read and silently discarded.

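Lastly, when the inner loop ends, $_ holds a record that doesn’t match the previous pattern but starts the next block. We want the first 20 records of each block, so we print this one right away and restart the record counter at 1 instead of 0 for the next pass (the final print and $n = 1 statements). One caveat: the outer loop then reads a fresh line before picking the block character, so if a block holds only a single record, the block after it starts with the counter already at 1 and gets 19 records instead of 20.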
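For comparison, here is a minimal sketch of the same one-pass idea written as a shell function around awk. It is an alternative, not the script above: the function name sample_blocks and the tiny inline input are made up for illustration. It compares the first character of each line with the previous one and resets the counter whenever it changes, so no line needs special handling at a block boundary.

```shell
#!/bin/sh
# sample_blocks MAX: hypothetical helper, not part of the original Perl script.
# Reads sorted records on stdin and prints at most MAX records for each run
# of lines that share the same first character.
sample_blocks() {
	awk -v max="$1" '
		{ c = substr($0, 1, 1) }        # first character defines the block
		c != prev { n = 0; prev = c }   # block changed: reset the counter
		n++ < max                       # print only the first max records
	'
}

# Example: keep 2 records per block from a tiny sorted input.
# Prints: #1, Alien, Aliens, Brazil (one per line); Amelie is dropped.
printf '%s\n' '#1' 'Alien' 'Aliens' 'Amelie' 'Brazil' | sample_blocks 2
```

Like the Perl version, this holds only one line in memory at a time. Piping the result through cut -c1 | uniq -c gives a quick count of records per block.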

Got better ideas? I’d love to hear how you would tackle this problem and even get some examples or short scripts from readers. Feel free to comment on my own script, as I’d love to hear about ways to improve it!


Dec 18 2009

chext: Batch rename file extensions

datchley

Most Linux systems come with a rename command today, but some of the commercial Unixes like AIX and HP-UX lack many of the commands that Linux users have come to rely on.  This isn’t quite the same as rename, but it’s a script I put together a while back because changing the extension on a number of files was something I did quite frequently on those systems.  Here’s the script; feel free to copy, paste, and use it.

#!/bin/sh
# chext - batch rename files by changing the file extension
# Author: Dave Atchley <dave@tuxz0r.net>
#----------------------------------------------------------------------

usage() {
	echo "Usage: $0 [-R] OLD NEW"
	echo
	exit 1
}

case "$1" in
# -R recursive option
-R) 	if [ $# -ne 3 ]; then
		echo "error: missing arguments"; usage
	fi
	ext=$2
	new=$3
	files=`find . -name "*$2"`	# double quotes so $2 is expanded
	;;
-h)	echo "chext: change the extension on multiple files"; usage
	;;
*) 	if [ $# -ne 2 ]; then
		echo "error: missing arguments"; usage
	fi
	ext=$1
	new=$2
	files=`echo *$1`
	;;
esac

for f in $files
do
	target=`echo "$f" | sed 's/'"$ext"'$/'"$new"'/'`
	mv "$f" "$target"
done
exit 0

By default the script handles files in the current directory; the -R option makes it work recursively.  You can call it from the command line as

$ chext .old .new

or recursively as

$ chext -R .old .new
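
As a variation on the renaming step, here is a rough sketch that avoids sed and copes with spaces in file names. It assumes a POSIX shell (for the ${var%suffix} expansion and $(...)), which may not hold for every older /bin/sh. The helper names new_name and rename_tree are made up for this example and are not part of chext.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the original chext script.

# new_name FILE OLD NEW: print FILE with a trailing OLD replaced by NEW.
new_name() {
	case "$1" in
	*"$2")	base=${1%"$2"}             # strip the old suffix literally
		printf '%s\n' "$base$3" ;;
	*)	printf '%s\n' "$1" ;;      # no match: leave the name unchanged
	esac
}

# rename_tree DIR OLD NEW: rename matching regular files under DIR.
# Reads find output line by line, so spaces in names are fine
# (names containing newlines are not handled).
rename_tree() {
	find "$1" -type f -name "*$2" | while IFS= read -r f; do
		target=$(new_name "$f" "$2" "$3")
		mv "$f" "$target"
	done
}

new_name "notes.old" .old .new      # prints notes.new
new_name "my file.old" .old .new    # prints my file.new
new_name "bold" .old .new           # prints bold (no .old suffix)
```

With these defined, rename_tree . .old .new would do roughly what chext -R .old .new does. Because the suffix is removed as a literal string, the dot in .old is never treated as a regular expression wildcard.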