Dec 31 2009

Using what’s available

datchley

I’ve seen a lot of questions from new and beginning programmers on various forums and other online communities. Many are geared toward a particular language, like shell programming, Perl or C, but nearly all of them come from Unix-style environments, primarily Linux. Often they ask how to do a given task in a given language when that task might be better done another way. Yes, there are times when you’ve been handcuffed to a particular language or operating environment, but there are generally 1001 ways to skin a cat in Unix, and people tend to forget this as they’re picking up a language like Perl, Python or Ruby. If you are writing scripts and programs on Unix that operate on the back end of things (i.e. not CGI or web related, and not dealing with GUIs), you are almost always better off combining your primary language with the smaller scripting languages and other tools that Unix provides. They make your life easier and help keep your code manageable.

In our example, someone on a Perl forum was asking how to copy a set of files from a directory and its subdirectories into a single other directory. All the files were gzipped, with a .gz extension. Doing this in Perl using File::Find or some similar module is overkill for this task. The Unix shell and its accompanying commands can handle it with less typing. (The snippets here move the files with mv and move(); swap in cp and copy() to leave the originals in place.)

$ find . -name '*.gz' | xargs -I {} mv {} /new/parent/folder/

This is very quick and can even be called from the given Perl script via system() or backticks. Doing it in Perl with File::Find takes a number of lines to find the matching files in the directory tree, and you still have to write the code to move each file to the new directory. Just call the above snippet from your Perl script, check the return status and move on. No need to reinvent the wheel.
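One thing to watch with the pipeline above: xargs still treats quotes and backslashes in its input specially, so a file name such as it's.gz can trip it up. Below is a minimal sketch that sidesteps this by letting find run mv itself. The function name gather_gz and its two arguments are made up for illustration, and it assumes the target directory lives outside the tree being searched.

```shell
#!/bin/sh
# Sketch: move every regular *.gz file under a source tree into one flat
# target directory. gather_gz and its arguments are hypothetical names.
gather_gz() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # -exec hands each path to mv as a single argument, so spaces and
    # quotes in file names need no special handling.
    find "$src" -type f -name '*.gz' -exec mv {} "$dest"/ \;
}
```

Called as gather_gz . /new/parent/folder, it does the same job as the one-liner. As with the one-liner, two files with the same name in different subdirectories will overwrite each other in the target directory.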

For reference, here’s the equivalent operation using Perl only:

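#! /usr/bin/perl
use strict;
use warnings;

use File::Find;
use File::Copy;

my $target = "/new/parent/folder";

find(\&filemove, ".");
exit(0);

# Called once per entry. File::Find has already changed into the entry's
# directory, so $_ holds the bare file name.
sub filemove {
        return unless -f $_ && m/\.gz$/;
        move($_, "$target/$_") or warn "Cannot move $File::Find::name: $!\n";
}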

That’s much more coding. And honestly, you’re not gaining much portability unless you plan on running this in a non-Unix environment as well. Remember, there’s no problem combining different languages and scripts to accomplish a task. That’s what all those tools are there for to begin with!

Linux and Unix provide developers with a deep and rich toolset of commands and utilities, and they should take advantage of it when solving problems. It was a design goal of Unix to provide discrete commands and tools with very specific functionality that could be combined to solve more complex problems. So why not extend that paradigm to include all the great modern languages and tools that we use today as well?

Enjoy!


Dec 28 2009

Block Record Processing in Perl

datchley

In some work I’m doing at the moment, I need to take a rather large data file (about 1.5 million records) and create a smaller sample data file to work with. The original file has one record per line and is already sorted in ASCII order. The file in question is the movies.list file from the open IMDB database. It contains all the movies in IMDB’s database, sorted on the movie title. What I wanted was a block of records from each alphabetical block (a-z), giving me a smaller sampling of the file for testing purposes.

Now, not only are there movies starting with the letters A through Z, but also titles starting with symbols, such as ‘#’ or ‘$’, as well as numbers 1-9 and so forth. Quite a smattering of possibilities. What I wanted was a block of 20 records from each set, based only on the first character of the movie title. Here’s the quick script I came up with to generate the sample data file:

#! /usr/bin/perl
use warnings;
use strict;

my $BLOCKSIZE = 20;
my $n = 0;

FILE:
while ($_ = <>)  {
	# get character that defines this block
	my $block = quotemeta substr($_, 0, 1);

	while (m/^$block/) {
		print if $n++ < $BLOCKSIZE;
		last FILE unless defined($_ = <>);
		# records beyond $BLOCKSIZE are read but not printed
	}
	print;			# first record of next block, so print it now
	$n = 1;			# reset our block counter to 1, not 0 (we got that one already)
}

The original file was rather large and I didn’t want to read the entire thing into memory, so I needed to process it in one pass. The script is small and fairly quick, but let’s look at some of the primary points.

       my $block = quotemeta substr($_, 0, 1);

This takes the first character of the current input line, which tells us which block we’re starting. We’ll use this character in our regular expression as we go through the rest of the matching records in the block. The quotemeta call is there because of the non-alphanumeric characters mentioned earlier: it escapes them so they can be used safely in the regular expression that follows.

	while (m/^$block/) {
		print if $n++ < $BLOCKSIZE;
		last FILE unless defined($_ = <>);
		# records beyond $BLOCKSIZE are read but not printed
	}

Here, we use that character in our expression to test the current line (which obviously matches the first time through) and keep cycling through lines until we hit one that doesn’t start with the character we found previously. Each new line is read on line 15 of the script, which also bails out of both loops with last FILE once the input runs out. Since we only want 20 records from each block, we print a matching line only while the record count is below our block size, using $n to keep track of the count. Any records that match the block but exceed the block size simply get eaten up.

Lastly, once the inner loop ends, $_ holds a record that doesn’t match the previous pattern but starts the next block. Since we want the first 20 records of each block, we print this one right away and restart the record counter at 1 instead of 0 for the next pass (lines 18 & 19 of the script). One caveat: if that record turns out to be the only one in its block, the counter is still 1 when the following block begins, so that block is capped at 19 records rather than 20.
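In the spirit of using whatever tool is handy, the same one-pass sampling can also be sketched in awk. This is not the script above but a variant, wrapped in a shell function with the made-up name sample_blocks. It resets the counter whenever the first character changes instead of printing the boundary record separately, and it compares plain strings rather than building a regular expression, so nothing needs escaping.

```shell
#!/bin/sh
# Sketch: print at most "size" records per block, where a block is a run
# of consecutive lines sharing the same first character.
# sample_blocks is a hypothetical name; size defaults to 20.
sample_blocks() {
    size=${1:-20}
    awk -v size="$size" '
        {
            first = substr($0, 1, 1)
            # a new first character (or the very first line) starts a new block
            if (NR == 1 || first != block) { block = first; n = 0 }
            # print the first "size" records of the block, skip the rest
            if (n++ < size) print
        }
    '
}
```

It reads standard input and writes to standard output, so a run might look like sample_blocks 20 < movies.list > sample.list, where sample.list is just an example output name. Like the Perl version, it holds only one line in memory at a time.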

Got better ideas? I’d love to hear how you would tackle this problem, and even see some examples or short scripts from readers. Feel free to comment on my own script too; I’d love to hear about ways to improve it!