Jan 4 2010

IE and RegExp.exec()

datchley

In some recent code, I’m using JavaScript to parse the result of an AJAX call, which happens to return a full HTML page. Yes, ideally, I’d have an AJAX call return something usable like JSON, but in this case the PHP back-end code had to remain as is, and the front end had to adjust to handle the legacy HTML it returned.

I needed to grab one or more links from the returned HTML page so that I could immediately display them in separate windows (each was a generated report).  My first stab at this is shown in the following code example. We set up a string to represent the returned HTML, which in this case contains 3 <a> links, and we want to use the standard JavaScript RegExp object’s exec() method to grab the URL (the href attribute) of each of those links.  In our example, we just print them out in an unordered list to see what we’ve captured. The important line is the while loop condition, where exec() is called on an in-line regular expression.

var s='<a href="x">X</a>\n<a href="y">Y</a>\n<a href="z">Z</a>\n';
document.write('Found the following link URLs in the string:<br/><ul>');
while (matches = /<a href=['"](.*)['"]>.*<\/a>/g.exec(s)) {
  document.write('<li>' + matches[1] + '</li>\n');
}
document.write('</ul>');

When run, this gives the following results in Firefox/Safari/Chrome:

Found the following link URLs in the string:

  • x
  • y
  • z

Our while loop using RegExp.exec() on our in-line regular expression does what it’s supposed to and continues to match from where it left off in the string giving us our captured portion in the matches[] array.  However, when run in Internet Explorer, we get the following lovely result (at least up until IE tells us the script is no longer responding and asks us to kill it):

Found the following link URLs in the string:

  • x
  • x
  • x
  • x
  • x
  • x
  • x
  • x
  • x
  • …ad infinitum…

Obviously, we have generated an infinite loop using our code above in IE; but why?  The issue is that the lastIndex member never advances between iterations of the loop.  The regular expression in the while condition is written in-line, and IE creates a new RegExp object each time through the loop, which resets lastIndex to 0 (the beginning of the string). Therefore, we match the first link in the string forever, because the lastIndex pointer never progresses between matches.  The way around this is to declare the regular expression separately, outside the loop, so that it gets created just once, and then call exec() on that single RegExp object as follows:

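// Create the RegExp once, outside the loop, so lastIndex survives between exec() calls
var rx = /<a href=['"](.*)['"]>.*<\/a>/g;
var s='<a href="x">X</a>\n<a href="y">Y</a>\n<a href="z">Z</a>\n';
var matches;  // declared explicitly to avoid creating an implicit global
document.write('Found the following link URLs in the string:<br/><ul>');
while (matches = rx.exec(s)) {
  document.write('<li>' + matches[1] + '</li>\n');
}
document.write('</ul>');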

Now, the lastIndex member of our RegExp object gets updated correctly and we get the results we expected.  Somewhat related to this is an interesting lastIndex bug in IE with zero-length matches.  Hopefully, this will save someone a headache when debugging code that uses JavaScript’s RegExp.exec().
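
The fixed loop can also be wrapped in a small function that records lastIndex after each match, which makes the progression through the string visible. This is only a minimal sketch: collectHrefs is a hypothetical name that is not part of the original code, and the pattern uses lazy quantifiers (.*?) as a small tweak so that it would also cope with several links on one line.

```javascript
// Hypothetical helper (not from the original code): collect every href with a
// single RegExp object so that lastIndex carries over between exec() calls.
function collectHrefs(html) {
  var rx = /<a href=['"](.*?)['"]>.*?<\/a>/g; // created once, before the loop
  var found = [];
  var matches;
  while ((matches = rx.exec(html)) !== null) {
    // rx.lastIndex now points just past the match we received
    found.push({ url: matches[1], lastIndex: rx.lastIndex });
  }
  return found;
}

var s = '<a href="x">X</a>\n<a href="y">Y</a>\n<a href="z">Z</a>\n';
var links = collectHrefs(s);
// links[0].url is 'x', links[1].url is 'y', links[2].url is 'z',
// and each lastIndex is larger than the one before it.
```

Keeping the regular expression in a variable is the safe form in any browser. As far as I know, ECMAScript 5 later specified that a regex literal produces a new object each time it is evaluated, so the in-line version should not be relied on to advance lastIndex anywhere.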


Dec 28 2009

Block Record Processing in Perl

datchley

In some work I’m doing at the moment, I need to take a rather large data file (about 1.5 million records) and create a smaller sample data file to work with.  The larger, original file had records, one per line, and the file was already sorted in ASCII order. The file in question was the movies.list file from the open IMDB database.  It contains all the movies in IMDB’s database, sorted on the movie title.  What I wanted was a block of records from each alphabetical block (a-z) giving me a smaller sampling of the file for testing purposes.

Now, not only are there movies starting with the letters A through Z, but also titles starting with symbols, such as ‘#’ or ‘$’ as well as numbers 1-9 and so forth.  Quite a smattering of possibilities.  What I wanted to do was have a block of 20 records from each set, only based on the first character of the movie title.  Here’s the quick script I came up with to generate the sample data file:

#! /usr/bin/perl
use warnings;
use strict;

my $BLOCKSIZE = 20;
my $n = 0;

FILE:
while ($_ = <>)  {
	# get character that defines this block
	my $block = quotemeta substr($_, 0, 1);

	while (m/^$block/) {
		print if $n++ < $BLOCKSIZE;
		last FILE unless defined($_ = <>);
		# eat up records numbers over $BLOCKSIZE
	}
	print;			# first record of next block, so print it now
	$n = 1;			# reset our block counter to 1, not 0 (we got that one already)
}

The original file was rather large and I didn’t want to read the entire thing into memory, so I needed to process it in one pass. The script is small and fairly quick, but let’s look at some of the primary points.

       my $block = quotemeta substr($_, 0, 1);

This gets the first character of the current input line so that we know which block we’re starting. We’ll use this character in our regular expression as we go through the rest of the matching records in this block. The quotemeta part is there because of the non-alphanumeric characters I mentioned earlier: it escapes them so we can safely use them in the regular expression that follows.

	while (m/^$block/) {
		print if $n++ < $BLOCKSIZE;
		last FILE unless defined($_ = <>);
		# eat up records numbers over $BLOCKSIZE
	}

Here, we use that character in our expression to test the current line (which should obviously match) and keep cycling through lines until we hit a line that doesn’t start with the character we found previously. Each new line is read by the "last FILE unless defined($_ = <>)" statement, which also leaves the outer loop when the input runs out. Since we only want 20 records of each block for our sample, we only print the matching lines if the record count is less than our block size, using $n to keep track of our record count.  All other records that match the block but are over our block size just get eaten up.

Lastly, after the end of the inner loop we’ll have a record in $_ that doesn’t match our previous pattern but starts the next block. Since we want the first 20 records of each block, we print this one right away and restart our record counter at 1 instead of zero for the next block (the "print" and "$n = 1" statements at the end of the outer loop).
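
For comparison, the same one-pass sampling idea can be sketched with an explicit "current block" variable instead of a nested loop. The sketch below is written in JavaScript only to keep it in the same language as the other examples on this page; sampleBlocks is a hypothetical name, and it works on an array of lines rather than streaming a file, so treat it as an illustration of the logic and not as a replacement for the Perl script.

```javascript
// Hypothetical sketch: keep at most blockSize lines for each run of lines
// that share the same first character. Assumes the input is already sorted.
function sampleBlocks(lines, blockSize) {
  var out = [];
  var current = null; // first character of the block being processed
  var n = 0;          // number of records seen so far in this block
  for (var i = 0; i < lines.length; i++) {
    var first = lines[i].charAt(0);
    if (first !== current) {
      // A new block starts here: remember its character and reset the counter
      current = first;
      n = 0;
    }
    if (n < blockSize) {
      out.push(lines[i]);
    }
    n++; // records over blockSize are counted but not kept
  }
  return out;
}

var sample = sampleBlocks(['#a', '#b', '#c', '1x', 'Aa', 'Ab', 'Ac'], 2);
// sample is ['#a', '#b', '1x', 'Aa', 'Ab']
```

Comparing the first character directly also avoids the need to escape it for use in a regular expression.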

Got better ideas? I’d love to hear how you would tackle this problem and even get some examples or short scripts from readers. Feel free to comment on my own script as well, since I’d love to hear about ways to improve it!


Dec 15 2009

URL Hackery Followup: Regex

datchley

In a short follow-up to my last post, I thought I’d briefly cover the regular expressions I’m using and provide some links to good references on regular expressions in JavaScript. You can find a good introduction to basic regular expressions online, so I won’t belabor that summary here; I’ll assume you are familiar with regex in any of the languages and tools that support them, e.g. Perl, PHP, JavaScript, grep, et al.

Here’s the first snippet from our previous post that uses regex:

var loc = /(?:foo(?:-([^\.]+)?)?)\.*/.exec(window.location.href);

What we’re trying to do is match just the leading part of the domain name, which we get from window.location.href.  For example, if the domain name of our current URL is foo.company.com, then we want to match ‘foo’ and have that returned to us in our array of matches in the variable loc.  In JavaScript, exec() returns an array of matches (or null if nothing matched), the first element of which is the part of the string that the pattern matched.  If there are capturing parentheses, which we have here, then each captured portion will be in the following elements, in the order the groups open in the pattern.

In this case, some of our parentheses are “non-capturing”: those that start with “(?:”.  So, we’d like to match a domain name that starts with the string “foo”, optionally followed by a ‘-’ and one or more characters that are not periods, followed by zero or more literal periods (the ‘\.*’ at the end).  Note that ‘\.*’ only matches periods; it does not match the remaining part of the domain name.

The ‘?’ question mark at the end of a parenthesized group means “match 0 or 1 of the preceding thing”. This gives us the ability to make parts of the match optional, which is good, since in our case the production domain names don’t have a ‘-xxx’ in the first part, only a ‘foo’.  So, let’s break this down in a table and see what exec() would return to us for a number of different domain names:

url (string)            exec() returns
----------------------  --------------------
foo.company.com         ["foo.", undefined]
foo-bar.company.com     ["foo-bar.", "bar"]
bar.company.com         null
foobar.company.com      ["foo", undefined]

Notice in the last example, the full part of the string that exec matched for this pattern doesn't have the trailing '.' like the other examples. Remember, the first element in the array is the largest part of the string that the pattern could match completely, which in this case is just 'foo', since it's followed directly by a 'bar' instead of a period '.'.  Unless your quantifiers, like '+' and '*' are followed by a question mark, they are considered "greedy" and will match as much as they can while still satisfying the given pattern.
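
The rows of the table can be reproduced with a few lines of JavaScript. This is a minimal sketch that runs the pattern against plain host names instead of window.location.href; firstMatch is a hypothetical helper name used only for this illustration.

```javascript
// The pattern from the post, unchanged
var rx = /(?:foo(?:-([^\.]+)?)?)\.*/;

// Hypothetical helper: return [whole match, captured suffix] or null
function firstMatch(host) {
  var m = rx.exec(host);
  return m ? [m[0], m[1]] : null;
}

var hosts = [
  'foo.company.com',
  'foo-bar.company.com',
  'bar.company.com',
  'foobar.company.com'
];
var results = hosts.map(firstMatch);
// results[0] is ['foo.', undefined]
// results[1] is ['foo-bar.', 'bar']
// results[2] is null
// results[3] is ['foo', undefined]
```

Because the pattern has no g flag, every exec() call starts from the beginning of the string, so the same RegExp object can be reused across host names without worrying about lastIndex.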

Hopefully this makes some sense.  If you want to play around with the above line of Javascript and you run Firebug, just open up the console and cut-paste the line above in the 'Run' area and try it out.  Firebug is a boon for Javascript development and I wouldn't go anywhere without it.

And for your further education, a link to some Javascript specific regular expression tutorials and examples.

Enjoy!


Dec 14 2009

Javascript URL hackery

datchley

In one of the projects I have at work, we have a number of environments that run the system (a web-based application): production, testing environments and development.  Each environment is assigned its own domain name. So for instance, our Systems Integration Testing (SIT) environment would be accessed as foo-sit.company.com.  The environment is always appended to the system name in the URL after a ‘-’ dash.  The only time this is not the case is in production, where the URL would just be foo.company.com. This is pretty straightforward and an easy way to access multiple web application environments.

However, there are times when I want the application running in one environment to hit the database of another environment. For instance, when running reports that I’m coding in development, I want to run them against a more recent copy of the database in our SIT environment. Our backend PHP code allows this by accepting a ‘dbenv=ENV‘ POST parameter to tell it to use a different database environment than its default. So, I could have my development environment reports hit the SIT testing environment database by changing the URL to

http://foo-dev.company.com?dbenv=SIT

So, in the JavaScript code for a number of these reports, we need to determine whether the user wants a different database environment than the default and, if so, pass the dbenv=ENV parameter ourselves via the AJAX call that generates the report. Since we build the parameters for the report dynamically in JavaScript, this has to be dynamic as well. Here’s the snippet of code that does that in JavaScript land, with the names changed to protect the innocent:

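var o = {};
// Decode only the query string (the part after '?') into an object
o = Ext.urlDecode((window.location.href.split('?',2))[1]);

// Check !o first so we never read a property of an undefined object
if (!o || !o.dbenv) {
 o = o || {};
 // No explicit dbenv parameter: derive it from the '-env' suffix in the URL
 var loc = /(?:foo(?:-([^\.]+)?)?)\.*/.exec(window.location.href);
 // exec() returns null when nothing matches, so guard before reading length
 if (loc && loc.length == 2) {
     o.dbenv = (loc[1]) ? loc[1].toUpperCase() : 'PROD';
 }
 else {
     o.dbenv = 'DEV';
 }
}
else {
 // Strip every run of non-letter characters (note the g flag)
 o.dbenv = o.dbenv.replace(/[^A-Za-z]+/g,'');
}

var DBENV = o.dbenv;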

Our team is using ExtJS as a cross-browser JavaScript framework for our applications, so the ‘Ext.urlDecode()’ call is the only non-plain JavaScript portion of this code. It returns the decoded URL parameters in a JavaScript object. Note that we’re only decoding the parameter portion of the URL here, (window.location.href.split('?',2))[1], since we’re just trying to determine if there is already a ‘dbenv=ENV’ attached to it. Once we have the parameter portion of the URL as an object, we look to see if there is a ‘dbenv’ attribute on this object. If not, or if the object is undefined (no parameters were available to parse), we need to look at the URL itself and the suffix we talked about before to determine the database environment to pass. That’s where we see

var loc = /(?:foo(?:-([^\.]+)?)?)\.*/.exec(window.location.href);

Here, we take the URL itself, looking for the system name (in this case ‘foo’) optionally followed by a ‘-’ dash and some text which we hope to be the environment name in lower case (‘-dev’, ‘-sit’, etc.). The exec() call gives us back an array from the regular expression. When the pattern matches, this array will always have at least 1 element, which is the part of the URL that matched the pattern entirely, accessible via loc[0] at this point. If we matched a database environment, it will be in loc[1], the second element of the array.

The one time this is NOT true is in production, which has no ‘-env’ suffix on the URL, in which case loc[1] will be ‘undefined’. I’ve never tested a case where loc has a length of less than 2 (since in our regex match, if there is no suffix, we always get ‘undefined’ back in that slot because of the capturing parentheses). But, in case some weird URL rewriting happens because of our web server admins (and they do change things without telling us sometimes), we have a simple check: if we don’t get a length of 2, we simply default the database environment to DEV.

Otherwise, we set the database environment to whatever was matched in the suffix, uppercased, or to ‘PROD’ if it was undefined. This is fairly straightforward and has worked for us so far. But, if you’ve got other ideas or suggestions, I’d love to hear them in the comments.
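
To try the decision logic outside a browser, it can be sketched as a plain function that takes the URL as a string. This is a hedged illustration, not the production code: resolveDbEnv is a hypothetical name, the query string is read with a simple regular expression instead of Ext.urlDecode() so the sketch has no ExtJS dependency, and the null guard on the exec() result is an addition for URLs that do not contain the system name at all.

```javascript
// Hypothetical stand-alone version of the dbenv logic described above.
function resolveDbEnv(href) {
  // 1. An explicit dbenv=ENV parameter wins (letters only, as in the post).
  var param = /[?&]dbenv=([^&#]*)/.exec(href);
  if (param) {
    var cleaned = param[1].replace(/[^A-Za-z]+/g, '');
    if (cleaned) {
      return cleaned;
    }
  }
  // 2. Otherwise derive the environment from the '-env' suffix in the URL.
  var loc = /(?:foo(?:-([^\.]+)?)?)\.*/.exec(href);
  if (loc && loc.length === 2) {
    return loc[1] ? loc[1].toUpperCase() : 'PROD';
  }
  // 3. Anything unexpected (including no match at all) falls back to DEV.
  return 'DEV';
}

var envs = [
  resolveDbEnv('http://foo-dev.company.com?dbenv=SIT'), // 'SIT'
  resolveDbEnv('http://foo-sit.company.com/'),          // 'SIT'
  resolveDbEnv('http://foo.company.com/'),              // 'PROD'
  resolveDbEnv('http://bar.company.com/')               // 'DEV'
];
```

One thing this makes easy to see is that the pattern is not anchored to the host name, so a ‘foo’ appearing anywhere in the URL would count as a match; anchoring it is left as an exercise.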

Enjoy!