Jan 20 2010

Simple PHP Profiling Class

datchley

Working in my current corporate environment we don’t have any useful PHP profiling abilities on any of our web servers. Unfortunately, trying to get our server support group to install something useful like APD or Xdebug and getting it integrated is a bureaucratic pain in the ass. We have some very useful tools like FireBug on the client side and we’ve started using FirePHP integration in FireBug which provides a number of nice features for debugging; but nothing specific to profiling. Given the situation, I decided to implement my own simple Profiling class that made use of FireBug/FirePHP to provide the output in a useful format. More could be done with this, to be sure; but for the time being it has taken care of my needs.

There’s no shortage of existing profiling functions and classes by other developers on the Internet; but I figured I could use some creativity and play around with a class of my own. Plus, none of the ones I’ve seen out there were using FireBug/FirePHP as part of their output, which is something I wanted to try out. So, with some spare cycles while debugging, I came up with the fbProfile class (see download at end of the article).

The fbProfile class is very simple and straightforward. From a usage standpoint, it provides only 3 methods: start(), stop() and display(). We’ll go into these in a bit more detail later; but, my goals for the class and its usage are summed up as follows:

  1. be able to nest calls to start() and stop() to produce a visible call chain
  2. be able to tag profiled tasks with a descriptive name for reference in the output
  3. track the total Page time (or close approximate)
  4. track how many times a block of profiled code was called, along with profiling information for each call
  5. optionally, be able to watch a set of variables on each subsequent call to a profiled block of code, seeing the values over time
  6. determine some basic results like total task time, average task time (per call), percentage of total page time, etc.
  7. show the output using FireBug’s console using FirePHP to take advantage of the work area and abilities it provides.

Using the fbProfile Class

Because we’re using FireBug/FirePHP you’ll need to have both of these installed and enabled for the site you’re using fbProfile on. At the top of your PHP code, do the usual setup for FirePHP and afterwards include the fbProfile.php file.

<?php
	ob_start();
	require_once($_SERVER['DOCUMENT_ROOT']."/firephp/lib/FirePHPCore/FirePHP.class.php");
	$firephp = new FirePHP;
	$firephp->registerErrorHandler();
	$firephp->registerExceptionHandler();

	include_once($_SERVER['DOCUMENT_ROOT'] . "/fbProfile.php");

Including the fbProfile.php class will be the marker for the page start timestamp it uses to keep track of total page load/processing time. It’s not as accurate as when using a real profiler like APD or Xdebug, but if you put it close to the top of the page (which is where it would normally fall to begin with) it’s an accurate enough approximation of page start time.

From this point, just wrap the sections or blocks of code you would like to benchmark in calls to start() and stop() respectively. A call to start takes a string name which will be used as the label for that section of code when we output the results to the FireBug console, so use something descriptive. Also, these calls can be nested, as such:

	...
	fbProfile::start('Main Query Execution');
	$rs = $tdcsdb->Execute($sql);
	fbProfile::stop();
	...
	fbProfile::start('Main Processing Loop');
	while ($row = $rs->FetchRow()) {
		...
		fbProfile::start('Build Record');
		...
		update_output_file($row);
		...
		fbProfile::stop();	// Build Record
	}
	fbProfile::stop();	// Main Processing Loop

	...

	function update_output_file($rec) {
		fbProfile::start('update_output_file()');
		...
		fbProfile::stop();
	}

This is simple usage shown above. Each call to start() has a matching call to stop() and each wrapped block/section of code is given a descriptive name. We have nested calls and even calls inside a function to profile function execution. fbProfile will track each call to the wrapped blocks of code and keep track of processing time per call and in relation to the entire page processing time.

At the end of your PHP script, do the necessary FirePHP close up because of output buffering and call fbProfile::display() to show the output of the profiling in your FireBug console. fbProfile::display takes a reference to your FirePHP instance object – the one that we created at the top of the page.

	fbProfile::display($firephp);
	ob_end_flush();
?>

Optionally, if you would like to see the individual profiling times for each call to a section of code and/or track one or more variable values related to that block of code, fbProfile::start() takes 2 optional parameters after the string label. The first is a boolean true or false to tell the fbProfile class that you would like to see this detailed information in the output when display() is called (defaults to false) and after that parameter you can pass a variable number of other parameters that are the variables you would like to see the values of during each subsequent call (shown in the detail records). So, to see the detail and watch some values, call fbProfile::start() as follows:

fbProfile::start('My Label', true, $var1, $var2, ...);

Below is some output from a run of fbProfile on some code I have been working on, the detail records are collapsed in this screenshot. If a task has detail records, you’ll see the little ‘table’ icon on the left side of the task summary, since we’re using FirePHP’s table() function to format the output.

And the next picture shows the output display of detailed call records for a task that was called numerous times. Just click the task summary with the table icon to expand the detail records display.

Hopefully, this class will prove useful for someone as-is, or even as a jumping off point in creating their own. Again, this is not hardened code and definitely not as clean as it could be; but it’s the kind of thing that developers do when they need a fix and are limited on resources.

Download link: fbProfile.php

Enjoy!


Jan 6 2010

Submitting an Ext.form.Checkbox when unchecked

datchley

ExtJS is a great Javascript UI library and API. In using ExtJS for UI components to create forms, the Ext checkbox widget (Ext.form.Checkbox) mimics the behavior of standard HTML checkboxes in that if they are not checked, they are not submitted via POST/GET on the form. This makes sense in some instances, in that if I don’t set a value for a given checkbox field I probably don’t want it on the back-end – except when that checkbox represents an on/off type value and that value, whether on or off, needs to be updated in a database record somewhere.

So, what do you do in those instances where you want to know if a field value has been toggled off, and not just on? You could cycle through those particular boolean fields for each record you want to update, checking if the existing value is set on the record; and if that field isn’t present in the form POST data unset it explicitly. However, that’s a lot of work and a lot of hard-coding of boolean indicator fields on your back-end. A better solution would be if we could actually make those stubborn checkboxes send a value on form submission even when they aren’t checked! And lo and behold, with ExtJS’s Ext.override() functionality for classes, we can do that with the standard Ext.form.Checkbox component.

In the code below, we use Ext.override() to modify the basic Ext.form.Checkbox component to include a hidden input field that will contain the “off” or “unchecked” value that we want submitted via our forms. However, in order to avoid sending duplicate fields when the primary checkbox is checked (we don’t want to send an “on” and an “off” value) we remove that hidden field each time the checkbox is checked by overriding the setValue() method of Ext.form.Checkbox.

We only need to override two methods: Ext.form.Checkbox.onRender() and Ext.form.Checkbox.setValue() to handle all the cases we need. Here’s the code…

(function() {
	// 'var' keeps these references local to the closure instead of leaking globals
	var origCheckboxRender = Ext.form.Checkbox.prototype.onRender;
	var origCheckboxSetValue = Ext.form.Checkbox.prototype.setValue;

	Ext.override(Ext.form.Checkbox, {
		onRender: function() {
			// call the original onRender() function
			origCheckboxRender.apply(this, arguments);

			// Handle initial case based on this.checked
			if (this.checked == false) {
				this.noValEl = Ext.DomHelper.insertAfter(this.el, {
            		tag: 'input',
            		type: 'hidden',
            		value: '0',
            		name: this.getName()
        		}, true);
			}
			else {
				this.noValEl = null;
			}
		},
		setValue: function() {
			// call original setValue() function
			origCheckboxSetValue.apply(this, arguments);
			if (this.checked) {
				if (this.noValEl != null) {
					// Remove the extra hidden element
					Ext.select('input[id=' + this.noValEl.id + ']').remove();
					this.noValEl = null;
				}
			}
			else if (this.noValEl == null) {
				// Add our hidden element for (unchecked) value, unless one is already there
				this.noValEl = Ext.DomHelper.insertAfter(this.el, {
            		tag: 'input',
            		type: 'hidden',
            		value: '0',
            		name: this.getName()
        		}, true);
			}
		}
  });
})();

Worth noting, though not strictly necessary, is that we’re using a closure: the override is wrapped in an anonymous function that is called immediately. This simply keeps us from polluting the global namespace; and once the override is done there’s no need for other code to reference any of those objects again anyway.

Before calling Ext.override(), we save the original functions for onRender() and setValue(). We don’t want to lose any existing Checkbox functionality they provide; we simply want to make them do a bit more. We could have chosen to use Ext.extend() here and create an entirely new type of Checkbox that handled this functionality. But, in my case on this project, that would have impacted too many other applications that already depended on the original Checkbox functionality; and, we didn’t want to go through all of our code and change every occurrence of Checkbox to SmartCheckbox or some such. Too much regression testing at this point. So, we chose to override instead of extend – but if you have the opportunity, extending might be the better option and provide you with a way to set the “unchecked” value you want for each Checkbox instance.

So, in each overridden function we simply call the original function first to retain existing functionality and then add our bits to extend the object. We use Ext.DomHelper.insertAfter() to insert the new hidden form field, which has the same name as the actual Checkbox field. If the checkbox starts off unchecked, we add the hidden field; if it starts off checked, we simply set our noValEl variable to null. The same goes for setValue(), which is called each time the checkbox value changes: checking the box removes the hidden field and unchecking it adds the field back.

This is the kind of simple solution that the ExtJS library allows you to perform and which makes it a very extendable API for cross-browser Javascript UI development. Enjoy!


Jan 4 2010

IE and RegExp.exec()

datchley

In some recent code, I’m using Javascript to parse through the result set of an AJAX call, which happens to be returning a full HTML page. Yes, ideally, I’d have an AJAX call return something usable like JSON, but in this case the PHP back-end code had to remain as is and the front-end adjust to handle the legacy HTML it returned.

I needed to grab a link (1 or more) from the returned HTML page so that I could immediately display those links in separate windows (each was a generated report).  So, my first stab at this is shown in the following code example. Basically, we have set up a string to represent the returned HTML, in this case it contains 3 <a> links; and we want to use the standard Javascript RegExp object’s exec() method to grab the URLs (href parameter) for each of those links.  In our example, we just print them out in an unordered list to see what we’ve captured. The important line of code we’ll be looking at is the while loop condition in the example below.

var s='<a href="x">X</a>\n<a href="y">Y</a>\n<a href="z">Z</a>\n';
document.write('Found the following link URLs in the string:<br/><ul>');
while (matches = /<a href=['"](.*)['"]>.*<\/a>/g.exec(s)) {
  document.write('<li>' + matches[1] + '</li>\n');
}
document.write('</ul>');

Which, when run, we get the following results in Firefox/Safari/Chrome:

Found the following link URLs in the string:

  • x
  • y
  • z

Our while loop using RegExp.exec() on our in-line regular expression does what it’s supposed to and continues to match from where it left off in the string giving us our captured portion in the matches[] array.  However, when run in Internet Explorer, we get the following lovely result (at least up until IE tells us the script is no longer responding and asks us to kill it):

Found the following link URLs in the string:

  • x
  • x
  • x
  • x
  • x
  • x
  • x
  • x
  • x
  • …ad infinitum…

Obviously, we have generated an infinite loop using our code above in IE; but why?  The issue is how the lastIndex member of the regular expression object is maintained between iterations of the loop.  Because the regular expression is in-lined in the while condition, IE creates a new RegExp object each time through the loop and hence resets the lastIndex member to the beginning of the string. Therefore, we match the first link in the string infinitely as the lastIndex pointer never progresses between matches.  (ECMAScript 5 specifies that a regular expression literal produces a new RegExp object every time it is evaluated, so any engine that follows ES5 loops forever on this code as well.)  There is a way around this, and that is to declare the regular expression separately, outside the loop, (it gets created just once) and then call exec() on that singular RegExp object as follows:

var rx = /<a href=['"](.*)['"]>.*<\/a>/g;
var s='<a href="x">X</a>\n<a href="y">Y</a>\n<a href="z">Z</a>\n';
document.write('Found the following link URLs in the string:<br/><ul>');
while (matches = rx.exec(s)) {
  document.write('<li>' + matches[1] + '</li>\n');
}
document.write('</ul>');

Now, the lastIndex member of our RegExp object gets updated correctly and we get the results we expected.  Somewhat related to this item is the following interesting lastIndex bug in IE with zero-length matches.  Hopefully, this will save someone a headache when trying to debug using Javascript RegExp.exec().
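To make the role of lastIndex a little more visible, here is a minimal sketch that reuses a single global RegExp object and records where lastIndex sits after each match. The collectHrefs() function and its return shape are made up for this illustration; it uses console.log() rather than document.write() so it can run outside a browser.

```javascript
// Collect every href by reusing ONE global RegExp object, so that
// lastIndex carries over from one exec() call to the next.
// collectHrefs() is a hypothetical helper, not part of the original code.
function collectHrefs(html) {
  var rx = /<a href=['"](.*)['"]>.*<\/a>/g;   // created once, outside the loop
  var hrefs = [];
  var positions = [];
  var m;
  while ((m = rx.exec(html)) !== null) {
    hrefs.push(m[1]);               // the captured href value
    positions.push(rx.lastIndex);   // where the next search will start
  }
  return { hrefs: hrefs, positions: positions };
}

var s = '<a href="x">X</a>\n<a href="y">Y</a>\n<a href="z">Z</a>\n';
var result = collectHrefs(s);
console.log(result.hrefs);
console.log(result.positions);
```

Each recorded position is larger than the one before it, which is exactly what keeps the loop moving forward. When exec() finally fails to match, it returns null and resets lastIndex to 0, so the same object can be reused on another string.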


Dec 31 2009

Using what’s available

datchley

I’ve seen a lot of questions from new and beginning programmers on various forums and other online communities. Many of them are geared toward a particular language, like shell programming, Perl or C; but nearly all of them involve the Unix variety of environments, primarily Linux. Many times, they are asking how to do a given task in a given language, when that task might be better done in other ways. Yes, there are times where you’ve been hand-cuffed to a particular language or operating environment; but there are generally 1001 ways to skin a cat in Unix and people tend to forget this as they’re picking up a language like Perl, Python, Ruby and others. If you are writing scripts and programs on Unix that operate on the back-end of things — i.e. not CGI or web related and/or dealing with GUI interfaces — you are almost always better off combining your primary language with the smaller scripting languages and other tools that Unix provides for developers to make your life easier, and help keep your code manageable.

In our example, someone on a Perl forum was asking about how to copy a set of files from a directory and its subdirectories to another single directory. All the files were gzipped, with a .gz extension. Doing this in Perl using File::Find or some other such feature is a bit of overkill for this given task. The Unix shell and accompanying commands it provides can handle this with less typing.

$ find . -name '*.gz' | xargs -I {} mv {} /new/parent/folder/

This is very quick and can even be called from the given Perl script via system() or backticks. Otherwise, doing it in Perl with File::Find would take a number of lines to find the matching files in the given directory tree and then you’d still have to write the Perl code to move the files over to the new directory. Just call the above snippet from your Perl script, check the return results and move on. No need to reinvent the wheel.

For reference, here’s the equivalent operation using Perl only:

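#!/usr/bin/perl
use strict;
use warnings;

use File::Find;
use File::Copy;

my $target = "/new/parent/folder";

find(\&filemove, "./");
exit(0);

# Called once per entry. File::Find has already chdir'd into the entry's
# directory, so $_ holds the bare file name at this point.
sub filemove {
        return unless -f $_ && m/\.gz$/;
        move($_, "$target/$_") or warn "could not move $File::Find::name: $!\n";
}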

That’s much more coding. And honestly, you’re not saving yourself much portability unless you plan on running this in a non-Unix environment as well. Remember, there’s no problem combining different types of languages or scripts to accomplish the given task. That’s what all those tools are there for to begin with!

Linux/Unix provide developers with a deep and rich tool-set of commands and utilities which they should take advantage of in solving problems.  It was a design goal of Unix to provide discrete commands and tools with very specific functionality that could be combined to solve more complex problems – so why not extend that paradigm to include all the great modern languages and tools that we all use today as well.

Enjoy!


Dec 28 2009

Block Record Processing in Perl

datchley

In some work I’m doing at the moment, I need to take a rather large data file (about 1.5 million records) and create a smaller sample data file to work with.  The larger, original file had records, one per line, and the file was already sorted in ASCII order. The file in question was the movies.list file from the open IMDB database.  It contains all the movies in IMDB’s database, sorted on the movie title.  What I wanted was a block of records from each alphabetical block (a-z) giving me a smaller sampling of the file for testing purposes.

Now, not only are there movies starting with the letters A through Z, but also titles starting with symbols, such as ‘#’ or ‘$’ as well as numbers 1-9 and so forth.  Quite a smattering of possibilities.  What I wanted to do was have a block of 20 records from each set, only based on the first character of the movie title.  Here’s the quick script I came up with to generate the sample data file:

#! /usr/bin/perl
use warnings;
use strict;

my $BLOCKSIZE = 20;
my $n = 0;

FILE:
while ($_ = <>)  {
	# get character that defines this block
	my $block = quotemeta substr($_, 0, 1);

	while (m/^$block/) {
		print if $n++ < $BLOCKSIZE;
		last FILE unless defined($_ = <>);
		# eat up records numbers over $BLOCKSIZE
	}
	print;			# first record of next block, so print it now
	$n = 1;			# reset our block counter to 1, not 0 (we got that one already)
}

The original file was rather large and I didn’t want to read the entire thing into memory, so I needed to process it in one pass. The script is small and fairly quick, but let’s look at some of the primary points.

       my $block = quotemeta substr($_, 0, 1);

Here we get the first character of the first input line so that we know which block we’re starting. We’ll use this character in our regular expression as we go through the rest of the matching records in this block. The quotemeta part is there because of the non-alphanumeric characters I mentioned occurring in the titles before – it escapes them so we can then use them in the following regular expression.

	while (m/^$block/) {
		print if $n++ < $BLOCKSIZE;
		last FILE unless defined($_ = <>);
		# eat up records numbers over $BLOCKSIZE
	}

Here, we use that character in our expression to test the current line (which should obviously match) and keep cycling through lines until we hit a line that doesn’t start with the character we found previously (each new line is read by the last FILE unless defined($_ = <>) statement, which also leaves the outer loop at end of input). Since we only want 20 records of each block for our sample, we only print the matching lines if the record count is less than our blocksize, using $n to keep track of our record count.  All other records that match the block but are OVER our block size, just get eaten up.

Lastly, after the end of our previous loop we’ll have a record in $_ that doesn’t match our previous pattern but will start the next one. Since we want the first 20 records of each block, we print this one here and restart our record counter at 1, instead of zero, for the next block processing loop (the print and $n = 1 statements at the bottom of the outer loop).
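For comparison, here is the same one-pass idea sketched in JavaScript rather than Perl, purely as an illustration. The sampleBlocks() generator and the sample titles are invented for this example. Instead of a nested loop it just remembers the first character of the current block and resets the counter whenever that character changes. As far as I can tell this also covers a block that holds only a single record, which the nested-loop version would follow with one record fewer in the next block.

```javascript
// One-pass sampler: yields at most blockSize lines for each run of
// consecutive lines that share the same first character.
// sampleBlocks() is a hypothetical name used only for this sketch.
function* sampleBlocks(lines, blockSize) {
  var current = null;   // first character of the block being read
  var n = 0;            // lines already emitted from that block
  for (var line of lines) {
    var key = line.charAt(0);
    if (key !== current) {   // a new block starts here
      current = key;
      n = 0;
    }
    if (n < blockSize) {     // lines past the block size are skipped
      n++;
      yield line;
    }
  }
}

// Made-up, already sorted sample data
var titles = ['#1 Movie', '$ Bill', '$pent', '$tar', 'Alpha', 'Avatar', 'Batman'];
console.log(Array.from(sampleBlocks(titles, 2)));
```

Because it is a generator that consumes any iterable, it can be fed lines as they are read and never needs the whole file in memory.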

You got Better Ideas? I’d love to hear how you would tackle this problem and even get some examples or short scripts from readers. Feel free to comment on my own script as I’d love to hear about ways to improve it!


Dec 18 2009

chext: Batch rename file extensions

datchley

Most Linux systems come with a rename command today; but some of the commercial Unixes like AIX and HP-UX don’t have many of the command sets that Linux users have come to rely on.  This isn’t quite the same as rename, but it’s a script I put together a while back because changing the extension on a number of files was something I was doing quite frequently on those types of systems.  Here’s the script, feel free to copy/paste and use.

#!/bin/sh
# chext - batch rename files by changing the file extension
# Author: Dave Atchley <dave@tuxz0r.net>
#----------------------------------------------------------------------

usage() {
	echo "Usage: $0 [-R] OLD NEW"
	echo
	exit 1
}

case "$1" in
# -R recursive option
-R) 	if [ $# -ne 3 ]; then
		echo "error: missing arguments"; usage
	fi
	ext=$2
	new=$3
	files=`find . -name "*$2"`
	;;
-h)	echo "chext: change the extension on multiple files"; usage
	;;
*) 	if [ $# -ne 2 ]; then
		echo "error: missing arguments"; usage
	fi
	ext=$1
	new=$2
	files=`echo *$1`
	;;
esac

for f in $files
do
	newname=`echo "$f" | sed 's/'"$ext"'$/'"$new"'/'`
	mv "$f" "$newname"
done
exit 0

Basically the script will by default handle files in the current directory, but using the -R option will allow it to do so recursively.  You can call it from the command line as

$ chext .old .new

or recursively as

$ chext -R .old .new


Dec 17 2009

Staging Unit Tests using rsync

datchley

Everyone has their own development process and tools.  In most software processes, though, there is a spot for “unit testing” by the developer on the given feature or defect they are working on.  Since I’m primarily involved in web development using PHP these days, this involves only a couple of steps:

  1. design and code the feature/bug fix
  2. push those changes out to our development server and test

The second part is what we’ll address in this article – namely, how I currently get the files I’m coding up to the development environment for testing in an efficient manner.  Warning: if you are working in Java or something else that needs compiling, rather than a scripting language like Ruby or PHP, then this method will not work for you. However, feel free to continue reading.

As a developer, I want to be able to quickly push my coding changes out for testing during the course of development. This lets me break the given feature or bug fix down into more digestible parts and get continuous feedback on how my design is holding up and whether I need to make any changes or corrections.  Unit testing is not final “Acceptance Testing” of a feature, but a way to ensure that the given code builds and runs without any glaring errors or faults.

At our office we distribute our web applications much like the open source community distributes their software – using GNU autotools to build a tar.gz package.  This makes installs simple, especially since another group in our company does the actual installations (and frankly, sometimes even this method isn’t simple enough for them, but I won’t mention any names). However, if I need to push files out quickly to development so I can unit test, I don’t want to take the time to build a completely new autotools package of the system I’m working on just to install 1 or 2 files with a few code line changes.  Seems like overkill. Remember, good coders are inherently lazy. I’m pretty sure that’s a Larry Wall quote. I’d link you to it, but you can google just as well as me. ;-)

Luckily, for a number of our systems, the code in our code repository is structured just like the actual installation would be.  So, if I was working on a project we might have a php/, js/, css/, img/ and other assorted directories in our repository and this is exactly the same structure we’d have in our installation.  Since we’re set up this way, I can use a command called rsync to easily script the pushing of files out to development for testing, without worrying about making a brand new package.

If you are familiar with ssh or rsh — and you like them — you will love rsync. The rsync command is installed by default on most Linux systems, but if your flavor doesn’t have it, it is free to download and install – and the great thing is you don’t have to be root to install it or get it working.  You will, however, have to have it installed on each machine you want to rsync between.  In most cases, rsync will use ssh for the remote file transfers, but this can be set up differently if you want.  This means that if you pass around ssh keys to your servers, then rsync will take advantage of that when it’s copying files.

The important things to note about rsync from our standpoint are that it allows us to copy files to a remote server, takes advantage of ssh (which we like) and it does so quickly using a delta style algorithm.  Basically, we’re using rsync to “mirror” my working development folder directly on the server.  For one of my projects, here is the “sync” shell script that I use to push files out for testing using this method (names changed to protect the innocent):

#!/bin/bash
#----------------------------------------------------------------------

# Default target to something useful
target=$1
TARGET=${target:=username@dev.company.com:/home/username/public_html/exh/}

EXFILE=/tmp/excludefiles.$$
cat >$EXFILE <<EOF
- configure
- Makefile
- **.am
- **.in
- **.cache
- **.log
- .git/***
- m4/***
- build-aux/***
- ext-2.0.2/***
- ext-2.1/***
- sql/***
EOF

echo "Syncing files to location: $TARGET .........."
if rsync --exclude-from=$EXFILE --delete -ravve ssh ./ $TARGET 2>&1; then
 echo "ok"
else
 echo "failed"
fi

rm -f $EXFILE
exit 0

Let’s talk about this in a bit more detail.  The rsync command takes a destination parameter much like ssh and scp, of the form

username@host:/path/to/file/or/folder

This is what we set up in the beginning, allowing me to pass in an arbitrary destination on the command line for the script; without one, it defaults to pushing the files out to my development environment.  I should also mention that this script is in the top level of my working code repository so when I run it from there it will copy my entire directory structure using rsync.

Now that we have a destination target for rsync to use, we also want to tell rsync to ignore certain files and directories. In my case, I don’t want to copy any of the autotools related files (Makefile.am, configure.in, etc.) or certain subdirectories which are only development and configuration related and not actual working code.  We do this by creating a listing of the items we want to exclude in a temporary file.  Here’s a synopsis of the syntax we use to list those file and file/directory name patterns for rsync.  The ‘-’ at the beginning of the line tells rsync to ‘exclude’ the file during syncing, and a ‘+’ would be the opposite.

  • if the pattern starts with a / then it is anchored to a particular spot in the hierarchy of files, otherwise it is matched against the end of the pathname. This is similar to a leading ^ in regular expressions. Thus “/foo” would match a name of “foo” at either the “root of the transfer” (for a global rule) or in the merge-file’s directory (for a per-directory rule).
  • if the pattern ends with a / then it will only match a directory, not a regular file, symlink, or device.
  • rsync chooses between doing a simple string match and wildcard matching by checking if the pattern contains one of these three wildcard characters: '*', '?', and '['.
  • a '*' matches any path component, but it stops at slashes.
  • use '**' to match anything, including slashes.
  • a '?' matches any character except a slash (/).
  • a '[' introduces a character class, such as [a-z] or [[:alpha:]].
  • in a wildcard pattern, a backslash can be used to escape a wildcard character, but it is matched literally when no wildcards are present.
  • if the pattern contains a / (not counting a trailing /) or a “**”, then it is matched against the full pathname, including any leading directories. If the pattern doesn’t contain a / or a “**”, then it is matched only against the final component of the filename. (Remember that the algorithm is applied recursively so “full filename” can actually be any portion of a path from the starting directory on down.)
  • a trailing “dir_name/***” will match both the directory (as if “dir_name/” had been specified) and everything in the directory (as if “dir_name/**” had been specified). This behavior was added in version 2.6.7.

Once we have this file set up, we simply call the rsync command telling it our target and passing in our exclude list.  Then, we remove the temporary exclude list we built and we’re done.  The options I’m using on rsync here are as follows:

  • --delete : I want rsync to remove any files on the “receiving side” that aren’t on my “sending side.”
  • --recursive or -r : I want rsync to recurse into directories and subdirectories to make the copy. duh? (Strictly, -a already implies -r.)
  • --archive or -a : I want to preserve ALL properties of the files I’m copying, perms, owner, times, etc. (use to taste)
  • --verbose or -v : I use this twice, as the more I specify the more gratuitous rsync’s output on what it’s doing will be.
  • --rsh or -e : specify the remote shell to use when copying (we want ssh)

Again, rsync has more options than I could shake a stick at so please check out the man page and do some reading.

In using rsync, I’ve given myself a quick and easily configurable way to copy code and files out to an environment for testing without having to build entire packages.  Hopefully this is useful to you in your case; but, given that your working repository might not mirror the structure of your actual installation this might not be best for your situation.  This is just what works for me, on my current projects; and I’m sure that will change in the future too.


Dec 15 2009

URL Hackery Followup: Regex

datchley

In a short follow-up on my last post, I thought I’d cover briefly the regular expressions I’m using and provide some links to good references on regular expressions in Javascript. You can find a good introduction to basic regular expressions online, so I won’t belabor that summary here; and I’ll assume you are familiar with regex in any of the number of languages that support them, e.g. Perl, PHP, Javascript, grep, et al.

Here’s the first snippet from our previous post that uses regex:

var loc = /(?:foo(?:-([^\.]+)?)?)\.*/.exec(window.location.href);

What we’re trying to do is match just the leading part of the domain name, which we get from window.location.href.  For example, if the domain name of our current URL is foo.company.com, then we want to match ‘foo’ and have that returned to us in our array of matches in the variable loc.  In Javascript, exec() returns an array of matches, the first element of which is the part of the string that the pattern matched.  If there are capturing parentheses, which we have here, then each captured portion will be in the following elements, in the order they are matched in the string.

In this case, some of our parentheses are “non-capturing”, those that start with “(?:”.  So, we’d like to match a domain name that starts with the string “foo”, optionally followed by a ‘-’ and one or more characters that are not periods, followed by zero or more ‘.’ (period) characters.  Note that the trailing \.* matches only literal periods, so the remaining part of the domain name is not included in the match.

The ‘?‘ question marks at the end of the parenthesis groupings mean: match 0 or 1 of the preceding thing. This gives us the ability to make parts of the match optional, which is good, since in our case the production domain names don’t have a ‘-xxx‘ in the first part, only a ‘foo‘.  So, let’s break this down in a table and see what exec() would return to us for a number of different domain names:

url (string)            exec() returns

foo.company.com         ["foo.", undefined]
foo-bar.company.com     ["foo-bar.", "bar"]
bar.company.com         null
foobar.company.com      ["foo", undefined]

Notice in the last example, the part of the string that exec matched for this pattern doesn't have the trailing '.' like the other examples. Remember, the first element in the array is the part of the string that the pattern actually matched, which in this case is just 'foo', since it's followed directly by a 'bar' instead of a period '.'.  Unless your quantifiers, like '+' and '*', are followed by a question mark, they are considered "greedy" and will match as much as they can while still satisfying the given pattern.
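If you'd rather check these results than take my word for it, the following sketch runs the pattern against the same four names and then contrasts a greedy quantifier with a lazy one. The two short patterns at the end are made up for illustration and are not part of the original code.

```javascript
var pattern = /(?:foo(?:-([^\.]+)?)?)\.*/;
var hosts = [
    'foo.company.com',
    'foo-bar.company.com',
    'bar.company.com',
    'foobar.company.com'
];

// Collect [full match, captured suffix] for each host, or null on no match.
var results = hosts.map(function (host) {
    var m = pattern.exec(host);
    return m ? [m[0], m[1]] : null;
});

hosts.forEach(function (host, i) {
    console.log(host, results[i]);
});

// Greedy: '+' takes every non-period character it can.
var greedy = /foo-([^\.]+)/.exec('foo-bar.company.com')[1];  // "bar"

// Lazy: '+?' takes as few as it can, and one character is enough here.
var lazy = /foo-([^\.]+?)/.exec('foo-bar.company.com')[1];   // "b"

console.log(greedy, lazy);
```

Note that the 'bar.company.com' case comes back as null rather than an array, so anything that reads .length or an index from the result needs to check for null first.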

Hopefully this makes some sense.  If you want to play around with the above line of Javascript and you run Firebug, just open up the console and cut-paste the line above in the 'Run' area and try it out.  Firebug is a boon for Javascript development and I wouldn't go anywhere without it.

And for your further education, a link to some Javascript specific regular expression tutorials and examples.

Enjoy!


Dec 14 2009

Javascript URL hackery

datchley

In one of the projects I have at work we have a number of environments that run the system (a web based application): production, testing environments and development.  Each environment is assigned its own domain name. So for instance, our Systems Integration Testing (SIT) environment would be accessed as foo-sit.company.com.  The environment name is always appended to the system name in the URL, separated by a ‘-’ dash.  The only time this is not the case is in production, where the URL would just be foo.company.com. This is pretty straightforward and an easy way to access multiple web application environments.

However, there are times when I want the application running in one environment to hit the database of another environment. For instance, when running reports that I’m coding in development, I want to run them against a more recent copy of the database in our SIT environment. Our backend PHP code allows this by accepting a ‘dbenv=ENV‘ POST parameter to tell it to use a different database environment than its default. So, I could have my development environment reports hit the SIT testing environment database by changing the URL to

http://foo-dev.company.com?dbenv=SIT

So, in our Javascript code in a number of these reports, we need to determine if the user is coming in wanting a different database environment than the default and pass the dbenv=ENV parameter ourselves via the AJAX call to generate the report. Since we’re using Javascript and AJAX, this has to be dynamic, as we build the parameters for the report dynamically as well. Here’s the snippet of code that does that in Javascript land, with the names changed to protect the innocent:

var o = {};
o = Ext.urlDecode((window.location.href.split('?',2))[1]) || {};

if (!o.dbenv) {
 var loc = /(?:foo(?:-([^\.]+)?)?)\.*/.exec(window.location.href);
 if (loc && loc.length == 2) {   // exec() returns null when nothing matches
     o.dbenv = (loc[1]) ? loc[1].toUpperCase() : 'PROD';
 }
 else {
     o.dbenv = 'DEV';
 }
}
else {
 o.dbenv = o.dbenv.replace(/[^A-Za-z]+/g,'');   // strip every run of non-letters
}

var DBENV = o.dbenv;

Our team is using ExtJS as a cross-browser Javascript framework for our applications, so the ‘Ext.urlDecode()‘ call is the only non-plain Javascript portion of this code. It returns the decoded URL parameters in a Javascript object. Note that we’re only decoding the parameter portion of the URL here, (window.location.href.split('?',2))[1], since we’re just trying to determine if there is already a ‘dbenv=ENV‘ attached to it. Once we have the parameter portion of the URL as an object, we look to see if there is a ‘dbenv’ attribute on this object. If not, or if the object is undefined (no parameters were available to parse), we need to look at the URL itself and the suffix we talked about before to determine the database environment to pass. That’s where we see
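If you aren't using ExtJS, the decoding step can be approximated in plain Javascript. The sketch below is a hypothetical stand-in (parseQuery is my own name, not an Ext function) and it is not Ext's implementation: for one thing, it simply keeps the last value when a parameter name is repeated.

```javascript
// Hypothetical plain-Javascript stand-in for the Ext.urlDecode() step:
// turn the part of a URL after the '?' into an object of decoded values.
function parseQuery(href) {
    var params = {};
    var query = href.split('?', 2)[1];
    if (!query) {
        return params;              // no '?' in the URL, or nothing after it
    }
    query = query.split('#')[0];    // ignore a trailing #fragment
    var pairs = query.split('&');
    for (var i = 0; i < pairs.length; i++) {
        if (!pairs[i]) {
            continue;               // skip empty pieces such as 'a=1&&b=2'
        }
        var eq = pairs[i].indexOf('=');
        var name = eq === -1 ? pairs[i] : pairs[i].slice(0, eq);
        var value = eq === -1 ? '' : pairs[i].slice(eq + 1);
        // '+' is treated as a space, as in form-encoded query strings.
        params[decodeURIComponent(name)] =
            decodeURIComponent(value.replace(/\+/g, ' '));
    }
    return params;
}

console.log(parseQuery('http://foo-dev.company.com?dbenv=SIT').dbenv); // "SIT"
```

Because it always returns an object, the caller only has to test for the ‘dbenv’ attribute and never for an undefined result.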

var loc = /(?:foo(?:-([^\.]+)?)?)\.*/.exec(window.location.href);

Here, we take the URL itself, looking for the system name (in this case ‘foo’) followed by a ‘-’ dash and some optional text which we hope to be the environment name in lower case (‘-dev’, ‘-sit’, etc.). The exec() call gives us back an array from the regular expression whenever the pattern matches (it returns null if the pattern matches nothing at all). On a match, this array will always have at least 1 element, which is the part of the URL that matched the pattern entirely, accessible via loc[0] at this point. If we matched a database environment, it will be in loc[1], the second element of the array.

The one time this is NOT true is in production, which has no ‘-env’ suffix on the URL, in which case loc[1] will be ‘undefined’. In the code, I’ve never tested a case where loc has a length of less than 2 (since in our regex match, if there is no suffix, we always get ‘undefined’ back because of the capturing parentheses). But, in case some weird URL rewriting happens because of our web server admins (and they do change things without telling us sometimes), we have a simple check, and if we don’t get a length of 2, we simply default the database environment to DEV.

Otherwise, we set the database environment to whatever was matched in the suffix, uppercased, or to ‘PROD’ if it was undefined. This is fairly straightforward and has worked for us so far. But, if you’ve got other ideas or suggestions, I’d love to hear them in the comments.
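One possible refinement, sketched here rather than taken from our codebase, is to wrap the whole decision in a function that takes the URL as an argument, so it can be tried against sample URLs without a browser. The name resolveDbEnv is hypothetical, and this version reads dbenv with a small regex instead of Ext.urlDecode() so that it has no dependencies.

```javascript
// Hypothetical helper: work out the database environment for a given URL.
function resolveDbEnv(href) {
    // 1. An explicit dbenv=ENV parameter wins, stripped down to letters only.
    var query = href.split('?', 2)[1] || '';
    var explicit = /(?:^|&)dbenv=([^&#]*)/.exec(query);
    if (explicit && explicit[1]) {
        return explicit[1].replace(/[^A-Za-z]+/g, '');
    }

    // 2. Otherwise fall back to the '-env' suffix on the system name.
    var loc = /(?:foo(?:-([^\.]+)?)?)\.*/.exec(href);
    if (loc && loc.length == 2) {
        // No suffix at all means we are on the production host.
        return loc[1] ? loc[1].toUpperCase() : 'PROD';
    }

    // 3. Nothing recognisable in the URL: default to DEV.
    return 'DEV';
}

console.log(resolveDbEnv('http://foo-dev.company.com?dbenv=SIT')); // "SIT"
console.log(resolveDbEnv('http://foo-sit.company.com/reports'));   // "SIT"
console.log(resolveDbEnv('http://foo.company.com/'));              // "PROD"
console.log(resolveDbEnv('http://bar.company.com/'));              // "DEV"
```

In the page itself this would be called as resolveDbEnv(window.location.href). The null check on loc matters in the last case, where the system name isn't in the URL at all.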

Enjoy!