Reputation: 71

Trying to simplify a Regex

I'm spending my weekend analyzing Campaign Finance Contribution records. Fun!

One of the annoying things I've noticed is that entity names are entered differently:

For example, I see stuff like this: 'llc', 'llc.', 'l l c', 'l.l.c', 'l. l. c.', 'llc,', etc.

I'm trying to catch all these variants.

So it would be something like:

"l([,\.\ ]*)l([,\.\ ]*)c([,\.\ ]*)"

Which isn't so bad... except there are about 40 entity suffixes that I can think of.

The best thing I can think of is programmatically building up this pattern, based on my list of suffixes.

I'm wondering if there's a better way to handle this within a single regex that is human readable/writable.

Upvotes: 1

Answers (6)

James Thompson

Reputation: 48222

In Perl you can build up regular expressions inside your program using strings. Here's some example code:

#!/usr/bin/perl

use strict;
use warnings;

my @strings = (
    "l.l.c",
    "llc",
    "LLC",
    "lLc",
    "l,l,c",
    "L . L C ",
    "l  W c"
);

my @seps = ('.',',','\s');
my $sep_regex = '[' . join('', @seps) . ']*';
my $regex_def = join '', (
    '[lL]',
    $sep_regex,
    '[lL]',
    $sep_regex,
    '[cC]'
);

print "definition: $regex_def\n";

foreach my $str (@strings) {
    if ( $str =~ /$regex_def/ ) {
        print "$str matches\n";
    } else {
        print "$str doesn't match\n";
    }
}

This regular expression could also be simplified by using case-insensitive matching (that is, $str =~ /$regex_def/i), which lets you write l and c instead of [lL] and [cC]. If you run the program on your own test strings, you can easily spot the cases that your regular expression fails to match. Building up the regular expression this way is useful because you define your separator symbols only once, and I think that people are likely to use the same separators for a wide variety of abbreviations (like IRS, I.R.S, irs, etc).
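
The same idea extends to the full list of roughly 40 suffixes mentioned in the question: generate the pattern from a plain list. Here is a minimal sketch in Python (the question does not name a language). The function name build_suffix_pattern, the separator set, and the sample suffixes are illustrative assumptions, not something from the original data.

```python
import re

# Separators allowed between and after the letters of a suffix.
# This set is an assumption based on the variants listed in the question.
SEPARATORS = r"[.,\s]*"


def build_suffix_pattern(suffixes):
    """Compile one case-insensitive regex that matches any suffix at the end of a name."""
    alternatives = []
    for suffix in suffixes:
        # "llc" becomes "l[.,\s]*l[.,\s]*c[.,\s]*"
        letters = [re.escape(ch) + SEPARATORS for ch in suffix]
        alternatives.append("".join(letters))
    # \b keeps the suffix from matching in the middle of a longer word.
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")$", re.IGNORECASE)


pattern = build_suffix_pattern(["llc", "inc"])
for name in ["Acme L. L. C.", "Acme llc,", "Widgets, Inc.", "Acme l  W c"]:
    print(name, "->", bool(pattern.search(name)))
```

The list of suffixes stays human readable, and the noisy separator class is written only once.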

You also might think about looking into approximate string matching algorithms, which are popular in a large number of areas. The idea behind these is that you define a scoring system for comparing strings, and then you can measure how similar input strings are to your canonical string, so that you can recognize that "LLC" and "lLc" are very similar strings.
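
As one concrete example of such a scoring system, here is a minimal Python sketch that uses Levenshtein (edit) distance. The choice of this particular metric, the function names, and the max_distance threshold are assumptions for illustration; other similarity measures would work along the same lines.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


def closest_suffix(candidate, canonical_suffixes, max_distance=1):
    """Return the nearest canonical suffix, or None if nothing is within max_distance."""
    candidate = candidate.lower()
    best = min(canonical_suffixes, key=lambda s: edit_distance(candidate, s))
    return best if edit_distance(candidate, best) <= max_distance else None


print(closest_suffix("lLc", ["llc", "inc", "corp"]))
print(closest_suffix("xyz", ["llc", "inc", "corp"]))
```

A small max_distance matters here because entity suffixes are short, so a loose threshold would quickly confuse unrelated suffixes.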

Alternatively, as other people have suggested, you could write an input sanitizer that removes unwanted characters like whitespace, commas, and periods. In the context of the program above, you could do this:

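# Use + rather than * so the substitution never matches an empty string,
# and a new variable name so the earlier $sep_regex is not redeclared.
my $strip_regex = '[' . join('', @seps) . ']+';
foreach my $str (@strings) {
    my $copy = $str;
    $copy =~ s/$strip_regex//g;   # remove all separator characters
    $copy = lc $copy;             # normalize case
    print "$str -> $copy\n";
}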

If you have control of how the data is entered originally, you could use such a sanitizer to validate input from the users and other programs, which will make your analysis much easier.

Upvotes: 0

Alex Brown

Reputation: 42942

Don't use regexes. Instead, build up a map of all entries discovered so far and their 'canonical' (favourite) versions.

Also build a tool to discover possible new variants of suffixes: group entries that share a common prefix of a certain number of characters and print the groups on the screen, so you can add new rules.
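
Here is a rough sketch of both pieces in Python. The contents of the CANONICAL map, the function names, and the default prefix length of 3 are all hypothetical placeholders, to be replaced by whatever the real data calls for.

```python
from collections import defaultdict

# Hypothetical starting map of observed variants to a preferred spelling.
CANONICAL = {
    "llc": "LLC",
    "l.l.c.": "LLC",
    "l l c": "LLC",
    "inc": "Inc.",
    "inc.": "Inc.",
}


def canonicalize(suffix):
    """Look up a suffix in the map; return None when the variant is not known yet."""
    return CANONICAL.get(suffix.strip().lower())


def group_by_prefix(entries, length=3):
    """Group unknown entries that share their first `length` characters, for manual review."""
    groups = defaultdict(set)
    for entry in entries:
        key = entry.strip().lower()
        if key not in CANONICAL:
            groups[key[:length]].add(key)
    # Only groups with several members hint at variants of one suffix.
    return {prefix: sorted(members) for prefix, members in groups.items() if len(members) > 1}


print(canonicalize(" L.L.C. "))
print(group_by_prefix(["corp", "corp.", "corporation", "llc", "ltd"]))
```

Each group printed by group_by_prefix is a candidate for a new rule in the map; a person still decides which entries really belong together.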

Upvotes: 0

paxdiablo

Reputation: 882686

Regexes (other than relatively simple ones) and readability rarely go hand-in-hand. Don't misunderstand me, I love them for the simplicity they usually bring, but they're not fit for all purposes.

If you want readability, just create an array of possible values and iterate through them, checking your field against them to see if there's a match.

Unless you're doing gene sequencing, the speed difference shouldn't matter. And it will be a lot easier to add a new one when you discover it. Adding an element to an array is substantially easier than reverse-engineering a regex.
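
A minimal sketch of that loop in Python might look like the following. The LLC_VARIANTS list and the find_suffix name are illustrative assumptions; the real list would grow as new variants turn up.

```python
# Illustrative list of known spellings; add a new element when one is discovered.
LLC_VARIANTS = ["llc", "llc.", "llc,", "l l c", "l.l.c", "l.l.c.", "l. l. c."]


def find_suffix(name, variants):
    """Return the first variant that the name ends with, or None if there is no match."""
    lowered = name.lower().rstrip()
    for variant in variants:
        # Require a space before the variant so it is not matched inside a longer word.
        if lowered.endswith(" " + variant):
            return variant
    return None


print(find_suffix("Acme L L C", LLC_VARIANTS))
print(find_suffix("Acme Holdings", LLC_VARIANTS))
```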

Upvotes: 2

Chris Lutz

Reputation: 75479

You could just strip out excess crap. Using Perl:

my $suffix = "l. lc.."; # the worst case imaginable!

$suffix =~ s/[.,\s]//g; # strip periods, commas, and whitespace
# no matter what variation $suffix was, it's now just "llc"

Obviously this may maul your input if you use it on the full company name, but getting too in-depth with how to do that would require knowing what language we're working with. A possible regex solution is to copy the company name and strip out a few common words and any words with more than (about) 4 characters:

my $suffix = $full_name;

$suffix =~ s/\w{5,}//g; # strip words of more than 4 characters
$suffix =~ s/\b(?:a|the|an|of)\b//ig; # strip a few common words (whole words only)
# now we can mangle $suffix all we want
# and be relatively sure of what we're doing

It's not perfect, but it should be fairly effective, and more readable than using a single "monster regex" to try to match all of them. As a rule, don't use a monster regex to match all cases; use a series of specialized regexes to narrow many cases down to a few. It will be easier to understand.

Upvotes: 2

Alex Brown

Reputation: 42942

You can squish periods and whitespace first, before matching. For instance, in Perl:

while (<>) {
  my $Sq = lc $_;                      # lowercase temporary copy of the line
  $Sq =~ s/[.\s]//g;                   # squish away . and whitespace in the temporary copy
  $Sq =~ /^llc$/ and $_ = "L.L.C.\n";  # match against the squished copy; if so, save the canonical version
  $Sq =~ /^ibm/  and $_ = "IBM\n";     # a different match
  print $_;
}

Upvotes: 0

Eli Grey

Reputation: 35913

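The first two "l" parts can be collapsed into a single group repeated with a {2} quantifier: ([the first "l" part here]){2}. For the pattern in the question, that gives (l[,\.\ ]*){2}c[,\.\ ]*.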
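
As a quick check, this short Python sketch compares the pattern from the question with a {2} version on the variants the question lists. The constant names and the sample list are illustrative, and the short form uses a non-capturing group, so it produces no capture groups, unlike the original.

```python
import re

# The pattern from the question, written out in full.
LONG_FORM = re.compile(r"l([,\.\ ]*)l([,\.\ ]*)c([,\.\ ]*)")
# The same pattern with the repeated "l" part collapsed by a {2} quantifier.
SHORT_FORM = re.compile(r"(?:l[,. ]*){2}c[,. ]*")

samples = ["llc", "llc.", "l l c", "l.l.c", "l. l. c.", "llc,", "lc"]
# True for each sample on which both patterns give the same verdict.
agreement = [bool(LONG_FORM.fullmatch(s)) == bool(SHORT_FORM.fullmatch(s)) for s in samples]
print(all(agreement))
```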

Upvotes: 0
