Scott

Reputation: 6736

JavaScript regular expression matches some strings, but fails on other seemingly identical strings

JSFiddle

I am using Facebook's API to pull in daily crime reports from my county's police department page. They follow a mostly standardized format, with the following patterns being what I'm going off of, and a few annoying inconsistencies:

  1. The header is 3 to 4 lines long, followed by two new line characters \n\n (the code cuts this out, so it is not part of the output below)
  2. Different categories of crimes committed are grouped together with the first line being a capitalized string describing the types of crimes. Each category is separated by two new line characters \n\n above it.
  3. Actual crimes committed follow the category title described above, each (most of the time) separated by one new line character \n
  4. As an "artifact" of whatever they are copying and pasting from, a few times there are various unicode characters substituting the hyphen, including \u2013, \u2014 and \u2015
  5. All crimes reported start with the string "BEAT", or on rare occasion "Beat"

The problem I am running into is that sometimes the code below catches a category title as described in #2 above, yet in other posts the (seemingly) exact same string under the same circumstances isn't caught. The Angular code I'm using in a service can be seen below:

me.parsePosts = function() {
    var posts = facebookService.getRandomPosts(); // Just a method to return 5 random reports for now
    angular.forEach(posts, function(post) {
        // Some reports are incorrectly double spaced and inconsistent
        // with spacing and capitalization
        var fixedPost = post.message
                            .replace(/^Beat/, 'BEAT') // They were a little inconsistent back in the day
                            .replace('\n\n###', '') // All posts end with a useless ###
                            .replace('\u2013', '-') // Pesky unicode characters!
                            .replace('\u2014', '-')
                            .replace('\u2015', '-')
                            .replace('\n\nARRESTED', '\nARRESTED') // would help if this was consistent
                            .replace(/(?:\\[rn ]|[\r\n ]+)BEAT/gi, '\nBEAT'), // same with the reports...
            postSplit = fixedPost.split('\n\n'), // split up the post into potential categories
            header = postSplit.splice(0,1); // I don't want the standard header of the post

        // Pass in postSplit .join()'d back together for debugging
        me.getCategoriesFromPost(postSplit, postSplit.join('\n\n'));
    });
};

me.getCategoriesFromPost = function(postArray, post) {
    var categoryRegexp = /[A-Z\-&\/: ]+$/,
        categories = [], uniqCategories = [];

    angular.forEach(postArray, function(a) {
        var split = a.split('\n'), // Extract the category from the list of crimes
            potentialCategory = split[0].trim(); // There's often an unwanted trailing space

        if (potentialCategory.match(categoryRegexp)) {
            categories.push(potentialCategory);
        }
    });

    // Every blue moon they repost a category twice, I just want one
    // and I'll merge the two together afterwards
    uniqCategories = categories.filter(function(a,b) {
        return categories.indexOf(a) == b;
    });

    console.log(uniqCategories); // log off all the categories in the post
    console.log(post); // Display the actual post so i can visibly verify it all worked
};

So as an example, in one post:

console.log(post); shows (original raw text as received from facebookService.getRandomPosts()):

BURGLARY COMMERCIAL
BEAT E1 SPRINT WIRELESS, 7300 ASSATEAGUE DR, 3/19 0426: Unknown suspect(s) gained entry to the business by breaking the glass door. The suspect(s) stole electronics. 14-25638
BEAT D6 MONTPELIER LIQUORS, 7500 MONTPELIER RD, 3/19 0513: Unknown suspect(s) gained entry to the business by breaking the glass door. The suspect(s) stole liquor, lottery tickets, and an ATM machine. 14-25641
BEAT D4 MACY’S, 10300 LITTLE PATUXENT PKWY, 3/19 0501: Two unknown male suspects, wearing masks, gained entry to the business by breaking the glass door. The suspects were interrupted by a store employee and fled without taking anything. 14-25642
SUSPECT VEHICLE: black Dodge pickup 

BURGLARY NON COMMERCIAL
BEAT B3 6600 ASPERN DR, 3/17 2354: Four suspects gained entry to the residence via unknown means. No sign of forced entry. 14-25220 
ARRESTED:
Karlin Lamont Harris, 23, of Pirch Way in Elkridge, charged with fourth-degree burglary
Steven Lee Hubbard, 29, of Edgewater, charged with fourth-degree burglary
Jessie Tyler Holt, 22, of Pine Tree Rd in Jessup, charged with fourth-degree burglary
Brittney Victoria McEnaney, 26, of Pasadena, charged with fourth-degree burglary
BEAT C1 6900 BENDBOUGH CT, 3/18 1400: Unknown suspect(s) gained entry to the residence via the front door. No sign of forced entry. The suspect(s) stole jewelry. 14-25392
BEAT B4 7100 DEEP FALLS WAY, 3/18 1100-1440: Unknown suspect(s) gained entry to the residence by forcing a rear basement window. The suspect(s) stole jewelry and electronics. 14-25404 

VEHICLE THEFT & ATTEMPTS
BEAT E2 7-11, 9600 WASHINGTON BLVD, 3/18 0409: 
05 Acura Tag 1AV8629 14-25277 (Keys left in vehicle.)

And console.log(uniqCategories); returns

["BURGLARY COMMERCIAL", "BURGLARY NON COMMERCIAL", "VEHICLE THEFT & ATTEMPTS"]

Yet on another post, console.log(post); shows (original raw text as received from facebookService.getRandomPosts()):

ROBBERY COMMERCIAL
BEAT B3 ZIPS DRY CLEANING, 6500 OLD WATERLOO RD, 3/22 1900: An unknown suspect entered the business through an unlocked rear door. The suspect threatened an employee and demanded cash. The employee complied. The suspect fled the business. 14-26959 
SUSPECT: B/M, 5’8-5’9, black hoodie and pants, backpack 

ROBBERY NON COMMERCIAL
BEAT E7 7-11 PARKING LOT, 9100 MAIER RD, 03/23 1632: Suspect stole cash from an acquaintance and caused an abrasion with an unknown sharp object. Police are investigation the possibility it may be drug related. 14-27243 
SUSPECT: B/M, 5’8, 200 lbs, dreadlocks

BURGLARY COMMERCIAL
BEAT E1 MEGATELECOM, 8600 WASHINGTON BLVD #106, 3/22 0933: Unknown suspect(s) gained entry to the business by breaking a window. The suspect(s) stole electronics. 14-26793
BEAT F3 CATTAIL CREEK COUNTRY CLUB, 3600 CATTAIL CREEK DR, 03/22 1600- 03/23 0630: Unknown suspect(s) gained entry to a garage through an unlocked door. The suspect(s) stole golf carts. 14-27127

BURGLARY NON COMMERCIAL
BEAT E2 9300 BREAMORE CT, 03/21 1210 ATTEMPT: Two suspects attempted to gain entry via a rear slider. The resident yelled and the suspects fled, but were later caught by police. 14-26458
ARRESTED:
Travis Donte Mackell, 23, of Baltimore, charged with fourth-degree burglary
Maurice Debuiel Aye, 26, of Baltimore, charged with fourth-degree burglary
BEAT D3 5500 COLUMBIA RD, 3/21: An unknown suspect gained entry to the residence through an unlocked rear slider. The suspect woke the resident, who ultimately got the suspect to leave. It appears he may have entered the wrong residence. 14-26712 
SUSPECT: B/M, 5’8, 200 lbs
BEAT B4 7500 HEARTHSIDE WAY, 3/22 1700- 1800: Three unknown black male suspects stole a bicycle, which was unsecured on a bike rack. 14-27185
BEAT E3 9100 BRYANT AVE, 3/23 2213: Unknown suspects gained entry to the residence by prying open the kitchen window. Nothing appeared to be taken. 14-27308
BEAT B3 8000 KEETON RD, 3/23 1930- 2230: Unknown suspect(s) gained entry to the residence through an unlocked window. The suspect(s) stole a computer and jewelry. 14-27314
BEAT A3 9000 FREDERICK RD, 3/23 0205: The suspect kicked in an acquaintance’s door after a verbal altercation and assaulted him. 14-27361 
ARRESTED: Michael Wilson Sittig, 34, of Frederick Road in Ellicott City, charged with second-degree assault, third- and fourth-degree burglary, malicious destruction of property, and disorderly conduct

VEHICLE THEFT & ATTEMPTS
BEAT D2 5100 ELIOTS OAK DR, 03/22 2130- 3/23 0700: 
12 Hyundai Sonata Red MD 5AN2945 14-27135

and console.log(uniqCategories) only returns:

["ROBBERY COMMERCIAL", "VEHICLE THEFT & ATTEMPTS"]

I expect it to return ["ROBBERY COMMERCIAL", "ROBBERY NON COMMERCIAL", "BURGLARY COMMERCIAL", "BURGLARY NON COMMERCIAL", "VEHICLE THEFT & ATTEMPTS"]

In that instance, it's clear that my code matches the former instances of BURGLARY COMMERCIAL and BURGLARY NON COMMERCIAL, but not the latter. What gives? Also, feel free to correct me and tell me I'm doing it all wrong with the wall of .replace(), and that there's a better way to do it, if there is. Thanks a bunch for the help!

Upvotes: 3

Views: 109

Answers (2)

blurfus

Reputation: 14031

You were missing a few more delimiter replacements before your split. Namely, I added:

post.message
...
.replace( /\s*\n\s\n/g, '\n\n')
.replace(/\s BEAT/g, 'BEAT') ... 

See updated fiddle

Longer explanation (updated based on comments)

If you look at the messages after the original replace(...) calls and before the .split('\n\n'), some of them have a blank space at the very end of a line, followed by a newline, then another blank space and a newline.

None of your original replace() calls took care of that. Also, some only had a newline, blank, newline pattern (which is why the first \s in the regex has a *). Finally, some of the BEAT keywords in the message were preceded by one or more blanks, so we remove those to ensure that BEAT is always preceded by a newline.
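
A minimal sketch of that failure, using a made-up two-category message (the sample text and variable names are illustrative and not taken from the fiddle). It shows that a "space, newline, space, newline" sequence contains no exact '\n\n', so the split leaves both categories in one chunk until the cleanup replace runs:

```javascript
// Hypothetical message: the first report ends with "space, newline, space, newline".
const message =
  'ROBBERY COMMERCIAL\nBEAT B3 first report \n \n' +
  'ROBBERY NON COMMERCIAL\nBEAT E7 second report';

// No exact "\n\n" occurs, so both categories stay in a single chunk.
const before = message.split('\n\n');

// Collapse the padded blank line into a clean "\n\n" first, then split.
const after = message.replace(/\s*\n\s\n/g, '\n\n').split('\n\n');

console.log(before.length); // 1
console.log(after.map(function (chunk) { return chunk.split('\n')[0]; }));
// [ 'ROBBERY COMMERCIAL', 'ROBBERY NON COMMERCIAL' ]
```

Because the second category title ends up in the middle of the first chunk, it is never the first line of any element, and the category check never sees it.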

If you uncomment the logging lines in the fiddle and comment out the fix, you will see the array of elements at each step.

In one of those, you will see that one array element contains not only what we expect (one report) but also the next category embedded in it, which is why you see fewer categories than expected.

From there I just looked at what was different about those line endings and checked whether the replace() calls took care of them before the split(...) call.
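
One way to do that kind of inspection, sketched with a hypothetical chunk (the string and variable names below are assumptions for illustration, not output from the fiddle): JSON.stringify escapes control characters, which makes stray spaces around newlines visible in the console.

```javascript
// Hypothetical chunk taken from the array produced by split('\n\n').
const chunk = 'SUSPECT: B/M, backpack \n \nROBBERY NON COMMERCIAL';

// JSON.stringify escapes control characters, so newlines show up as \n
// and stray spaces around them become easy to spot.
const visible = JSON.stringify(chunk);
console.log(visible); // "SUSPECT: B/M, backpack \n \nROBBERY NON COMMERCIAL"

// Flag line breaks that have a space or tab directly before or after them.
const paddedBreaks = chunk.match(/[ \t]+\n|\n[ \t]+/g) || [];
console.log(paddedBreaks.length); // 2
```

Any chunk with a non-zero count here is a candidate for a missed split.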

Let me know if you want me to explain it better.

Upvotes: 1

Felipe Brahm

Reputation: 3160

String.replace with a string pattern replaces only the FIRST occurrence. You need to change all your String.replace calls to use a regex with the g flag to replace all occurrences. Note that the pattern must be an unquoted /.../g literal (a quoted '/.../g' is just a string), and that \uXXXX escapes work inside regex literals. Something like this:

post.message
  .replace(/^Beat/gim, 'BEAT') // They were a little inconsistent back in the day (m lets ^ match at every line start)
  .replace(/\n\n###/g, '') // All posts end with a useless ###
  .replace(/\u2013/g, '-') // Pesky unicode characters!
  .replace(/\u2014/g, '-')
  .replace(/\u2015/g, '-')
  .replace(/\n\nARRESTED/g, '\nARRESTED') // would help if this was consistent
  .replace(/(?:\\[rn ]|[\r\n ]+)BEAT/gi, '\nBEAT'), // same with the reports...
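
A short sketch of the difference, using a made-up line with two en dashes (the sample string and variable names are illustrative only). The character class is an optional shortcut that covers \u2013, \u2014 and \u2015 in one call:

```javascript
// Hypothetical line containing two en dashes (\u2013).
const line = '3/22 1600\u2013 03/23 0630, case 14\u201327127';

// A string pattern replaces only the first occurrence.
const firstOnly = line.replace('\u2013', '-');

// A regex literal with the g flag replaces every occurrence;
// the range \u2013-\u2015 covers all three dash variants at once.
const all = line.replace(/[\u2013-\u2015]/g, '-');

console.log(firstOnly.indexOf('\u2013') !== -1); // true: one en dash is still there
console.log(all); // 3/22 1600- 03/23 0630, case 14-27127
```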

Upvotes: 2
