Ram Iyer

Reputation: 1564

How do I write a regex to search for items within UA-Parser?

I am using UA-Parser to create a table of devices for analytics. I have a CSV of user-agent strings from our server, and I am using the stock UA-Parser package for Node (ua-parser-js).

However, I am having difficulty parsing some Droid user-agent strings.

The current regex for Droid is:

 /\s((milestone|droid[2x]?))[globa\s]*\sbuild\//i

The above matches

Mozilla/5.0 (Linux; U; Android 2.3.4; en-us; DROIDX Build/4.5.1_57_DX8-51) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1,182

But does not match

Mozilla/5.0 (Linux; U; Android 4.1.2; en-us; DROID RAZR Build/9.8.2O-72_VZW-16) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30,652
Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; DROID X2 Build/4.5.1A-DTN-200-18) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1,152

How should I modify the regex so that it also matches the above strings?

Upvotes: 0

Views: 260

Answers (4)

Spudley

Reputation: 168715

To solve this we need to isolate the part of the string that is causing us a problem.

Let's cut the strings down and look only at the part that we're interested in:

DROIDX Build compared with DROID RAZR Build or DROID X2 Build

We can see that they all match the droid, and the [2x] is optional, so that doesn't matter.

The problem is in the next bit: [globa\s]*.

The * makes this part optional, but it also means that whatever sits between the word droid (with or without a following 2 or X) and the white space before build may only consist of the characters in this list: g, l, o, b, a, or white space.

We have RAZR and X2 in the failing strings. If any of the characters in those words is not in the above list, then the match fails. (As it turns out, only the A in RAZR is in the list, but a single character outside it would be enough to fail.)
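
To see this concretely, here is a small sketch that runs the original pattern against the device part of each string and lists which characters of each suffix fall outside [globa\s]. The helper name outsideClass and the shortened sample strings are only for illustration.

```javascript
// Original pattern from the question.
const original = /\s((milestone|droid[2x]?))[globa\s]*\sbuild\//i;

// Shortened samples: only the part of each UA string that matters here.
const samples = [
  '; DROIDX Build/4.5.1_57_DX8-51)',
  '; DROID RAZR Build/9.8.2O-72_VZW-16)',
  '; DROID X2 Build/4.5.1A-DTN-200-18)',
];

// Hypothetical helper: characters of a suffix that [globa\s] cannot match.
function outsideClass(suffix) {
  return [...suffix].filter((ch) => !/[globa\s]/i.test(ch));
}

const results = samples.map((s) => original.test(s));
console.log(results);               // [ true, false, false ]
console.log(outsideClass(' RAZR')); // [ 'R', 'Z', 'R' ]
console.log(outsideClass(' X2'));   // [ 'X', '2' ]
```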

So the quick and easy fix here is to add the characters r,z,x and 2 to the globa\s.

This will fix it for the given examples -- ie it will now accept the RAZR or X2 in this section of the string.

However, to allow for other possible cases, you may want to be a bit more lenient and allow any alpha-numeric characters. It's up to you, but there's no predicting what UA strings are going to appear in the future.

So I would suggest replacing the whole globa bit with a-z0-9.

 /\s((milestone|droid[2x]?))[a-z0-9\s]*\sbuild\//i

Even this may not pick up all possible variants that could appear, but that's the trouble with user agent strings; they're not exactly a well-defined format; they can contain pretty much anything.
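
As a quick check, the sketch below runs the lenient pattern over the three user-agent strings from the question (without the trailing CSV column) and also prints capture group 1. The variable names are my own.

```javascript
// Lenient pattern: letters, digits or whitespace between "droid" and " build/".
const lenient = /\s((milestone|droid[2x]?))[a-z0-9\s]*\sbuild\//i;

const userAgents = [
  'Mozilla/5.0 (Linux; U; Android 2.3.4; en-us; DROIDX Build/4.5.1_57_DX8-51) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
  'Mozilla/5.0 (Linux; U; Android 4.1.2; en-us; DROID RAZR Build/9.8.2O-72_VZW-16) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
  'Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; DROID X2 Build/4.5.1A-DTN-200-18) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1',
];

// Does each string match at all?
const matched = userAgents.map((ua) => lenient.test(ua));

// What does capture group 1 hold for each string?
const groupOne = userAgents.map((ua) => {
  const m = lenient.exec(ua);
  return m ? m[1] : null;
});

console.log(matched);  // [ true, true, true ]
console.log(groupOne); // [ 'DROIDX', 'DROID', 'DROID' ]
```

All three strings now match, but the RAZR and X2 parts are consumed by the character class rather than by the capture group.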

[EDIT] The OP adds a request for the RAZR or X2 strings to be included in the returned result string.

The short answer is that this would mean moving the relevant part of the pattern into the bracketed section, alongside the droid pattern.

However, this does complicate things, because while we want those strings to be included, we may not want others which were previously excluded -- ie the strings that previously matched the globa\s pattern. The problem here is that I don't have any examples of what those excluded strings may have been, or why they're excluded. And likewise, I don't know what strings we would want to include, beyond RAZR or X2. I would guess that we'd need to be relatively lenient, but it's not easy to know how to distinguish them without knowing what the possibilities are (and indeed, it may be very difficult even when we do know them).

Given the above, the only real option open to me is to suggest adding RAZR and X2 into the bracketed section, so that they are picked up specifically:

 /\s((milestone|droid[2x]?(\s(razr|x2))?))[a-z0-9\s]*\sbuild\//i

This will match both the required strings, and RAZR or X2 will be part of the captured value.

The problem, of course, is that it won't match any other possible variants that haven't been described here. Allowing for more would require knowing more about what the possible variants are, but since we've only been asked to look at these specific examples, that's all I can really offer for now.
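
A quick way to check what capture group 1 actually holds is to compare two spellings of the optional group: one that also swallows the space before Build, and one that leaves it alone. This is only a sketch; the variable names and the shortened sample strings are illustrative.

```javascript
// Optional group also consumes the space before "Build".
const withTrailingSpace =
  /\s((milestone|droid[2x]?(\s(razr|x2)\s)?))[a-z0-9\s]*\sbuild\//i;

// Optional group leaves that space for the later \sbuild.
const withoutTrailingSpace =
  /\s((milestone|droid[2x]?(\s(razr|x2))?))[a-z0-9\s]*\sbuild\//i;

const tokens = ['; DROIDX Build/', '; DROID RAZR Build/', '; DROID X2 Build/'];

// Capture group 1 is the device name the pattern reports.
const captured = (re) => tokens.map((t) => (re.exec(t) || [])[1]);

const a = captured(withTrailingSpace);
const b = captured(withoutTrailingSpace);
console.log(a); // [ 'DROIDX', 'DROID', 'DROID' ]
console.log(b); // [ 'DROIDX', 'DROID RAZR', 'DROID X2' ]
```

With the trailing \s inside the group, \sbuild can only succeed after the engine backtracks out of the group, so group 1 falls back to plain DROID. Leaving that space outside the group keeps RAZR and X2 in the capture.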

Upvotes: 1

Zach Leighton

Reputation: 1941

Same as what everyone else said, but a simpler version:

/\s((milestone|droid[2x]?))[globa\w\s]*\sbuild\//i

Just add a \w to the character class so that it also matches the Droid suffix.
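
Since \w already covers letters, digits and underscore, the g, l, o, b, a in the class become redundant once \w is added. A minimal sketch with shortened sample strings (the variable names are my own):

```javascript
const withGloba = /\s((milestone|droid[2x]?))[globa\w\s]*\sbuild\//i;
const withoutGloba = /\s((milestone|droid[2x]?))[\w\s]*\sbuild\//i;

const tokens = ['; DROIDX Build/', '; DROID RAZR Build/', '; DROID X2 Build/'];

// Every sample matches the pattern with \w added.
const allMatch = tokens.every((t) => withGloba.test(t));

// Both spellings of the class agree on every sample.
const sameResult = tokens.every((t) => withGloba.test(t) === withoutGloba.test(t));

console.log(allMatch, sameResult); // true true
```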

Upvotes: 0

Hurricane Hamilton

Reputation: 574

This matches all three:

/\s(milestone|droid[x]?\s[^\s]*)[globa\s]*build\//i

It matches:

a whitespace character, then
either: 'milestone' OR 'droid' followed by 0 or 1 'x' characters, then
    a whitespace character, then
    zero or more characters that aren't white space, then
zero or more of the characters g, l, o, b, a, or whitespace, then
'build' then
the '/' character

all in a case insensitive manner.

It matches the DROIDX Build/ in:

 Mozilla/5.0 (Linux; U; Android 2.3.4; en-us; DROIDX Build/4.5.1_57_DX8-51) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1,182

The DROID RAZR Build/ in:

Mozilla/5.0 (Linux; U; Android 4.1.2; en-us; DROID RAZR Build/9.8.2O-72_VZW-16) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30,652

The DROID X2 Build/ in:

Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; DROID X2 Build/4.5.1A-DTN-200-18) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1,152
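
If you read the device name from capture group 1, note that for DROIDX the group ends with the space before Build, so trimming it is a good idea. A small sketch with shortened sample strings (the variable names are illustrative):

```javascript
const pattern = /\s(milestone|droid[x]?\s[^\s]*)[globa\s]*build\//i;

const tokens = ['; DROIDX Build/', '; DROID RAZR Build/', '; DROID X2 Build/'];

// Raw contents of capture group 1.
const raw = tokens.map((t) => (pattern.exec(t) || [])[1]);

// Trimmed device names, or null when nothing matched.
const devices = tokens.map((t) => {
  const m = pattern.exec(t);
  return m ? m[1].trim() : null;
});

console.log(raw);     // [ 'DROIDX ', 'DROID RAZR', 'DROID X2' ]
console.log(devices); // [ 'DROIDX', 'DROID RAZR', 'DROID X2' ]
```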

Upvotes: 0

rtcherry

Reputation: 4880

If you only need to add RAZR and X2 support: /\s((milestone|droid(?:2|x|\s+razr|\s+x2)?))[globa\s]*\sbuild\//i

Edit: Fair warning, I have no idea what the expected values can be, I just based that on the UA strings you posted in the question.
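
A rough sketch of how that pattern could be used to pull out a device name. The deviceName helper, the whitespace normalisation and the non-Droid sample are my own assumptions, not part of ua-parser-js.

```javascript
const pattern = /\s((milestone|droid(?:2|x|\s+razr|\s+x2)?))[globa\s]*\sbuild\//i;

// Hypothetical helper: returns capture group 1 with runs of whitespace
// collapsed and upper-cased, or null if the pattern does not match.
function deviceName(ua) {
  const m = pattern.exec(ua);
  return m ? m[1].replace(/\s+/g, ' ').toUpperCase() : null;
}

const names = [
  '; DROIDX Build/',
  '; DROID RAZR Build/',
  '; DROID X2 Build/',
  '; Nexus 5 Build/', // made-up non-Droid token, expected not to match
].map(deviceName);

console.log(names); // [ 'DROIDX', 'DROID RAZR', 'DROID X2', null ]
```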

Upvotes: 0
