jackStinger

Reputation: 2055

Substring like matches inside regular expressions?

I have a string on which I need to do a regex match (I'm working in R). It looks like:

"354542676655341568:1373344735:270969722:text1,text2,text4,text8"

This string has 4 parts separated by colons (:). I have multiple strings with different values, but all composed of the same 4 parts. I plan to match the first numerical part using "[0-9]{18}". The second part is a timestamp; I have a piece of code that generates a regex for a range, which I'll append. A sample looks like this:

":0*13733([0-3][0-9]{4}|4([0-3][0-9]{3}|4([0-7][0-9]{2}|800))):"

The pattern above matches all numbers between 1373300000 and 1373344800. The third part is also a plain [0-9]{9}.

The problem is the fourth part, where I have to match the text. I'll have a list of text content like text1, text3, text5. I need to accept the string if it contains at least one of the texts from the list. It's more like a substring match on the fourth part.

I've thought of splitting the text, but in my application, it would be a poor design with high resource costs. Hence, I'd like to generate one regex that does the entire match together.

I tried a few things to test this out, but I'm getting false positives. Any help available?

checktext = "check:text1,text2,text3"
> grepl("check:[a-zA-Z0-9 ]+,text2",checktext)
[1] TRUE
> grepl("check:[a-zA-Z0-9, ]+,text2",checktext)
[1] TRUE
> grepl("check:[a-zA-Z0-9, ]+,[text3|text2]",checktext)
[1] TRUE
> grepl("check:[a-zA-Z0-9, ]+,[text3|text4]",checktext)
[1] TRUE
> grepl("check:[a-zA-Z0-9, ]+,[text5|text4]",checktext)
[1] TRUE
> grepl("check:[a-zA-Z0-9, ]+,[text5|text4][a-zA-Z0-9, ]$",checktext)
[1] FALSE
> grepl("check:[a-zA-Z0-9, ]+,[text5|text3][a-zA-Z0-9, ]$",checktext)
[1] FALSE
> grepl("check:[a-zA-Z0-9, ]+,[text5|text3][a-zA-Z0-9, ]+?$",checktext)
[1] TRUE
> grepl("check:[a-zA-Z0-9, ]+,[text5|text4][a-zA-Z0-9, ]+?$",checktext)
[1] TRUE
> grepl("check:.*[text1].*",checktext)
[1] TRUE
> grepl("check:.*[text2].*",checktext)
[1] TRUE
> grepl("check:.*[text3].*",checktext)
[1] TRUE
> grepl("check:.*[text2|text4].*",checktext)
[1] TRUE
> grepl("check:.*[text5|text4].*",checktext)

After @sgibb's reply, I put all the parts together to make the final pattern:

"[0-9]{18}:0*13733([0-3][0-9]{4}|4([0-3][0-9]{3}|4([0-7][0-9]{2}|800))):[0-9]{9}:[a-zA-Z0-9, ]+,(Samsung|Nokia)"

and my text string was:

"354542676655341568:1373344735:270969722:Samsung,Galaxy"

It didn't match. Is it due to putting all of them together? When I removed the last (text) part from the regex, it matched.

> finalpattern
[1] "[0-9]{18}:0*13733([0-3][0-9]{4}|4([0-3][0-9]{3}|4([0-7][0-9]{2}|800))):[0-9]{9}:"

> keysample
[1] "354542676655341568:1373344735:270969722:Samsung,Galaxy"
> grepl(finalpattern,keysample)
[1] TRUE

Upvotes: 0

Views: 131

Answers (1)

sgibb

Reputation: 25736

IMHO you are using [ wrong. A [...] is a character class: it matches exactly one character out of the set listed inside it. So [text5|text4] matches a single t, e, x, 5, | or 4, not the words text5 or text4. If you want to group alternative patterns/strings (e.g. text5|text4), you have to use ( and ):

grepl("check:[a-zA-Z0-9, ]+,(text3|text4)",checktext)
# [1] TRUE
grepl("check:[a-zA-Z0-9, ]+,(text5|text4)",checktext)
# [1] FALSE

This should remove most of your false-positives.
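To make the difference concrete, here is a small sketch that reuses checktext from the question. The result variable names are only for illustration; the patterns are the ones discussed above.

```r
checktext <- "check:text1,text2,text3"

# A bracket expression matches ONE character from the listed set,
# so "[text5|text4]" means: one of t, e, x, 5, |, 4.
one_char   <- grepl("^[text5|text4]$", "t")      # TRUE: "t" is in the set
whole_word <- grepl("^[text5|text4]$", "text5")  # FALSE: five characters, not one
grouped    <- grepl("^(text5|text4)$", "text5")  # TRUE: alternation inside a group

# The false positive from the question and the grouped fix
false_positive <- grepl("check:[a-zA-Z0-9, ]+,[text5|text4]", checktext)  # TRUE
fixed          <- grepl("check:[a-zA-Z0-9, ]+,(text5|text4)", checktext)  # FALSE
```

In the false positive, the class after the comma is satisfied by the single t that starts text2 or text3, which is why every bracketed variant in the question returned TRUE.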

Addressing your edit:

Your regular expression is wrong (the part after the :).

[a-zA-Z0-9, ]+,: you look for one or more alphanumeric characters, commas or spaces (BTW see ?regex: class [:alnum:]), followed by a ,. This part matches Samsung, (including the comma).

Next you look for (Samsung|Nokia) but there is only Galaxy left.

There are multiple solutions:

"[[:alnum:], ]*(Samsung|Nokia)[[:alnum:], ]*"

"(Samsung|Nokia),[[:alnum:], ]+"

".*(Samsung|Nokia).*"

# ...
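As a rough check, the following sketch plugs the old tail and the first suggestion into the full pattern from the question. The variable names (prefix, old_tail, new_tail, strict_tail) are made up for this example, and strict_tail is an additional variant, not one of the suggestions above: it assumes the fourth part is a comma-separated list of items made of alphanumerics and spaces, and that only whole items should count.

```r
# First three parts, exactly as in the question's finalpattern
prefix <- "[0-9]{18}:0*13733([0-3][0-9]{4}|4([0-3][0-9]{3}|4([0-7][0-9]{2}|800))):[0-9]{9}:"
keysample <- "354542676655341568:1373344735:270969722:Samsung,Galaxy"

old_tail <- "[a-zA-Z0-9, ]+,(Samsung|Nokia)"               # needs something before the keyword
new_tail <- "[[:alnum:], ]*(Samsung|Nokia)[[:alnum:], ]*"  # keyword may appear anywhere

old_result <- grepl(paste0(prefix, old_tail), keysample)   # FALSE
new_result <- grepl(paste0(prefix, new_tail), keysample)   # TRUE

# Assumed stricter variant: the keyword must be a complete list item
strict_tail <- "([[:alnum:] ]+,)*(Samsung|Nokia)(,[[:alnum:] ]+)*$"
strict_hit  <- grepl(paste0(prefix, strict_tail), keysample)  # TRUE
strict_miss <- grepl(paste0(prefix, strict_tail),
                     "354542676655341568:1373344735:270969722:SamsungX,Galaxy")  # FALSE
```

Note that new_tail would also accept SamsungX, since it only looks for the keyword as a substring; whether that matters depends on your data.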

Or you should think about splitting your string at : and analyze each part separately.
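A minimal sketch of the splitting approach, assuming every key has exactly four colon-separated parts and using the timestamp range from the question (1373300000 to 1373344800). The names wanted, parts, items and ok are only illustrative.

```r
keysample <- "354542676655341568:1373344735:270969722:Samsung,Galaxy"
wanted <- c("Samsung", "Nokia")

# Split into the four parts; fixed = TRUE treats ":" literally
parts <- strsplit(keysample, ":", fixed = TRUE)[[1]]
timestamp <- as.numeric(parts[2])

# Split the fourth part into its individual items
items <- strsplit(parts[4], ",", fixed = TRUE)[[1]]

ok <- length(parts) == 4 &&
  grepl("^[0-9]{18}$", parts[1]) &&
  timestamp >= 1373300000 && timestamp <= 1373344800 &&
  grepl("^[0-9]{9}$", parts[3]) &&
  any(items %in% wanted)   # at least one item is in the list
```

This replaces the generated range regex with a plain numeric comparison and the substring match with %in%, at the cost of the splitting step the question wanted to avoid.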

Upvotes: 3
