Reputation: 2055
I have a string on which I need to do a regex match (I'm working in R). It looks like:
"354542676655341568:1373344735:270969722:text1,text2,text4,text8"
This string has 4 parts separated by colons (:). I have multiple strings with different values, but all composed of the same 4 parts.
The first numerical part I plan to match using "[0-9]{18}"
For the second part (it is a timestamp), I have a piece of code that generates a regex for a range that I'll append. A sample looks like this:
":0*13733([0-3][0-9]{4}|4([0-3][0-9]{3}|4([0-7][0-9]{2}|800))):"
The pattern above matches all numbers between 1373300000 and 1373344800.
The third part is also a plain [0-9]{9}
The problem is the fourth part, where I have to match the text. I'll have a list of text content like text1, text3, text5. I need to accept the string if it contains at least one of the texts from the list. It's more like a substring match for the fourth part.
I've thought of splitting the text, but in my application, it would be a poor design with high resource costs. Hence, I'd like to generate one regex that does the entire match together.
I tried a few things to test this out, but I'm getting false positives. Any help available?
checktext = "check:text1,text2,text3"
> grepl("check:[a-zA-Z0-9 ]+,text2",checktext)
[1] TRUE
> grepl("check:[a-zA-Z0-9, ]+,text2",checktext)
[1] TRUE
> grepl("check:[a-zA-Z0-9, ]+,[text3|text2]",checktext)
[1] TRUE
> grepl("check:[a-zA-Z0-9, ]+,[text3|text4]",checktext)
[1] TRUE
> grepl("check:[a-zA-Z0-9, ]+,[text5|text4]",checktext)
[1] TRUE
> grepl("check:[a-zA-Z0-9, ]+,[text5|text4][a-zA-Z0-9, ]$",checktext)
[1] FALSE
> grepl("check:[a-zA-Z0-9, ]+,[text5|text3][a-zA-Z0-9, ]$",checktext)
[1] FALSE
> grepl("check:[a-zA-Z0-9, ]+,[text5|text3][a-zA-Z0-9, ]+?$",checktext)
[1] TRUE
> grepl("check:[a-zA-Z0-9, ]+,[text5|text4][a-zA-Z0-9, ]+?$",checktext)
[1] TRUE
> grepl("check:.*[text1].*",checktext)
[1] TRUE
> grepl("check:.*[text2].*",checktext)
[1] TRUE
> grepl("check:.*[text3].*",checktext)
[1] TRUE
> grepl("check:.*[text2|text4].*",checktext)
[1] TRUE
> grepl("check:.*[text5|text4].*",checktext)
After @sgibb's reply, I put all the parts together to make the final pattern:
"[0-9]{18}:0*13733([0-3][0-9]{4}|4([0-3][0-9]{3}|4([0-7][0-9]{2}|800))):[0-9]{9}:[a-zA-Z0-9, ]+,(Samsung|Nokia)"
and my text string was:
"354542676655341568:1373344735:270969722:Samsung,Galaxy"
It didn't match. Is it due to putting all of them together? When I removed the last (text) part from the regex, it matched.
> finalpattern
[1] "[0-9]{18}:0*13733([0-3][0-9]{4}|4([0-3][0-9]{3}|4([0-7][0-9]{2}|800))):[0-9]{9}:"
> keysample
[1] "354542676655341568:1373344735:270969722:Samsung,Galaxy"
> grepl(finalpattern,keysample)
[1] TRUE
Upvotes: 0
Views: 131
Reputation: 25736
IMHO you are using [ ] incorrectly. A [ ] encloses a class of characters and matches exactly one character from that class. If you want to group alternative patterns/strings (e.g. text5|text4), you have to use ( ):
grepl("check:[a-zA-Z0-9, ]+,(text3|text4)",checktext)
# [1] TRUE
grepl("check:[a-zA-Z0-9, ]+,(text5|text4)",checktext)
# [1] FALSE
This should remove most of your false positives.
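To see the difference in isolation, here is a minimal sketch. It reuses checktext from the question; the single-character input "t" is only there for illustration.

```r
checktext <- "check:text1,text2,text3"

# A bracket expression is a set of single characters: [text5|text4]
# matches any ONE of t, e, x, 5, |, 4, so a lone "t" is already enough.
in_class <- grepl("[text5|text4]", "t")

# Parentheses group alternatives, so a whole word has to be present.
in_group <- grepl("(text5|text4)", "t")

# Applied to the string from the question:
no_hit <- grepl("(text5|text4)", checktext)  # neither text5 nor text4 occurs
hit    <- grepl("(text3|text4)", checktext)  # text3 occurs

c(in_class = in_class, in_group = in_group, no_hit = no_hit, hit = hit)
```

Only in_class and hit should come out TRUE, which is why every [ ]-based attempt in the question returned TRUE whenever a "t", "e" or "x" was present.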
Addressing your edit:
Your regular expression is wrong (the part after the last :).
[a-zA-Z0-9, ]+, : you look for alphanumeric characters (BTW see ?regex: class [:alnum:]), commas or spaces occurring at least once and followed by a , . This will match against Samsung, .
Next you look for (Samsung|Nokia) but there is only Galaxy left, so the match fails.
There are multiple solutions:
"[[:alnum:], ]*(Samsung|Nokia)[[:alnum:], ]*"
"(Samsung|Nokia),[[:alnum:], ]+"
".*(Samsung|Nokia).*"
# ...
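As a rough sketch of how the first of these could be combined with the rest of your pattern: the prefix and keysample below are copied from the question, while text_part() is a hypothetical helper name I made up, and it assumes the texts in the list contain no regex metacharacters.

```r
keysample <- "354542676655341568:1373344735:270969722:Samsung,Galaxy"

# First three parts of the pattern, taken unchanged from the question.
prefix <- "[0-9]{18}:0*13733([0-3][0-9]{4}|4([0-3][0-9]{3}|4([0-7][0-9]{2}|800))):[0-9]{9}:"

# Hypothetical helper: build the fourth part from a vector of accepted texts.
# Assumption: the texts contain no regex metacharacters.
text_part <- function(texts) {
  paste0("[[:alnum:], ]*(", paste(texts, collapse = "|"), ")[[:alnum:], ]*")
}

match_samsung <- grepl(paste0(prefix, text_part(c("Samsung", "Nokia"))), keysample)
match_other   <- grepl(paste0(prefix, text_part(c("Sony", "Nokia"))), keysample)

c(match_samsung = match_samsung, match_other = match_other)
```

The first call should match and the second should not. Note that this is a substring match, so Samsung would also be found inside a longer item such as Samsung2.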
Or you should think about splitting your string at : and analyzing each part separately.
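A minimal sketch of the splitting approach, in case it is an option after all. The accepted vector is a hypothetical list of texts, the timestamp bounds are the ones from the question, and unlike the regex versions this compares whole comma-separated items rather than substrings.

```r
keysample <- "354542676655341568:1373344735:270969722:Samsung,Galaxy"
accepted  <- c("Samsung", "Nokia")  # hypothetical list of accepted texts

# Split into the 4 colon-separated parts, then split the text part at commas.
parts <- strsplit(keysample, ":", fixed = TRUE)[[1]]
items <- strsplit(parts[4], ",", fixed = TRUE)[[1]]

# Compare the timestamp numerically instead of generating a range regex.
timestamp <- as.numeric(parts[2])

ok <- grepl("^[0-9]{18}$", parts[1]) &&
  timestamp >= 1373300000 && timestamp <= 1373344800 &&
  grepl("^[0-9]{9}$", parts[3]) &&
  any(items %in% accepted)  # whole-item match, not a substring match

ok
```

Whether this is too expensive for your application is something you would have to measure; it does avoid the generated range regex entirely.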
Upvotes: 3