Reputation: 841
I have a filtering problem where I'm trying to select observations based on the presence of strings with any one of 4 different patterns:
This is a sample dataset:
FromName<-c("PubA","PubB","PubC","PubB","PubC","PubB")
PostName<-c("https://www.amazon.com/gp/product/1250126681/ref=as_li_tl","https://www.amazon.com/dp/B004TLHNOC/ref=sr_1_1","https://us.macmillan.com/books/9781626724266",
"https://itunes.apple.com/us/book/six-of-crows/id975448501","http://www.anrdoezrs.net/links/7992675/type/dlg/sid/MAChhjan32018eman/https://www.barnesandnoble.com/w/something-to-howl-about-christine-warren/1127202692","https://www.amazon.com/Beach-House-Cookbook-Mary-Andrews-ebook/dp/B01M2UZS7F/")
df<-cbind(FromName,PostName)
The output should look like this:
I think the regexes for the first 2 patterns are: ^[0-9]{10}$ and ^[0-9]{13}$, and I think [a-z]{2}\d{9} should work for selecting observations with the third pattern, but I'm stuck on pattern #4. I'm also unsure of how to combine multiple regex patterns into a dplyr filter function.
Upvotes: 0
Views: 1049
Reputation: 79208
Since your data above is a matrix, we can cbind the results as shown below. You can use basename:
cbind(df[,1],basename(sub("ref.*","",df[,2])))
[,1] [,2]
[1,] "PubA" "1250126681"
[2,] "PubB" "B004TLHNOC"
[3,] "PubC" "9781626724266"
[4,] "PubB" "id975448501"
[5,] "PubC" "1127202692"
[6,] "PubB" "B01M2UZS7F"
or you can do:
cbind(df[,1],sub(".*/(\\w+)/ref.*|.*/(\\w+)/?$","\\1\\2",df[,2],perl = TRUE))
[,1] [,2]
[1,] "PubA" "1250126681"
[2,] "PubB" "B004TLHNOC"
[3,] "PubC" "9781626724266"
[4,] "PubB" "id975448501"
[5,] "PubC" "1127202692"
[6,] "PubB" "B01M2UZS7F"
Upvotes: 0
Reputation: 1403
The following regexes should work:
[0-9]{10}
[0-9]{13}
id[0-9]{9}
[A-Z0-9]{10}
You should not need ^ and $, as you're trying to match something inside the string. ^abc$ would match abc, but not xabcx.
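A quick way to check the point about anchors is with base R's grepl, using the same two test strings (the variable names here are just for illustration):

```r
x <- c("abc", "xabcx")

# Anchored: the whole string has to be exactly "abc".
anchored <- grepl("^abc$", x)

# Unanchored: "abc" may appear anywhere inside the string.
unanchored <- grepl("abc", x)

print(anchored)    # TRUE FALSE
print(unanchored)  # TRUE TRUE
```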
The first pattern matches a subset of what the fourth matches: any string of 10 digits is also a string of 10 digits or capital letters. So you only need three patterns to cover those four categories, unless you want to differentiate 10-character strings of only digits from those that mix digits and letters, which doesn't seem to be the case.
You can use | to check multiple patterns. However, this will return only the first match. E.g. str_extract('abcdef','abc|def') returns only abc. If you know that each URL will have no more than 1 match of any category, just do that.
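If a URL could contain more than one match, here is a minimal base R sketch of the difference between taking the first match and taking all of them. It uses regexpr and gregexpr rather than stringr (stringr's counterpart for all matches is str_extract_all); the variable names are only illustrative.

```r
x <- "abcdef"

# regexpr() locates only the first match of the alternation.
first_match <- regmatches(x, regexpr("abc|def", x))

# gregexpr() locates every non-overlapping match; regmatches() then
# returns a list with one character vector per input string.
all_matches <- regmatches(x, gregexpr("abc|def", x))[[1]]

print(first_match)  # "abc"
print(all_matches)  # "abc" "def"
```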
I like using str_extract from stringr:
library(stringr)
ProductID = str_extract(PostName, 'id[0-9]{9}|[0-9]{13}|[A-Z0-9]{10}')
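For the filtering part of the question, here is a rough sketch in base R that builds the combined pattern from a vector and keeps only the rows that match. It assumes the data is held in a data frame rather than the matrix from the question; the `books` name and the example.com row are made up for illustration.

```r
# Three URLs from the question plus one made-up URL that contains no ID.
books <- data.frame(
  FromName = c("PubA", "PubC", "PubB", "PubD"),
  PostName = c(
    "https://www.amazon.com/gp/product/1250126681/ref=as_li_tl",
    "https://us.macmillan.com/books/9781626724266",
    "https://itunes.apple.com/us/book/six-of-crows/id975448501",
    "https://example.com/about"  # hypothetical non-matching row
  ),
  stringsAsFactors = FALSE
)

# Join the individual patterns with "|" so a single regex covers all categories.
patterns <- c("id[0-9]{9}", "[0-9]{13}", "[A-Z0-9]{10}")
combined <- paste(patterns, collapse = "|")

# Keep only the rows whose URL matches at least one pattern.
matched <- books[grepl(combined, books$PostName), ]

# Extract the first match for each remaining row.
matched$ProductID <- regmatches(matched$PostName,
                                regexpr(combined, matched$PostName))

print(matched[, c("FromName", "ProductID")])
```

With dplyr and stringr loaded, the same row selection should be expressible as filter(books, str_detect(PostName, combined)).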
Upvotes: 2