Filtering using multiple regex pattern specifications in R

Question

I have a filtering problem where I'm trying to select observations based on the presence of strings with any one of 4 different patterns:

A string of 10 consecutive digits (ex. 1250126681)
A string of 13 consecutive digits (ex. 9781626724266)
A string beginning with "id" and followed by 9 consecutive digits (ex. id975448501)
A string (length=10) of a combination of capitalized letters and digits (ex. B004TLHNOC)

This is a sample dataset:

FromName<-c("PubA","PubB","PubC","PubB","PubC","PubB")
PostName<-c("https://www.amazon.com/gp/product/1250126681/ref=as_li_tl","https://www.amazon.com/dp/B004TLHNOC/ref=sr_1_1","https://us.macmillan.com/books/9781626724266",
            "https://itunes.apple.com/us/book/six-of-crows/id975448501","http://www.anrdoezrs.net/links/7992675/type/dlg/sid/MAChhjan32018eman/https://www.barnesandnoble.com/w/something-to-howl-about-christine-warren/1127202692","https://www.amazon.com/Beach-House-Cookbook-Mary-Andrews-ebook/dp/B01M2UZS7F/")
df<-data.frame(FromName,PostName,stringsAsFactors=FALSE)

The output should look like this:

I think the regexes for the first 2 patterns are: ^[0-9]{10}$ and ^[0-9]{13}$, and I think [a-z]{2}\d{9} should work for selecting observations with the third pattern, but I'm stuck on pattern #4. I'm also unsure of how to combine multiple regex patterns into a dplyr filter function.

Tyr Wiesner-Hanks · Accepted Answer

The following regexes should work:

[0-9]{10}
[0-9]{13}
id[0-9]{9}
[A-Z0-9]{10}

You don't need ^ and $, because you're trying to match something inside a longer string. ^abc$ matches abc, but not xabcx.
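As a quick check of the point about anchors, here is a minimal sketch using str_detect from stringr on the first URL from the question. The variable names are only illustrative.

```r
library(stringr)

# First URL from the question's sample data
url <- "https://www.amazon.com/gp/product/1250126681/ref=as_li_tl"

# Anchored: the whole string would have to be exactly 10 digits
anchored <- str_detect(url, "^[0-9]{10}$")    # FALSE

# Unanchored: 10 consecutive digits anywhere in the string
unanchored <- str_detect(url, "[0-9]{10}")    # TRUE
```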

The first pattern matches a subset of what the fourth matches: any string of 10 digits is also a string of 10 digits or capital letters. So you only need three patterns to cover those four categories, unless you want to distinguish 10-character strings of ONLY digits from those that mix digits and letters, which doesn't seem to be the case here.
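The overlap is easy to confirm on the two 10-character IDs from the question. This is only a sketch: it assumes the IDs have already been pulled out of their URLs, which is why anchors are appropriate here.

```r
library(stringr)

# Two already-extracted IDs from the question's examples
ids <- c("1250126681", "B004TLHNOC")

# Both match the capital-letters-or-digits pattern
matches_alnum <- str_detect(ids, "^[A-Z0-9]{10}$")   # TRUE TRUE

# Only the first is digits-only, should you ever need to tell them apart
matches_digits <- str_detect(ids, "^[0-9]{10}$")     # TRUE FALSE
```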

You can use | to check multiple patterns in a single regex. However, str_extract will return only the first match. E.g. str_extract('abcdef','abc|def') returns only abc. If you know that each URL will have no more than 1 match of any category, that is all you need.
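If a string could contain more than one match, stringr also provides str_extract_all, which returns every match rather than just the first. A small sketch of the difference:

```r
library(stringr)

# str_extract stops at the first match
first_only <- str_extract("abcdef", "abc|def")             # "abc"

# str_extract_all returns a list with one character vector per input string
all_matches <- str_extract_all("abcdef", "abc|def")[[1]]   # "abc" "def"
```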

I like using str_extract from stringr:

library(stringr)
ProductID = str_extract(PostName, 'id[0-9]{9}|[0-9]{13}|[A-Z0-9]{10}')
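To use the same pattern inside a dplyr filter, which is what the question asks about, one option is str_detect. The sketch below assumes the data is a data frame rather than the matrix that cbind produces. The third URL is a made-up non-matching row added for illustration, and id_pattern and matched are illustrative names.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  FromName = c("PubA", "PubC", "PubX"),
  PostName = c(
    "https://www.amazon.com/gp/product/1250126681/ref=as_li_tl",
    "https://us.macmillan.com/books/9781626724266",
    "https://example.com/no-product-id-here"  # hypothetical row with no ID
  ),
  stringsAsFactors = FALSE
)

# The 13-digit alternative comes before the 10-character one, so a
# 13-digit ID is not cut short to its first 10 digits.
id_pattern <- "id[0-9]{9}|[0-9]{13}|[A-Z0-9]{10}"

matched <- df %>%
  filter(str_detect(PostName, id_pattern)) %>%             # keep rows with an ID
  mutate(ProductID = str_extract(PostName, id_pattern))    # pull the ID out
```

Here matched keeps the first two rows, with ProductID values 1250126681 and 9781626724266.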
