user3614783

Reputation: 841

Filtering using multiple regex pattern specifications in R

I have a filtering problem where I'm trying to select observations based on the presence of strings with any one of 4 different patterns:

  1. A string of 10 consecutive digits (ex. 1250126681)
  2. A string of 13 consecutive digits (ex. 9781626724266)
  3. A string beginning with "id" and followed by 9 consecutive digits (ex. id975448501)
  4. A string (length=10) of a combination of capitalized letters and digits (ex. B004TLHNOC)

This is a sample dataset:

FromName<-c("PubA","PubB","PubC","PubB","PubC","PubB")
PostName<-c("https://www.amazon.com/gp/product/1250126681/ref=as_li_tl","https://www.amazon.com/dp/B004TLHNOC/ref=sr_1_1","https://us.macmillan.com/books/9781626724266",
            "https://itunes.apple.com/us/book/six-of-crows/id975448501","http://www.anrdoezrs.net/links/7992675/type/dlg/sid/MAChhjan32018eman/https://www.barnesandnoble.com/w/something-to-howl-about-christine-warren/1127202692","https://www.amazon.com/Beach-House-Cookbook-Mary-Andrews-ebook/dp/B01M2UZS7F/")
df<-cbind(FromName,PostName)

The output should look like this:

(image of the expected output not available)

I think the regexes for the first 2 patterns are: ^[0-9]{10}$ and ^[0-9]{13}$, and I think [a-z]{2}\d{9} should work for selecting observations with the third pattern, but I'm stuck on pattern #4. I'm also unsure of how to combine multiple regex patterns into a dplyr filter function.

Upvotes: 0

Views: 1049

Answers (2)

Onyambu

Reputation: 79208

Since your data above is a matrix (it was built with cbind), you can cbind the results back together. One option is to strip everything from "ref" onward and then take the basename:

cbind(df[,1],basename(sub("ref.*","",df[,2])))
     [,1]   [,2]           
[1,] "PubA" "1250126681"   
[2,] "PubB" "B004TLHNOC"   
[3,] "PubC" "9781626724266"
[4,] "PubB" "id975448501"  
[5,] "PubC" "1127202692"   
[6,] "PubB" "B01M2UZS7F"  

or you can do:

cbind(df[,1],sub(".*\\/(\\w+)\\/(?>ref.*)|.*\\/(\\w+)\\/?$","\\1\\2",df[,2],perl = TRUE))
     [,1]   [,2]           
[1,] "PubA" "1250126681"   
[2,] "PubB" "B004TLHNOC"   
[3,] "PubC" "9781626724266"
[4,] "PubB" "id975448501"  
[5,] "PubC" "1127202692"   
[6,] "PubB" "B01M2UZS7F"   

Upvotes: 0

Tyr Wiesner-Hanks

Reputation: 1403

The following regexes should work:

[0-9]{10}
[0-9]{13}
id[0-9]{9}
[A-Z0-9]{10}

You should not need ^ and $, as you're trying to match something inside the string. ^abc$ would match abc, but not xabcx.
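
To make the point about anchors concrete, here is a minimal base R sketch using two URLs from the question. The variable names are mine, and the last pattern (delimiting the digits with slashes or the end of the string) is only one possible way to avoid matching 10 digits inside a longer run; it is an assumption about how the IDs are delimited, not something the question states.

```r
url_10 <- "https://www.amazon.com/gp/product/1250126681/ref=as_li_tl"
url_13 <- "https://us.macmillan.com/books/9781626724266"

# Anchored: the whole string would have to be exactly 10 digits
anchored <- grepl("^[0-9]{10}$", url_10)

# Unanchored: 10 digits anywhere in the string
unanchored <- grepl("[0-9]{10}", url_10)

# Caveat: the unanchored pattern also matches inside a 13-digit run
inside_13 <- grepl("[0-9]{10}", url_13)

# Hypothetical stricter variant: exactly 10 digits between "/" or string edges
strict <- "(^|/)[0-9]{10}(/|$)"
strict_10 <- grepl(strict, url_10)
strict_13 <- grepl(strict, url_13)

print(c(anchored, unanchored, inside_13, strict_10, strict_13))
```

Whether the stricter variant is needed depends on whether a 13-digit ID should ever count as a 10-digit match in your data.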

The first pattern will return a subset of the fourth: any string of 10 digits is also a string of 10 digits or capital letters. So you only need three patterns to match those four categories, unless you want to differentiate 10-character strings of ONLY digits from those mixing digits and letters, but it doesn't seem like that's the case.
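
If you did want to tell the categories apart, one possible sketch is to classify IDs that have already been extracted, testing the most specific patterns first. The labels and the variable names below are made up for illustration; here the anchors are appropriate because each element is the ID itself rather than a full URL.

```r
# IDs taken from the examples in the question
ids <- c("1250126681", "B004TLHNOC", "9781626724266", "id975448501")

# Test the narrower patterns first so that all-digit IDs are not
# swallowed by the broader letters-or-digits pattern
id_type <- ifelse(grepl("^id[0-9]{9}$", ids), "id + 9 digits",
           ifelse(grepl("^[0-9]{13}$", ids), "13 digits",
           ifelse(grepl("^[0-9]{10}$", ids), "10 digits",
           ifelse(grepl("^[A-Z0-9]{10}$", ids), "10 letters/digits", NA))))

print(data.frame(ids, id_type))
```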

You can use | to check multiple patterns. However, str_extract returns only the first match in each string. E.g. str_extract('abcdef','abc|def') returns only abc. If you know that each URL will have no more than 1 match of any category, just do that.
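
The same first-match behaviour can be seen in base R, which may help if stringr is not installed. This is a rough equivalent rather than what str_extract does internally: regexpr finds the first match and gregexpr finds all of them.

```r
x <- "abcdef"

# First match only (comparable to extracting a single match per string)
first_match <- regmatches(x, regexpr("abc|def", x))

# All matches in the string
all_matches <- regmatches(x, gregexpr("abc|def", x))[[1]]

print(first_match)
print(all_matches)
```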

I like using str_extract from stringr:

library(stringr)
ProductID = str_extract(PostName, 'id[0-9]{9}|[0-9]{13}|[A-Z0-9]{10}')
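
For the filtering part of the question, here is a sketch in base R that keeps only the rows whose URL matches the combined pattern and then extracts the ID. It assumes the data is held in a data frame rather than the matrix from the question, and the last row (PubX with an example.com URL) is a hypothetical non-matching row added only to show the filter doing something. With dplyr and stringr, the analogous step would be filter combined with str_detect.

```r
pattern <- "id[0-9]{9}|[0-9]{13}|[A-Z0-9]{10}"

dat <- data.frame(
  FromName = c("PubA", "PubB", "PubB", "PubX"),
  PostName = c(
    "https://www.amazon.com/gp/product/1250126681/ref=as_li_tl",
    "https://www.amazon.com/dp/B004TLHNOC/ref=sr_1_1",
    "https://itunes.apple.com/us/book/six-of-crows/id975448501",
    "https://example.com/about"  # hypothetical row with no product ID
  ),
  stringsAsFactors = FALSE
)

# Keep rows where any of the three patterns matches
keep <- grepl(pattern, dat$PostName)
matched <- dat[keep, ]

# Extract the first match per remaining row; every row has one after filtering
matched$ProductID <- regmatches(matched$PostName,
                                regexpr(pattern, matched$PostName))

print(matched[, c("FromName", "ProductID")])
```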

Upvotes: 2
