lawyeR

Reputation: 7654

Use regular expressions inside only the end portion of strings

I am pre-processing a data frame with 100,000+ blog URLs, many of which contain content from the blog header. The grep function lets me drop many of those URLs because they pertain to archives, feeds, images, attachments or a variety of other things. One such reason for dropping a URL is that it contains “atom”.

For example,

string <- "http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/"
row <- "one"
df <- data.frame(row, string)
df$string <- as.character(df$string)
df[-grep("atom", df$string), ]

My problem is that the pattern “atom” might appear in a blog header, which is important content, and I do not want to drop those URLs.

How can I concentrate the grep on only the final 20 characters (or some number that greatly reduces the risk that I will grep out content that contains the pattern rather than the ending elements)? A related question, "Regular Expressions _# at end of string", uses $ at the end but is not using R; besides, I don't know how to extend the $ back 20 characters.

Assume that the pattern is not always surrounded by forward slashes on one or both ends (e.g., /atom/).

The function substr can isolate the end portion of the strings, but I don’t know how to grep only within that portion. The pseudo-code below draws on the %in% function to try to illustrate what I would like to do.

substr(df$string, nchar(df$string)-20, nchar(df$string)) # extracts the end portion: positions nchar - 20 through nchar, which is 21 characters (use nchar - 19 for exactly 20)

But what is the next step?

string[-grep(pattern = "atom" %in% (substr(string, nchar(string)-20, nchar(string))), x = string)]

Thank you for your guidance.

Upvotes: 1

Views: 132

Answers (3)

lawyeR

Reputation: 7654

I chose the second answer because it is easier for me to understand and because with the first one it is not possible to predict how many forward slashes to include in the “component depth”.

The second answer, translated into English from the innermost function to the outermost, says: define the final 20 characters of your string with the substr() function (your substring);

then check whether the pattern “atom” is in that substring with the grep() function;

then test whether “atom” was found at least once in the substring, i.e., whether the result has length greater than zero, in which case that row will be omitted;

finally, if the pattern is not matched, i.e., no “atom” is found in the final 20 characters, leave the row alone. All of this is done with an if…else construct.
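The steps above can be sketched per row with grepl(), which returns one TRUE/FALSE per string and so can be used directly as a row filter. This is a minimal sketch rather than the answer's exact code; the helper name tail_has_pattern, the default n = 20, and the sample URLs are assumptions made for illustration.

```r
# Hypothetical helper: TRUE where `pattern` occurs within the last `n` characters of each string
tail_has_pattern <- function(x, pattern, n = 20) {
  len <- nchar(x)
  # pmax() keeps the start position at 1 or above for strings shorter than n
  tail_part <- substr(x, pmax(len - n + 1, 1), len)
  grepl(pattern, tail_part, fixed = TRUE)
}

# Illustrative data (made up for this sketch)
df <- data.frame(
  row = c("one", "two"),
  string = c(
    "http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/",
    "http://www.example.com/2014/05/update-on-atomic-energy-legislation/"
  ),
  stringsAsFactors = FALSE
)

# Keep only the rows where "atom" is NOT found in the final 20 characters
kept <- df[!tail_has_pattern(df$string, "atom"), ]
```

Logical indexing with ! is used here instead of df[-grep(...), ] because a negative index built from an empty grep() result would return zero rows rather than all of them.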

Upvotes: 0

Tony Breyal

Reputation: 5368

You could try using a URL component depth approach (i.e. only return df rows which contain the word "atom" after 5 slashes):

find_first_match <- function(string, pattern) {
  # Split the URL into its slash-separated components
  components <- unlist(strsplit(x = string, split = "/", fixed = TRUE), use.names = FALSE)
  matches <- grepl(pattern = pattern, x = components)
  if (any(matches)) {
    # which.max() on a logical vector gives the index of the first TRUE
    first.match <- which.max(matches)
  } else {
    first.match <- NA_integer_
  }
  return(first.match)
}

Which can be used as follows:

# Add index for first component match of "atom" in url
df$first.match <- sapply(df$string, find_first_match, pattern = "atom", USE.NAMES = FALSE)

# Return rows which have the word "atom" only after the first 5 components
df[which(df$first.match >= 6), ]

#   row                                                                                 string first.match
# 1 one http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/           6

This gives you control over which URLs to return based on the depth at which "atom" first appears.
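To see where the index 6 comes from, it may help to inspect the components that strsplit() produces for the example URL. A small sketch; the variable names url, components and atom_positions are chosen only for illustration.

```r
url <- "http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/"

# Split on "/" exactly as the function above does
components <- unlist(strsplit(url, split = "/", fixed = TRUE), use.names = FALSE)

# Positions of every component that contains "atom"
atom_positions <- which(grepl("atom", components, fixed = TRUE))

print(components)
print(atom_positions)
```

For this URL the matching positions are 6 and 8: the first match comes from the post slug ("atomic"), not from the /atom/ feed segment, so the depth threshold needs to be chosen with that in mind.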

Upvotes: 0

phonixor

Reputation: 1693

lastpart <- substr(df$string, nchar(df$string) - 20, nchar(df$string))
if (length(grep("atom", lastpart)) > 0) {
  # "atom" was found in the end portion of at least one string
} else {
  # "atom" was not found in the end portion of any string
}

You could also do it without the intermediate lastpart variable:

if (length(grep("atom", substr(df$string, nchar(df$string) - 20, nchar(df$string)))) > 0) {
  # "atom" was found in the end portion of at least one string
} else {
  # "atom" was not found in the end portion of any string
}

but this becomes harder to read (it may give slightly better performance, though).
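The substring step can also be folded into the regular expression itself, which is closer to what the question asked about extending $ backwards. A hedged sketch: the pattern "atom.{0,16}$" requires "atom" (4 characters) to be followed by at most 16 more characters before the end of the string, so the whole match lies within the final 20 characters. The sample vector urls is made up for illustration.

```r
# Illustrative URLs (assumed for this sketch)
urls <- c(
  "http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/",
  "http://www.example.com/2014/05/update-on-atomic-energy-legislation/"
)

# "atom" followed by 0 to 16 further characters, then end of string
in_tail <- grepl("atom.{0,16}$", urls)

# Drop the URLs whose final 20 characters contain "atom"
kept_urls <- urls[!in_tail]
```

Because grepl() is vectorised, this gives one TRUE/FALSE per URL, so no if/else is needed to filter rows.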

Upvotes: 1
