Reputation: 7654
I am pre-processing a data frame with 100,000+ blog URLs, many of which contain content from the blog header. The grep function lets me drop many of those URLs because they pertain to archives, feeds, images, attachments, or the like. One of the signs that a URL can be dropped is that it contains “atom”.
For example,
string <- "http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/"
row <- "one"
df <- data.frame(row, string)
df$string <- as.character(df$string)
df[-grep("atom", string), ]
My problem is that the pattern “atom” might appear in a blog header, which is important content, and I do not want to drop those URLs.
How can I concentrate the grep on only the final 20 characters (or some other number that greatly reduces the risk that I will grep out content that contains the pattern rather than the trailing elements)? The question Regular Expressions _# at end of string uses $ at the end, but it is not about R; besides, I don't know how to extend the $ back 20 characters.
Assume that the pattern does not always have forward slashes on one or both ends, e.g. /atom/.
The function substr
can isolate the end portion of the strings, but I don’t know how to grep only within that portion. The pseudo-code below draws on the %in% function to try to illustrate what I would like to do.
substr(df$string, nchar(df$string)-20, nchar(df$string))
# extract the tail of each string: from position nchar - 20 through the end
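For the example URL above, that call should give the following tail, which is the only portion I want grep to look at:
substr(df$string, nchar(df$string) - 20, nchar(df$string))
# [1] "n/feed/atom/archives/"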
But what is the next step?
string[-grep(pattern = "atom" %in% (substr(string, nchar(string)-20, nchar(string))), x = string)]
Thank you for your guidance.
Upvotes: 1
Views: 132
Reputation: 7654
I chose the second answer because it is easier for me to understand and because with the first one it is not possible to predict how many forward slashes to include in the “component depth”.
The second answer, translated into English from the innermost function to the outermost, says:
take the final 20 characters of your string with the substr() function; that is your substring;
then find whether the pattern “atom” appears in that substring with the grep() function;
then check whether “atom” was found at least once, i.e. whether the result has length greater than zero, in which case that row will be omitted;
finally, if no match is found, i.e. no “atom” appears in the final 20 characters, leave the row alone. All of this is handled by the if...else statement. A sketch of applying the same test to the whole column follows.
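This is my own adaptation, not the answer's exact code: it uses grepl() instead of the if...else wrapper, and the 20-character window is the one assumed in my question.
lastpart <- substr(df$string, nchar(df$string) - 20, nchar(df$string))
# keep only the rows whose final characters do not contain "atom"
df <- df[!grepl("atom", lastpart), ]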
Upvotes: 0
Reputation: 5368
You could try using a URL component depth approach (i.e. only return df rows in which the word "atom" first appears after the first 5 slashes):
find_first_match <- function(string, pattern) {
  # Split the URL into its "/"-separated components
  components <- unlist(strsplit(x = string, split = "/", fixed = TRUE), use.names = FALSE)
  # Flag the components that contain the pattern
  matches <- grepl(pattern = pattern, x = components)
  if (any(matches)) {
    first.match <- which(matches)[1]  # index of the first matching component
  } else {
    first.match <- NA
  }
  return(first.match)
}
Which can be used as follows:
# Add index of the first url component that matches "atom"
df$first.match <- sapply(df$string, find_first_match, pattern = "atom")
# Return rows in which the word "atom" first appears after the first 5 components
df[which(df$first.match >= 6), ]
# row string first.match
# 1 one http://www.example.com/2014/05/update-on-atomic-energy-legislation/feed/atom/archives/ 6
This gives you control over which URLs to return based on the depth at which "atom" appears.
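If the aim were instead to drop URLs in which "atom" first appears only deep in the path, the same index can be used the other way around; the cutoff of 7 below is only an illustrative assumption:
# Keep rows with no "atom" at all, or where "atom" first appears before component 7
df[which(is.na(df$first.match) | df$first.match < 7), ]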
Upvotes: 0
Reputation: 1693
lastpart <- substr(df$string, nchar(df$string) - 20, nchar(df$string))
if (length(grep("atom", lastpart)) > 0) {
  # atom was in there
} else {
  # atom was not in there
}
You could also do it without the lastpart variable:
if(length(grep("atom",substr(df$string, nchar(df$string)-20, nchar(df$string))))>0){
# atom was in there
} else {
# atom was not in there
}
but things become harder to read (it gives better performance, though).
Upvotes: 1