Reputation: 119
Ok, so I have a data frame of web forum comments. Each row has a cell containing an ID which is part of the link to that comment's parent comment. Each row also contains the full permalink to its own comment, of which the ID is the varying part.
I'd like to add a column that shows the user name attached to that parent comment. I'm assuming I'll need to use some regular expression function, which I find mystifying at this point.
In workflow terms, I need to find the row whose URL contains the parent comment ID, then grab the user name from that row. Here's a toy example:
toy <- rbind(c("yes?", "john", "www.website.com/4908", "3214", NA), c("don't think so", "mary", "www.website.com/3958", "4908", NA))
toy <- as.data.frame(toy)
colnames(toy) <- c("comment", "user", "URL", "parent", "parent_user")
comment user URL parent parent_user
1 yes? john www.website.com/4908 3214 <NA>
2 don't think so mary www.website.com/3958 4908 <NA>
which needs to become:
comment user URL parent parent_user
1 yes? john www.website.com/4908 3214 <NA>
2 don't think so mary www.website.com/3958 4908 john
Some values in this column will be NA, since they're top level comments. So something like,
dataframe$parent_user <- dataframe['the row where parent ID i is found in the URL column', 'the user name column in that row']
Thanks!!
Upvotes: 3
Views: 166
Reputation: 887981
Here is a vectorized option with stri_extract and match:
library(stringi)
toy$parent_user <- toy$user[match(toy$parent,stri_extract(toy$URL,
regex=paste(toy$parent, collapse="|")))]
toy
# comment user URL parent parent_user
#1 yes? john www.website.com/4908 3214 <NA>
#2 don't think so mary www.website.com/3958 4908 john
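To see why the match-based lookup works, it may help to split it into steps. The following is only a sketch: it uses base R's sub() instead of stringi to stay dependency-free, assumes the comment ID is everything after the last "/" in the URL, and builds the toy data with character columns. The names ids and idx are introduced here for illustration.

```r
# Toy data from the question, built with character (not factor) columns
toy <- data.frame(
  comment = c("yes?", "don't think so"),
  user    = c("john", "mary"),
  URL     = c("www.website.com/4908", "www.website.com/3958"),
  parent  = c("3214", "4908"),
  stringsAsFactors = FALSE
)

# Step 1: pull each comment's own ID out of its URL
# (assumption: the ID is everything after the last "/")
ids <- sub(".*/", "", toy$URL)

# Step 2: for each parent ID, find the row whose own ID equals it;
# match() returns NA where no row has that ID (top-level comments)
idx <- match(toy$parent, ids)

# Step 3: index the user column; an NA index yields an NA user
toy$parent_user <- toy$user[idx]
toy
```

Because match() and the indexing are both vectorized, no explicit loop over rows is needed.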
Or, as @jazzurro mentioned, a faster option would be using stri_extract_last_regex with data.table and fmatch:
library(data.table)
library(fastmatch)
setDT(toy)[, parent_user := user[fmatch(parent,
stri_extract_last_regex(str=URL, pattern = "\\d+"))]]
Or a base R option would be:
with(toy, user[match(parent, sub("\\D+", "", URL))])
#[1] <NA> john
#Levels: john mary
nchar('with(toy, user[match(parent, sub("\\D+", "", URL))])')
#[1] 51
nchar('toy$user[match(toy$parent, basename(as.character(toy$URL)))]')
#[1] 60
Upvotes: 4
Reputation: 93938
Another option, using the basename function from base R, which "removes all of the path up to and including the last path separator (if any)":
toy$user[match(toy$parent, basename(as.character(toy$URL)))]
#[1] <NA> john
#Levels: john mary
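The "Levels:" line in the output appears because the toy columns are factors, which is also why as.character() is needed before basename(). As a rough sketch of how one might write the result back as plain text, the snippet below forces factor columns explicitly (that was the default before R 4.0.0) and converts user to character before indexing. The variable names m and ids are illustrative only.

```r
# Rebuild the toy data from a character matrix, forcing factor columns
m <- rbind(
  c("yes?", "john", "www.website.com/4908", "3214"),
  c("don't think so", "mary", "www.website.com/3958", "4908")
)
toy <- as.data.frame(m, stringsAsFactors = TRUE)
colnames(toy) <- c("comment", "user", "URL", "parent")

# basename() expects a character vector, hence as.character() on the factor
ids <- basename(as.character(toy$URL))

# Convert user to character so the new column holds plain strings, not a factor
toy$parent_user <- as.character(toy$user)[match(toy$parent, ids)]
toy
```

If the data frame is created with character columns in the first place, both as.character() calls become unnecessary.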
Upvotes: 6
Reputation: 43364
Perhaps not the prettiest way to do it, but an option:
toy$parent_user <- sapply(toy$parent,
function(x){p <- toy[x == sub('[^0-9]*', '', toy$URL), 'user'];
ifelse(length(p) > 0, as.character(p), NA)})
toy
# comment user URL parent parent_user
# 1 yes? john www.website.com/4908 3214 <NA>
# 2 don't think so mary www.website.com/3958 4908 john
The ifelse() call is really just to deal with cases lacking matches.
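A small sketch of why that guard matters: subsetting with an ID that matches no row gives a zero-length vector, and sapply() cannot simplify results of mixed lengths. The example below uses two standalone vectors (users and ids, named here for illustration) and swaps in vapply() with an if/else guard as one possible stricter variant. It is not the code from the answer above.

```r
users <- c("john", "mary")
ids   <- c("4908", "3958")

# Subsetting with an ID that matches no row gives a zero-length vector
no_hit <- users[ids == "3214"]
length(no_hit)

# Without a guard, the results have different lengths,
# so sapply() falls back to returning a list
unguarded <- sapply(c("3214", "4908"), function(x) users[ids == x])
is.list(unguarded)

# With a guard, every call returns exactly one string,
# and vapply() checks that this really is the case
guarded <- vapply(c("3214", "4908"), function(x) {
  p <- users[ids == x]
  if (length(p) > 0) p[1] else NA_character_
}, character(1), USE.NAMES = FALSE)
guarded
```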
Upvotes: 4