Ryan Parsons

Reputation: 119

Filling in data frame column using regular expressions (?)

Ok, so I have a data frame of web forum comments. Each row has a cell containing an ID that is part of the link to that comment's parent comment. Each row also contains the full permalink to the comment itself, of which the comment's own ID is the varying part.

I'd like to add a column that shows the user name attached to that parent comment. I'm assuming I'll need to use some regular expression function, which I find mystifying at this point.

In workflow terms, I need to find the row whose URL contains the parent comment ID and grab the user name from that row. Here's a toy example:

toy <- rbind(c("yes?", "john", "www.website.com/4908", "3214", NA), c("don't think so", "mary", "www.website.com/3958", "4908", NA))
toy <- as.data.frame(toy)
colnames(toy) <- c("comment", "user", "URL", "parent", "parent_user")

         comment user                  URL parent parent_user
1           yes? john www.website.com/4908   3214        <NA>
2 don't think so mary www.website.com/3958   4908        <NA>

which needs to become:

         comment user                  URL parent parent_user
1           yes? john www.website.com/4908   3214        <NA>
2 don't think so mary www.website.com/3958   4908        john

Some values in this column will be NA, since they belong to top-level comments. So something like:

dataframe$parent_user <- dataframe['the row where parent ID i is found in the URL column', 'the user name column in that row']

Thanks!!

Upvotes: 3

Views: 166

Answers (3)

akrun

Reputation: 887981

Here is a vectorized option with stri_extract and match:

library(stringi)
toy$parent_user <- toy$user[match(toy$parent,stri_extract(toy$URL, 
            regex=paste(toy$parent, collapse="|")))]
toy
#         comment user                  URL parent parent_user
#1           yes? john www.website.com/4908   3214        <NA>
#2 don't think so mary www.website.com/3958   4908        john
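
To make the lookup step easier to follow, here is a minimal sketch of what match does with the extracted IDs. The vector names url_id and idx are illustrative only, and url_id is typed in by hand to mirror what the alternation pattern "3214|4908" should pull out of the two toy URLs (the second URL contains neither ID, so it gives NA).

```r
# Hand-written stand-ins for the toy columns (names are illustrative only).
user   <- c("john", "mary")
parent <- c("3214", "4908")

# What extracting "3214|4908" from each URL yields for the toy data:
# row 1's URL contains 4908, row 2's URL contains neither ID.
url_id <- c("4908", NA)

# match() gives, for each parent ID, the row whose URL carries that ID,
# or NA when no row does (a top-level comment's parent is not in the data).
idx <- match(parent, url_id)
idx
# [1] NA  1

# Indexing user by those positions returns the parent's user name.
user[idx]
# [1] NA     "john"
```

Indexing with NA returns NA, which is why top-level comments fall out as NA without any special handling.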

Or, as @jazzurro mentioned, a faster option would be to use stri_extract_last_regex with data.table and fmatch (from the fastmatch package):

library(data.table)
library(fastmatch)
setDT(toy)[, parent_user := user[fmatch(parent, 
                  stri_extract_last_regex(str=URL, pattern = "\\d+"))]]

Or a base R option would be:

with(toy, user[match(parent, sub("\\D+", "", URL))])
#[1] <NA> john
#Levels: john mary

By character count, this base R version is also shorter than the basename version:

nchar('with(toy, user[match(parent, sub("\\D+", "", URL))])')
#[1] 51

nchar('toy$user[match(toy$parent, basename(as.character(toy$URL)))]')
#[1] 60
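
One caveat with the base R pattern, shown as a small sketch: sub("\\D+", "", URL) removes only the first run of non-digits, so it relies on the host part of the URL having no digits. The second URL below is a made-up example (not from the question) with a digit in the host, and stripping everything up to the last slash is just one possible alternative.

```r
# First URL is from the toy data; the second is hypothetical, with a digit in the host.
urls <- c("www.website.com/4908", "www.web2site.com/3958")

# Removing the first run of non-digits works only when the host has no digits.
first_run <- sub("\\D+", "", urls)
first_run
# [1] "4908"           "2site.com/3958"

# Removing everything up to and including the last slash keeps just the ID.
after_slash <- sub(".*/", "", urls)
after_slash
# [1] "4908" "3958"
```

For the URLs in the question both patterns give the same IDs, so this only matters if the real permalinks look different from the toy ones.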

Upvotes: 4

thelatemail

Reputation: 93938

Another option, using the basename function from base R, which "removes all of the path up to and including the last path separator (if any)":

toy$user[match(toy$parent, basename(as.character(toy$URL)))]
#[1] <NA> john
#Levels: john mary
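
To actually fill the column, the result can be assigned back. Here is a minimal sketch, assuming the toy data is rebuilt with character columns (stringsAsFactors = FALSE), which is my assumption and not how the question constructs it. With character columns the as.character call is not needed and the new column has no factor levels.

```r
# Toy data rebuilt with character columns instead of factors (an assumption).
toy <- data.frame(
  comment = c("yes?", "don't think so"),
  user    = c("john", "mary"),
  URL     = c("www.website.com/4908", "www.website.com/3958"),
  parent  = c("3214", "4908"),
  stringsAsFactors = FALSE
)

# basename() keeps whatever follows the last "/", i.e. the comment ID;
# match() then finds the row holding each parent ID (NA if there is none).
toy$parent_user <- toy$user[match(toy$parent, basename(toy$URL))]
toy$parent_user
# [1] NA     "john"
```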

Upvotes: 6

alistaire

Reputation: 43364

Perhaps not the prettiest way to do it, but an option:

toy$parent_user <- sapply(toy$parent, 
                          function(x){p <- toy[x == sub('[^0-9]*', '', toy$URL), 'user'];
                                      ifelse(length(p) > 0, as.character(p), NA)})

toy
#          comment user                  URL parent parent_user
# 1           yes? john www.website.com/4908   3214        <NA>
# 2 don't think so mary www.website.com/3958   4908        john

The ifelse call is really just there to return NA for parents that have no matching row.
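
The same idea can be written with vapply, which guarantees a character vector back even when nothing matches. This is only a sketch: lookup_parent_user is a hypothetical helper name, the toy data is rebuilt with character columns, and if several rows shared an ID only the first user would be kept.

```r
# Toy data rebuilt with character columns instead of factors (an assumption).
toy <- data.frame(
  comment = c("yes?", "don't think so"),
  user    = c("john", "mary"),
  URL     = c("www.website.com/4908", "www.website.com/3958"),
  parent  = c("3214", "4908"),
  stringsAsFactors = FALSE
)

# Strip the leading non-digits once, outside the loop.
ids <- sub("[^0-9]*", "", toy$URL)

# Hypothetical helper: user of the row whose ID equals x, or NA if none.
lookup_parent_user <- function(x) {
  p <- toy$user[which(ids == x)]   # which() also drops NA comparisons
  if (length(p) > 0) p[1] else NA_character_
}

toy$parent_user <- vapply(toy$parent, lookup_parent_user,
                          character(1), USE.NAMES = FALSE)
toy$parent_user
# [1] NA     "john"
```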

Upvotes: 4
