Reputation: 1527
In R, I've converted a DocumentTermMatrix of 4-grams into a data frame, and now I want to split each n-gram into two columns: one with the first three words of the string, the other with the last word. I can do this in multiple steps, but given the size of the data frame I was hoping to do it in one line.
Here's what I'm trying to accomplish:
#             str_name          w123   w4 freq
# 1 One Two Three Four One Two Three Four   10
This gives me the first three words:
df <- data.frame(str_name = "One Two Three Four", freq = 10)
df %>% separate(str_name, c("w123","w4"), sep = "\\w+$", remove=FALSE)
# str_name w123 w4 freq
# 1 One Two Three Four One Two Three 10
This gives me the last word, but with a leading space:
df <- data.frame(str_name = "One Two Three Four", freq = 10)
df %>% separate(str_name, c("sp","w4"), sep = "\\w+\\s\\w+\\s\\w+", remove=FALSE)
# str_name sp w4 freq
# 1 One Two Three Four Four 10
This is the long way:
df <- data.frame(w4 = "One Two Three Four", freq = 10)
df <- df %>% separate(w4, c('w1', 'w2', 'w3', 'w4'), " ")
df$lookup <- paste(df$w1,df$w2,df$w3)
#    w1  w2    w3   w4 freq        lookup
# 1 One Two Three Four   10 One Two Three
Upvotes: 1
Views: 474
Reputation: 887731
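We can use base R methods to solve this: replace the last space with a comma using sub(), then parse the result into two columns with read.table().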
res <- cbind(df, read.table(text=sub("\\s(\\S+)$", ",\\1", df$str_name),
sep=",", header=FALSE, col.names = c("w123", "w4"), stringsAsFactors=FALSE))[c(1,3,4,2)]
res
#             str_name          w123   w4 freq
# 1 One Two Three Four One Two Three Four   10
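If the n-grams might contain commas, quotes, or # characters, the read.table() round trip could misparse them. A minimal sketch of an alternative, assuming the same df as in the question, uses two sub() calls instead. The name res2 is only illustrative, and this is a variation rather than the approach shown above:

```r
# Same example data as in the question
df <- data.frame(str_name = "One Two Three Four", freq = 10,
                 stringsAsFactors = FALSE)

# Drop the last whitespace-delimited word to keep the leading words
df$w123 <- sub("\\s+\\S+$", "", df$str_name)
# Remove everything up to the last whitespace to keep only the last word
df$w4 <- sub("^.*\\s", "", df$str_name)

# Reorder columns to match the desired layout
res2 <- df[c("str_name", "w123", "w4", "freq")]
res2
```

Both sub() calls are vectorised, so this should scale to a large data frame without a loop.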
Upvotes: 0
Reputation: 215117
Try \\s(?=\\w+$), which matches the space before the last word in the string and splits there:
df %>% separate(str_name, into = c("w123", "w4"), sep = "\\s(?=\\w+$)", remove = FALSE)
#             str_name          w123   w4 freq
# 1 One Two Three Four One Two Three Four   10
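\\s(?=[\\S]+$) is another option. It is more permissive than the pattern above: it splits at the last space in the string regardless of what follows, because \\S matches any non-whitespace character (including punctuation), whereas \\w matches only word characters.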
df %>% separate(str_name, into = c("w123", "w4"), sep = "\\s(?=[\\S]+$)", remove = FALSE)
#             str_name          w123   w4 freq
# 1 One Two Three Four One Two Three Four   10
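The difference between the two patterns only shows up when the last token contains a non-word character. A small sketch of that, using base strsplit() with perl = TRUE rather than separate() so it runs without tidyr; the second string is a made-up example, not data from the question:

```r
# Hypothetical n-grams: the second one ends in a hyphenated token
x <- c("One Two Three Four", "send me an e-mail")

# \\w+ must run to the end of the string, so the hyphen in "e-mail" blocks the match
split_w <- strsplit(x, "\\s(?=\\w+$)", perl = TRUE)
# \\S+ accepts any run of non-whitespace, so the last space always matches
split_S <- strsplit(x, "\\s(?=\\S+$)", perl = TRUE)

lengths(split_w)  # pieces per string under the \\w pattern: 2 1
lengths(split_S)  # pieces per string under the \\S pattern: 2 2
```

With real text that includes hyphens or apostrophes, the \\S version is likely the safer choice.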
Upvotes: 4