In a data frame, one of the columns is textual data which looks like:
df <- data.frame("Index" = 1:3, "Content" = c("Happy 2021! word count: 2",
"Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. word count:100",
"Thank you very much for your time. word count: 7"))
The last several characters are always "word count: n". I want to extract the n and put it into a new column.
I have tried to write a function to do so:
wordCount = function (x) {
  digit = -1
  while(is.numeric(str_sub(essay$content,digit,-1))){
    digit = digit -1
  }
  str_sub(essay$content,digit,-1)
}
But it doesn't work: is.numeric(str_sub(essay$content,digit,-1)) always returns FALSE, because str_sub returns a character vector and is.numeric tests the type of its argument, not whether the text looks like a number.
Does anyone have a better approach?
Upvotes: 1
I would use stringi::stri_extract_last_regex to get the last run of digits in each string with a number-matching regex pattern. This should work:
library(stringi)
df$word_count = as.numeric(stri_extract_last_regex(df$Content, "[0-9]+"))
df["word_count"]
#   word_count
# 1          2
# 2        100
# 3          7
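If you would rather not depend on stringi, something along these lines might also work in base R. This is only a sketch: it assumes every string ends with the "word count: n" suffix described in the question, the names pattern and has_suffix are made up for illustration, and the middle row of the sample data is shortened here to keep the example readable.

```r
# Sample data modelled on the question (middle row shortened for brevity)
df <- data.frame(
  Index = 1:3,
  Content = c(
    "Happy 2021! word count: 2",
    "Today is a good day. word count:100",
    "Thank you very much for your time. word count: 7"
  )
)

# Capture the digits that follow "word count:" at the very end of each string.
# The optional whitespace handles both "word count: 2" and "word count:100".
pattern <- "^.*word count:[[:space:]]*([0-9]+)[[:space:]]*$"

# Rows that lack the expected suffix stay NA instead of picking up a stray number
has_suffix <- grepl(pattern, df$Content)
df$word_count <- NA_real_
df$word_count[has_suffix] <- as.numeric(sub(pattern, "\\1", df$Content[has_suffix]))

df["word_count"]
```

Anchoring on the "word count:" text is a bit stricter than grabbing the last digits in the string, so a row whose suffix is missing ends up as NA rather than silently taking some other number from the text.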
Upvotes: 1