karyn-h

Reputation: 133

R: Extracting the last few digits from a vector of characters

In a data frame, one of the columns is textual data which looks like:

df <- data.frame("Index" = 1:3, "Content" = c("Happy 2021! word count: 2",
"Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. Today is a good day. word count:100",
"Thank you very much for your time. word count: 7"))

The last several characters are always "word count: n". I want to extract the n and put it into a new column.

I tried writing a function to do this:

wordCount = function (x) {
  digit = -1
  while(is.numeric(str_sub(essay$content,digit,-1))){
    digit = digit -1
  }
  str_sub(essay$content,digit,-1)
}

But it doesn't work: is.numeric(str_sub(essay$content, digit, -1)) always returns FALSE, because str_sub() returns a character vector and is.numeric() checks the type of the object, not whether the text looks like a number.
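To make the type issue concrete, here is a small base R sketch. The variable names are made up for illustration, and the regex check is just one possible way to test whether a string consists only of digits:

```r
x <- "word count: 7"

# Take the last character; the result is still a character string
last_char <- substr(x, nchar(x), nchar(x))

# is.numeric() looks at the storage type, not at the contents
is.numeric(last_char)

# A regex test looks at the contents instead
grepl("^[0-9]+$", last_char)
```

The first check is FALSE and the second is TRUE, which is why the loop condition above never holds.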

Does anyone have a better approach?

Upvotes: 1

Views: 497

Answers (2)

Clemsang

Reputation: 5481

In base R you can use:

as.numeric(gsub(".*:", "", df$Content))
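If you would rather get NA than a coercion warning when a row does not end in "word count: n", a stricter variant might look like the sketch below. It assumes the count is always the last thing in the string; extract_word_count is a hypothetical helper name, and the second entry of the sample data is shortened from the question.

```r
# Sample data modelled on the question (second entry shortened)
df <- data.frame(
  Index = 1:3,
  Content = c("Happy 2021! word count: 2",
              "Today is a good day. word count:100",
              "Thank you very much for your time. word count: 7"),
  stringsAsFactors = FALSE
)

# Hypothetical helper, not part of the original answer
extract_word_count <- function(x) {
  # Capture the digits that follow a trailing "word count:"
  pattern <- "word count:\\s*([0-9]+)\\s*$"
  hits <- regmatches(x, regexec(pattern, x))
  # Each hit is c(full match, captured digits), or character(0) if no match
  vapply(
    hits,
    function(hit) if (length(hit) == 2) as.numeric(hit[2]) else NA_real_,
    numeric(1)
  )
}

df$word_count <- extract_word_count(df$Content)
df$word_count
# [1]   2 100   7
```

The one-liner above is shorter and fine for clean data; this version only matters if some rows might be missing the suffix.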

Upvotes: 2

Gregor Thomas

Reputation: 145775

I would use stringi::stri_extract_last_regex to get the last run of digits in each string, using a digit-matching regex pattern. This should work:

library(stringi)
df$word_count = as.numeric(stri_extract_last_regex(df$Content, "[0-9]+"))
df["word_count"]
#   word_count
# 1          2
# 2        100
# 3          7

Upvotes: 1
