Jim

Reputation: 21

Truncate words within each element of a character vector in R

I have a data frame where one column is a character vector and every element in the vector is the full text of a document. I want to truncate the words in each element so that the maximum word length is 5 characters.

For example:

a <- c(1, 2)
b <- c("Words longer than five characters should be truncated",
       "Words shorter than five characters should not be modified")
df <- data.frame("file" = a, "text" = b, stringsAsFactors=FALSE)

head(df)
  file                                                      text
1    1     Words longer than five characters should be truncated
2    2 Words shorter than five characters should not be modified

And this is what I'm trying to get:

  file                                           text
1    1     Words longe than five chara shoul be trunc
2    2 Words short than five chara shoul not be modif

I've tried using strsplit() and strtrim() to modify each word (based in part on the question "Split vectors of words by every n words (vectors are in a list)"):

x <- unlist(strsplit(df$text, "\\s+"))
y <- strtrim(x, 5)
y
[1] "Words" "longe" "than"  "five"  "chara" "shoul" "be"    "trunc" "Words" "short" "than" 
[12] "five"  "chara" "shoul" "not"   "be"    "modif"

But I don't know if that's the right direction, because I ultimately need the words in a data frame associated with the correct row, as shown above.

Is there a way to do this using gsub and regex?

Upvotes: 1

Views: 1156

Answers (2)

hwnd

Reputation: 70750

If you're looking to utilize gsub to perform this task:

> df$text <- gsub('(?=\\b\\pL{6,}).{5}\\K\\pL*', '', df$text, perl = TRUE)
> df
#   file                                           text
# 1    1     Words longe than five chara shoul be trunc
# 2    2 Words short than five chara shoul not be modif
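
For readers unfamiliar with the pattern: the lookahead (?=\b\pL{6,}) only succeeds at the start of a word of six or more letters, .{5} consumes the first five characters, \K drops them from the match, and \pL* matches the remaining letters, which are then replaced with nothing. A minimal sketch of what should be an equivalent form, using a capture group instead of the lookahead and \K (the names x and out are illustrative only, not from the original answer):

```r
# Illustrative input: the same two sentences as in the question
x <- c("Words longer than five characters should be truncated",
       "Words shorter than five characters should not be modified")

# Keep the first five letters of each word (capture group 1) and drop
# the letters that follow; words of five letters or fewer never match.
out <- gsub("\\b(\\pL{5})\\pL+", "\\1", x, perl = TRUE)
print(out)
```

Both forms leave non-letter characters such as digits and punctuation untouched, because \pL matches letters only.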

Upvotes: 1

Molx

Reputation: 6931

You were on the right track. For your idea to work, however, you have to do the split/trim/combine for each row separately. Here's a way to do it. I was verbose on purpose to make each step clear, but you can obviously use fewer lines.

df$text <- sapply(df$text, function(str) {
  str <- unlist(strsplit(str, " "))
  str <- strtrim(str, 5)
  str <- paste(str, collapse = " ")
  str
})

And the output:

> df
  file                                           text
1    1     Words longe than five chara shoul be trunc
2    2 Words short than five chara shoul not be modif

The short version is

df$text <- sapply(df$text, function(str) {
  paste(strtrim(unlist(strsplit(str, " ")), 5), collapse = " ")  
})
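
One detail worth knowing: sapply() on a character vector names its result after the input strings, so the new column carries the original text as names. A sketch of a variant that avoids this and always returns a plain character vector, using vapply() over the list that strsplit() returns (the helper name truncate_words and its argument n are hypothetical, introduced only for illustration):

```r
# Hypothetical helper: truncate every word in each string to at most n characters
truncate_words <- function(text, n = 5) {
  # strsplit() is vectorised: it returns one vector of words per input string
  words <- strsplit(text, "\\s+")
  # Trim each word and glue the words back together, one string per element
  vapply(words, function(w) paste(strtrim(w, n), collapse = " "), character(1))
}

x <- c("Words longer than five characters should be truncated",
       "Words shorter than five characters should not be modified")
result <- truncate_words(x)
print(result)
```

It would be used as df$text <- truncate_words(df$text). Note that splitting on "\\s+" and rejoining with a single space collapses runs of whitespace, which may or may not be what you want for full documents.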

Edit:

I just realized you asked if it is possible to do this using gsub and regex. Even though you don't need those for this, it's still possible, but harder to read:

df$text <- sapply(df$text, function(str) {
  str <- unlist(strsplit(str, " "))
  str <- gsub("(?<=.{5}).+", "", str, perl = TRUE)
  str <- paste(str, collapse = " ")
  str
})

The regex matches everything after the first 5 characters of each word and replaces it with an empty string. perl = TRUE is necessary to enable the regex lookbehind ((?<=.{5})).

Upvotes: 0
