Reputation: 423
I have a column of strings, each holding values separated by white space, and the values need to remain strings. How can I remove the duplicate values and the values longer than 4 characters?
company counts
company1 2222 2222 45345234 425352352352 6574745 299
company2 9909 4363465246 543 323 9909 3454534534 768
I would like to end up with something like this:
company counts
company1 2222 299
company2 9909 543 323 768
Upvotes: 1
Views: 65
Reputation: 39647
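gsub could be used to remove the longer strings and the duplicates. Note that the second pattern keeps the last occurrence of each duplicated value, so the order can differ from the desired output.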
gsub("\\b[^ ]{5,}\\b *", "", dat$counts) |> # remove values longer than 4 characters
  gsub("\\b([^ ]+)\\b (?=.*\\b\\1\\b)", "", x=_, perl=TRUE) # remove values that are repeated later in the string
#[1] "2222 299" "543 323 9909 768"
Upvotes: 2
Reputation: 93813
strsplit the strings, remove the long ones and the duplicates, and paste back together:
sapply(
strsplit(dat$counts, "\\s+"),
\(x) paste(x[nchar(x) <= 4 & (!duplicated(x))], collapse=" ")
)
##[1] "2222 299" "9909 543 323 768"
Where dat was:
dat <- read.csv(text="company,counts
company1,2222 2222 45345234 425352352352 6574745 299
company2,9909 4363465246 543 323 9909 3454534534 768")
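If this needs to be applied in more than one place, the same strsplit approach can be wrapped in a small function. This is only a sketch: the name clean_counts and the max_len argument are made up for illustration, vapply is swapped in for sapply so the result is always a character vector, and dat is rebuilt with data.frame so the block runs on its own.

```r
# Hypothetical helper (name and max_len argument are illustrative):
# keep the first occurrence of each value that has at most max_len characters.
clean_counts <- function(x, max_len = 4) {
  # Split each string on runs of white space
  tokens <- strsplit(trimws(x), "\\s+")
  # Filter each token vector and paste it back into one string
  vapply(
    tokens,
    function(tok) {
      paste(tok[nchar(tok) <= max_len & !duplicated(tok)], collapse = " ")
    },
    character(1)
  )
}

# Same data as in the question, built directly as character columns
dat <- data.frame(
  company = c("company1", "company2"),
  counts = c(
    "2222 2222 45345234 425352352352 6574745 299",
    "9909 4363465246 543 323 9909 3454534534 768"
  ),
  stringsAsFactors = FALSE
)

# Overwrite the column; the values stay strings
dat$counts <- clean_counts(dat$counts)
dat$counts
```

Because duplicated() flags the second and later occurrences, this keeps the first occurrence of each value, which matches the order in the desired output.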
Upvotes: 3