Munrock

Reputation: 423

Remove duplicate values in a cell without removing the row

I have a column of strings, each holding values separated by white space, and the column needs to remain a string column. How can I remove the duplicate values and the values longer than 4 characters within each cell?

company        counts 
company1       2222 2222 45345234 425352352352 6574745 299
company2       9909 4363465246 543 323 9909 3454534534 768 

I would like to end up with something like this:

company        counts 
company1       2222 299
company2       9909 543 323 768 

Upvotes: 1

Views: 65

Answers (2)

GKi

Reputation: 39647

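gsub could be used to remove the longer strings and the duplicates. The second call drops the earlier copy of a duplicated value, so the last occurrence is the one kept (which is why 9909 ends up later in the second result). The _ placeholder needs R 4.2 or later.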

gsub("\\b[^ ]{5,}\\b *", "", dat$counts) |>                #Remove longer than 4
gsub("\\b([^ ]+)\\b (?=.*\\b\\1\\b)", "", x=_, perl=TRUE) #Remove duplicated
#[1] "2222 299"         "543 323 9909 768"
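For older R versions, or when a long value sits at the very end of a cell, the same two regex steps can be written as nested calls with a final trim. This is only a sketch: dedupe_regex and max_chars are made-up names, and it assumes the values are made of word characters such as digits, so that \b behaves as expected.

```r
# Hypothetical helper: regex-only cleanup without |> or the `_` placeholder.
dedupe_regex <- function(x, max_chars = 4) {
  # Drop values longer than max_chars, plus any spaces that follow them.
  long_pat <- sprintf("\\b[^ ]{%d,}\\b *", max_chars + 1)
  x <- gsub(long_pat, "", x, perl = TRUE)
  # Drop a value when the same value appears again later in the string,
  # so the last occurrence is the one that survives.
  x <- gsub("\\b([^ ]+)\\b (?=.*\\b\\1\\b)", "", x, perl = TRUE)
  # Trim any space left behind at either end.
  trimws(x)
}

counts <- c("2222 2222 45345234 425352352352 6574745 299",
            "9909 4363465246 543 323 9909 3454534534 768 ")
dedupe_regex(counts)
```

Like the two-step version, this keeps the last occurrence of a repeated value, so the order can differ from the requested output.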

Upvotes: 2

thelatemail

Reputation: 93813

strsplit the strings, remove the long ones and the duplicates and paste back together:

sapply(
    strsplit(dat$counts, "\\s+"),
    \(x) paste(x[nchar(x) <= 4 & (!duplicated(x))], collapse=" ")
)
##[1] "2222 299"         "9909 543 323 768"

Where dat was:

dat <- read.csv(text="company,counts 
company1,2222 2222 45345234 425352352352 6574745 299
company2,9909 4363465246 543 323 9909 3454534534 768")
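To keep every row and overwrite the column in place, the strsplit approach can be wrapped in a small function and assigned back. A minimal sketch, where clean_counts and max_chars are illustrative names and dat is rebuilt with data.frame so the snippet runs on its own:

```r
dat <- data.frame(
  company = c("company1", "company2"),
  counts  = c("2222 2222 45345234 425352352352 6574745 299",
              "9909 4363465246 543 323 9909 3454534534 768"),
  stringsAsFactors = FALSE
)

# Hypothetical helper wrapping the strsplit approach.
clean_counts <- function(x, max_chars = 4) {
  # Trim first so leading white space does not create an empty value.
  tokens <- strsplit(trimws(x), "\\s+")
  vapply(
    tokens,
    # Keep short values, and only the first occurrence of each.
    function(tok) paste(tok[nchar(tok) <= max_chars & !duplicated(tok)], collapse = " "),
    character(1)
  )
}

# Same number of rows; counts stays a character column.
dat$counts <- clean_counts(dat$counts)
```

vapply is used instead of sapply only so the result is guaranteed to be a character vector, one element per row.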

Upvotes: 3
