
occurrences of duplicated values and returning unique values separated by comma in R

I have the following data frame in R:

 Number      ship_no
  4432          1
  4432          2
  4564          1
  4389          5
  6578          6
  4389          3
  4355          10
  4355          10

I want to find the Number values that occur more than once and, for each of them, list the unique ship_no values separated by commas:

 Number       ship_no
  4432          1,2
  4389          5,3
  4355          10

How can I do this in R?

I attempted the following code:

library(dplyr)
group_by(Number) %>%
filter(duplicated(Number)) %>%
summarize(Number = paste0(unique(ship_no), collapse = ','))

Upvotes: 2

Answers (4)

akrun

Reputation: 886948

Here is an option using data.table

library(data.table)
unique(setDT(df1)[df1[,  .I[.N > 1], Number]$V1])[, .(ship_no = toString(ship_no)) , Number]
#    Number ship_no
#1:   4432    1, 2
#2:   4389    5, 3
#3:   4355      10

Upvotes: 1

lmo

Reputation: 38500

In base R, you can do this in three lines with aggregate and lapply. The result is a data.frame whose second column is a list containing the vectors of ship_no values for each duplicated Number.

# collect ship_nos for each Number into single column
mydf <- aggregate(ship_no ~ Number, data=dat, c)
# drop rows without multiple ship_nos
mydf <- mydf[lengths(mydf[["ship_no"]]) > 1,]
# sort values in ship_no columns and drop any duplicates within each list item
mydf[["ship_no"]] <- lapply(mydf[["ship_no"]],
                            function(x) sort(x[!duplicated(x)]))

This returns

mydf
  Number ship_no
1   4355      10
2   4389    3, 5
3   4432    1, 2
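
If a plain comma-separated character column is wanted instead of a list column, each list element can be collapsed with paste. The following is only a sketch: it rebuilds the question's data under the name dat (as assumed in the code above) and folds the sort and de-duplication steps into a single vapply call.

```r
# Sample data from the question; the name `dat` follows the answer above
dat <- data.frame(
  Number  = c(4432, 4432, 4564, 4389, 6578, 4389, 4355, 4355),
  ship_no = c(1, 2, 1, 5, 6, 3, 10, 10)
)

# Collect ship_nos for each Number; groups differ in size, so this is a list column
mydf <- aggregate(ship_no ~ Number, data = dat, FUN = c)

# Keep only Numbers that occur more than once
mydf <- mydf[lengths(mydf[["ship_no"]]) > 1, ]

# Sort, de-duplicate, and collapse each list element into one string
mydf[["ship_no"]] <- vapply(
  mydf[["ship_no"]],
  function(x) paste(sort(unique(x)), collapse = ","),
  character(1),
  USE.NAMES = FALSE
)

print(mydf)
```

After this step ship_no is an ordinary character vector, which is easier to print or write to a file than a list column.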

Upvotes: 1

Florian

Reputation: 25385

Why your solution does not work:

With the statement

filter(duplicated(Number))

you keep only the rows that duplicate an earlier row, so the first occurrence of each Number is dropped:

duplicated(df$Number)
[1] FALSE  TRUE FALSE FALSE FALSE  TRUE
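
To see which rows are lost, it can help to compare duplicated() scanning forwards with duplicated() scanning backwards. This is a small base R sketch on the Number column of the first six rows of the question's data; the variable names later, earlier and in_dup_group are made up for illustration.

```r
# Number column from the first six rows of the question's data
Number <- c(4432, 4432, 4564, 4389, 6578, 4389)

# TRUE only for the second and later occurrences of a value
later <- duplicated(Number)

# TRUE only for occurrences that are followed by the same value further down
earlier <- duplicated(Number, fromLast = TRUE)

# TRUE for every row whose Number is repeated, first occurrence included
in_dup_group <- later | earlier

print(data.frame(Number, later, earlier, in_dup_group))
```

Filtering on later alone keeps one row each for 4432 and 4389, whereas in_dup_group keeps both rows of each.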

Solution 1 with data.table

library(data.table)
dt = data.table(df)
dt[,if(.N>1).(ship_no=list(ship_no)),Number]

Solution 2 with dplyr

You can combine your duplicated call with a second duplicated call that uses fromLast=TRUE, so that the first occurrence of each repeated Number is kept as well:

df = read.table(text="Number      ship_no
4432          1
4432          2
4564          1
4389          5
6578          6
4389          3", header = TRUE)

library(dplyr)
df %>% group_by(Number) %>%
  filter(duplicated(Number) | duplicated(Number,fromLast=TRUE)) %>%
  summarize(ship_no = paste0(unique(ship_no), collapse = ','))

Upvotes: 1

F. Privé

Reputation: 11728

You can do:

df %>%
  group_by(Number) %>%
  filter(n() > 1) %>%
  summarize(ship_no = paste0(unique(ship_no), collapse = ','))

Upvotes: 2
