Reputation: 8247
I have the following data frame in R:
Number ship_no
4432 1
4432 2
4564 1
4389 5
6578 6
4389 3
4355 10
4355 10
I want to find every Number that appears more than once, together with its unique ship_no values:
Number ship_no
4432 1,2
4389 5,3
4355 10
How can I do this in R?
I attempted the following code in R:
library(dplyr)
df %>%
group_by(Number) %>%
filter(duplicated(Number)) %>%
summarize(Number = paste0(unique(ship_no), collapse = ','))
Upvotes: 2
Views: 127
Reputation: 886948
Here is an option using data.table
library(data.table)
unique(setDT(df1)[df1[, .I[.N > 1], Number]$V1])[, .(ship_no = toString(ship_no)) , Number]
# Number ship_no
#1: 4432 1, 2
#2: 4389 5, 3
#3: 4355 10
Upvotes: 1
Reputation: 38500
In base R, you can do this in three lines with aggregate and lapply. The result is a data.frame whose second column is a list containing the vector of ship_no values for each duplicated Number.
# collect ship_nos for each Number into single column
mydf <- aggregate(ship_no ~ Number, data=dat, c)
# drop rows without multiple ship_nos
mydf <- mydf[lengths(mydf[["ship_no"]]) > 1,]
# sort values in ship_no columns and drop any duplicates within each list item
mydf[["ship_no"]] <- lapply(mydf[["ship_no"]],
function(x) sort(x[!duplicated(x)]))
This returns
mydf
Number ship_no
1 4355 10
2 4389 3, 5
3 4432 1, 2
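If a plain character column is easier to work with than a list column, the same steps can be sketched as below. This is only a sketch: it assumes the question's data is stored in dat, and it passes simplify = FALSE to aggregate so that ship_no stays a list even when every Number happens to have the same number of rows (in that case aggregate would otherwise return a matrix).

```r
# Data from the question, assumed here to live in `dat`
dat <- data.frame(
  Number  = c(4432, 4432, 4564, 4389, 6578, 4389, 4355, 4355),
  ship_no = c(1, 2, 1, 5, 6, 3, 10, 10)
)

# Collect ship_nos per Number; simplify = FALSE keeps a list column
mydf <- aggregate(ship_no ~ Number, data = dat, FUN = c, simplify = FALSE)

# Keep only Numbers that occur in more than one row
mydf <- mydf[lengths(mydf[["ship_no"]]) > 1, ]

# Sort, de-duplicate, and collapse each list item into one string
mydf[["ship_no"]] <- vapply(
  mydf[["ship_no"]],
  function(x) toString(sort(unique(x))),
  character(1)
)

print(mydf)
```

toString joins the values with ", ", so ship_no ends up as an ordinary character vector.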
Upvotes: 1
Reputation: 25385
Why your solution does not work:
With the statement filter(duplicated(Number)), you keep only rows that duplicate an earlier row, so the first occurrence of each Number is dropped:
duplicated(df$Number)
[1] FALSE TRUE FALSE FALSE FALSE TRUE
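To see which rows each call flags, here is a small base R sketch on the Number column of the six-row example used in this answer. The variable names (number, later, earlier, in_repeated_group) are only illustrative. duplicated() also accepts fromLast = TRUE, which scans from the end, so combining both directions marks every member of a repeated group.

```r
# Number column of the six-row example
number <- c(4432, 4432, 4564, 4389, 6578, 4389)

# TRUE only for the second and later occurrences of a value
later <- duplicated(number)

# TRUE only for occurrences that are repeated again further down
earlier <- duplicated(number, fromLast = TRUE)

# TRUE for every row whose Number occurs more than once
in_repeated_group <- later | earlier

print(later)
print(earlier)
print(in_repeated_group)
```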
Solution 1 with data.table
library(data.table)
dt = data.table(df)
dt[,if(.N>1).(ship_no=list(ship_no)),Number]
Solution 2 with dplyr
You can combine your duplicated statement with a second duplicated call that uses fromLast=TRUE, as follows:
df = read.table(text="Number ship_no
4432 1
4432 2
4564 1
4389 5
6578 6
4389 3",header=T)
library(dplyr)
df %>% group_by(Number) %>%
filter(duplicated(Number) | duplicated(Number,fromLast=TRUE)) %>%
summarize(ship_no = paste0(unique(ship_no), collapse = ','))
Upvotes: 1
Reputation: 11728
You can do:
df %>%
group_by(Number) %>%
filter(n() > 1) %>%
summarize(ship_no = paste0(unique(ship_no), collapse = ','))
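For comparison, filter(n() > 1) keeps every row whose Number occurs more than once. A rough base R sketch of the same idea, without dplyr, might look like this. It assumes the question's data is in df; group_size, kept and result are illustrative names.

```r
# Data from the question, assumed here to live in `df`
df <- data.frame(
  Number  = c(4432, 4432, 4564, 4389, 6578, 4389, 4355, 4355),
  ship_no = c(1, 2, 1, 5, 6, 3, 10, 10)
)

# Size of each row's Number group, the base R counterpart of n()
group_size <- ave(seq_along(df$Number), df$Number, FUN = length)

# Keep rows that belong to a group with more than one row
kept <- df[group_size > 1, ]

# Collapse the unique ship_no values of each Number into one string
result <- aggregate(
  ship_no ~ Number,
  data = kept,
  FUN = function(x) paste(unique(x), collapse = ",")
)

print(result)
```

Note that aggregate orders the result by Number, whereas the order of values inside each string follows their order of appearance in df.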
Upvotes: 2