kegoosse
kegoosse

Reputation: 27

Combine 2 columns in a new table and remove duplicates

Hi I would like to combine 2 columns from a table and create a list of unique values in a new table.

example: dataset = DT

Source    Destination
A          B
A          C
B          C

New table (DT_New)

Unique names
A
B
C

I already tried to use the following code:

DT_New <- unlist(DT)
DT_New<- data.table(DT_New)
DT_New = DT_New[!duplicated(DT_New), ]

When I use this code it returns the folowing result:

New table (DT_New)

Unique names
A
B
B
C

It seems that the code above can distinct the "B" from the column Source and the column destination. How can I avoid this issue and get the table with unique values?

Unique names
A
B
C

Upvotes: 0

Views: 89

Answers (4)

kegoosse
kegoosse

Reputation: 27

thanks for all your input. All your answers were correct, but not solving my particular problem. Therefore, I realized the problem should be in the data prep code ... Apparently, there was a whitespace at the end of each value of the source column, which doesn't show in the table view of Rstudio. Silly mistake ...

Thanks to Franks answer I saw the whitespace.

Upvotes: 0

Frank
Frank

Reputation: 66819

You could use a graph to store the data (with the igraph package) and then these names will be associated with each node/vertex:

library(igraph)
g = with(DT, graph_from_edgelist(cbind(Source, Destination)))
names(V(g))
# [1] "A" "B" "C"

You can see that the Source->Destination links are retained in the edges:

E(g)
# + 3/3 edges from 440136f (vertex names):
# [1] A->B A->C B->C

Upvotes: 1

Cole
Cole

Reputation: 11255

For larger datasets, union is a good route.

data.table(unique_col = union(DT[['Source']], DT[['Destination']]))

#   unique_col
#1:          A
#2:          B
#3:          C

identical(data.table(unique_col = union(DT[['Source']], DT[['Destination']]))
          , data.table(unique_col = unique(unlist(DT))))
#[1] TRUE

Here's an example with a million records in each column:

set.seed(1)
DT <- data.table(Source = sample(1:1E6, 1E6, replace = T)
                 , Destination = sample(1:1E6 + 500000, 1E6, replace = T))

Unit: milliseconds
      expr       min        lq      mean    median        uq       max neval
  dt_union  173.2102  179.4019  281.2719  222.8879  395.7681  535.6448    20
 dt_unique 1271.1621 1487.0988 1581.3864 1582.5068 1682.7661 1842.5898    20

Upvotes: 1

Ronak Shah
Ronak Shah

Reputation: 388807

Just unlist and select unique values

data.table::data.table(unique_col = unique(unlist(DT)))

#   unique_col
#1:          A
#2:          B
#3:          C

OR with your duplicated approach it would be

DT_New <- unlist(DT)
data.table(unique_col = DT_New[!duplicated(DT_New)])

Upvotes: 1

Related Questions