Reputation: 27
Hi I would like to combine 2 columns from a table and create a list of unique values in a new table.
example: dataset = DT
Source Destination
A B
A C
B C
New table (DT_New)
Unique names
A
B
C
I already tried to use the following code:
DT_New <- unlist(DT)
DT_New<- data.table(DT_New)
DT_New = DT_New[!duplicated(DT_New), ]
When I use this code it returns the folowing result:
New table (DT_New)
Unique names
A
B
B
C
It seems that the code above can distinct the "B" from the column Source and the column destination. How can I avoid this issue and get the table with unique values?
Unique names
A
B
C
Upvotes: 0
Views: 89
Reputation: 27
thanks for all your input. All your answers were correct, but not solving my particular problem. Therefore, I realized the problem should be in the data prep code ... Apparently, there was a whitespace at the end of each value of the source column, which doesn't show in the table view of Rstudio. Silly mistake ...
Thanks to Franks answer I saw the whitespace.
Upvotes: 0
Reputation: 66819
You could use a graph to store the data (with the igraph package) and then these names will be associated with each node/vertex:
library(igraph)
g = with(DT, graph_from_edgelist(cbind(Source, Destination)))
names(V(g))
# [1] "A" "B" "C"
You can see that the Source->Destination links are retained in the edges:
E(g)
# + 3/3 edges from 440136f (vertex names):
# [1] A->B A->C B->C
Upvotes: 1
Reputation: 11255
For larger datasets, union
is a good route.
data.table(unique_col = union(DT[['Source']], DT[['Destination']]))
# unique_col
#1: A
#2: B
#3: C
identical(data.table(unique_col = union(DT[['Source']], DT[['Destination']]))
, data.table(unique_col = unique(unlist(DT))))
#[1] TRUE
Here's an example with a million records in each column:
set.seed(1)
DT <- data.table(Source = sample(1:1E6, 1E6, replace = T)
, Destination = sample(1:1E6 + 500000, 1E6, replace = T))
Unit: milliseconds
expr min lq mean median uq max neval
dt_union 173.2102 179.4019 281.2719 222.8879 395.7681 535.6448 20
dt_unique 1271.1621 1487.0988 1581.3864 1582.5068 1682.7661 1842.5898 20
Upvotes: 1
Reputation: 388807
Just unlist
and select unique
values
data.table::data.table(unique_col = unique(unlist(DT)))
# unique_col
#1: A
#2: B
#3: C
OR with your duplicated
approach it would be
DT_New <- unlist(DT)
data.table(unique_col = DT_New[!duplicated(DT_New)])
Upvotes: 1