OverFlow Police
OverFlow Police

Reputation: 861

create data table by cartesian product (type) of vectors

My question has two parts. First, lets say I have long vectors like years, cities, and temp below, and I want to create a data table that includes combinations of all the three vectors. For simplicity I create a short example here, by cheating and using cbind:

years <- rep(c(1950:1951), 2)
cities <- c(rep("LA", 2), rep("NYC", 2))
temp <- c(10, 20, NA, 40)
dat_tb <- cbind(years, cities, temp) %>% data.table()

dat_tb
  years cities temp
1:  1950     LA   10
2:  1951     LA   20
3:  1950    NYC <NA>
4:  1951    NYC   40

My question is that if the vectors are large, I have to compute number of repetitions I had created above, to enable me to use cbind. How can I create a data table like dat_tb above just using the vectors below, without using cbind

years = c(1950, 1951)
cities = c("LA", "NYC")
temp = (10, 20, NA, 40)

The second part of my question is that lets say I accidentally dropped the NA values like so

dat_tb <- na.omit(dat_tb)

Now, I do not know where the NA was, so, I use the answer of the question in the first part, to initiate a data table that looks like:

   years cities temp
1:  1950     LA 666
2:  1951     LA 666
3:  1950    NYC 666
4:  1951    NYC 666

How can I convert the 666's using the data tale I have, after dropping the NA's:

   years cities temp
1:  1950     LA   10
2:  1951     LA   20
3:  1951    NYC   40

to go back to the data table with missing value

   years cities temp
1:  1950     LA   10
2:  1951     LA   20
3:  1950    NYC  666
4:  1951    NYC   40

Upvotes: 2

Views: 116

Answers (2)

Cole
Cole

Reputation: 11255

One can directly make a table with data.table:

library(data.table)

years <- rep(c(1950:1951), 2)
cities <- c(rep("LA", 2), rep("NYC", 2))
temp <- c(10, 20, NA, 40)

data.table(years, cities, temp)

   years cities temp
1:  1950     LA   10
2:  1951     LA   20
3:  1950    NYC   NA
4:  1951    NYC   40

It also works when you have unequal vector length. Shorter vectors are repeated until they equal the length of the longest one. This means that it's not an exact match because of how it repeats (see NYC has the NA in the first example while LA does here):

years = c(1950, 1951)
cities = c("LA", "NYC")
temp = (10, 20, NA, 40)
data.table(years, cities, temp)

   years cities temp
1:  1950     LA   10
2:  1951    NYC   20
3:  1950     LA   NA
4:  1951    NYC   40

For your second question, I don't follow, but you determine missing rows with an anti-join.

dat_tb2 <- na.omit(copy(dat_tb))

dat_tb[!dat_tb2, on = .(years,cities)]

   years cities temp
1:  1950    NYC <NA>

union(dat_tb2, dat_tb[!dat_tb2, on = .(years,cities)])
   years cities temp
1:  1950     LA   10
2:  1951     LA   20
3:  1951    NYC   40
4:  1950    NYC <NA>

Of course, you lose the row order but then again you should be careful before randomly omitting data :)

Upvotes: 1

akrun
akrun

Reputation: 887108

We an use CJ

library(data.table)
CJ(years, cities, temp)

Upvotes: 3

Related Questions