Kenji

Reputation: 581

How to create a sampled choice set for a discrete-choice model using data.table?

I want to estimate a discrete choice model. I have a dataset with people, their current choices at t_1, their choices at t_2, and all possible choices. Since the universe of possible choices is too large, I need to sample so that each person has 30 choices in their choice set. The sampling has to be without replacement, so no individual may have duplicate options in the choice set. Both the actual choice at t_2 and the choice at t_1 need to be part of the choice set. Right now I'm trying something like this, with fictional data:

library(data.table)
#Create the fictional data up to the current choice.
choices<-c(1:100) #vector of possible choices   
people<-data.frame(ID=1:10)
setDT(people,key="ID")
people[,"current_choice":=sample(choices,1),by="ID"] #what the person uses now
people[,"chosen":=sample(choices,1),by="ID"] #what the person actually picked at t_2



#expand the dataset to be 30 rows per person and create a choice ID
people<-people[rep(1:.N,30),]
setDT(people,key="ID")    
people[,"choice_id":=seq_len(.N), by="ID"]

#The current choice at t_1 needs to be in the choice set
people[1,"choice_set":=current_choice,by="ID"]

#The actual choice needs to be in the choice set
people[choice_id==2&current_choice!=chosen,"choice_set":= chosen,by="ID"]

#I want the remaining choices to be sampled from the vector of choices, but here is where I'm stuck
people[is.na(choice_set),"choice_set":=sample(choices,1),by="ID"]

That last line doesn't prevent duplicate choices within each individual, including duplicating the current and the chosen alternatives.

I have thought about using expand.grid to create all combinations of current choices and potential choices, assigning a random uniform number to them, assigning an even larger number for the rows that have either the current choice or the actual choice, sorting, and then keeping the top 30 rows. The problem is that I run out of memory with the actual 10000 people and 50000 choices.

How should I approach this?

EDIT: After the first answer by Matt, I still run into issues with repeated alternatives in the choice set. I have been trying to solve them with:

library(data.table)
#Create the fictional data up to the current choice.
choices<-c(1:100) #vector of possible choices   
people<-data.frame(ID=1:10)
setDT(people,key="ID")
people[,current_choice:=sample(choices,1),by= .(ID)] #what the person uses now
people[,chosen:= sample(choices,1),by= .(ID)] #what the person actually picked at t_2

#expand the dataset to be 30 rows per person and create a choice ID
people<-people[rep(1:.N,30),]
setDT(people,key="ID")    
people[,choice_id:=seq_len(.N), by=.(ID)]

#The chosen alternative has to be in the choice set
people[choice_id==1L,choice_set:=chosen,by=.(ID) ]
people

#The current chosen alternative has to be in the choice set
people[current_choice!=chosen&choice_id==2L,choice_set:=current_choice,by=.(ID) ]
people

people[is.na(choice_set), choice_set := sample(setdiff(choices,unique(choice_set)), .N), by = .(ID)]

The problem now is that I introduce a missing value (NA) in the choice set for individuals who picked their current choice from t_1 again at t_2.

Upvotes: 2

Views: 129

Answers (1)

Matt Summersgill

Reputation: 4242

Here's how I would approach the problem as I understand it, using 99% of the code you already presented (with some aesthetic syntax tweaks here and there, mostly removing unneeded quotes around column assignments and using data.table's handy .(...) syntax in the by statements to eliminate those quotes as well).

The main thing I think will help you is the setdiff() function from base R (see help file by running ?base::setdiff) to make sure that the current_choice and the chosen value are excluded from your sampling to fill in the remaining rows after you populate the first two.
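As a minimal illustration of what setdiff() does in this context, here is a sketch with toy values (the numbers are made up purely for the example and are not taken from your data):

```r
# Toy values, chosen only for illustration
choices <- 1:10
current_choice <- 3L
chosen <- 7L

# Alternatives still eligible for sampling once the two mandatory ones are removed
remaining <- setdiff(choices, c(current_choice, chosen))
remaining
# [1]  1  2  4  5  6  8  9 10

# Repeated values in the second argument are harmless, which matters later
# because current_choice and chosen are repeated on every row of a person
identical(remaining, setdiff(choices, rep(c(current_choice, chosen), 5)))
# [1] TRUE
```

Sampling from `remaining` instead of `choices` is what keeps the mandatory alternatives from being drawn a second time.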

library(data.table)
#Create the fictional data up to the current choice.
choices<-c(1:100) #vector of possible choices   
people<-data.frame(ID=1:10)
setDT(people,key="ID")
people[,current_choice:=sample(choices,1),by= .(ID)] #what the person uses now
people[,chosen := sample(choices,1),by= .(ID)] #what the person actually picked at t_2

#expand the dataset to be 30 rows per person and create a choice ID
people<-people[rep(1:.N,30),]
setDT(people,key="ID")    
people[,choice_id:=seq_len(.N), by=.(ID)]

#The current choice at t_1 needs to be in the choice set

## the `choice_id == 1L` is critical here, filtering by just `people[1, ...]` wasn't giving 
## the result you were actually going for
people[choice_id == 1L, choice_set := current_choice, by=.(ID)]

#The actual choice needs to be in the choice set
people[choice_id == 2L
       & current_choice != chosen, choice_set := chosen, by= .(ID)]

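## Use setdiff to make sure we sample the rest from a vector excluding the 
## `current_choice` and `chosen` values. Filtering on `is.na(choice_set)` rather
## than `choice_id > 2L` also fills row 2 for people whose `chosen` equals
## their `current_choice`, so nobody is left with a missing alternative
people[is.na(choice_set), choice_set := sample(setdiff(choices,c(current_choice,chosen)), .N), by = .(ID)]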
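If expanding to 30 rows first and then filling in gets awkward, one possible variant is to build each person's choice set in a single grouped step: put the mandatory alternatives first (deduplicated, so a person who re-picked their t_1 choice contributes only one), then top up from the remaining pool. This is only a sketch under assumptions: `set_size` and `choice_sets` are names introduced here for illustration, the seed is arbitrary, and one person is forced to repeat their choice just to exercise that case.

```r
library(data.table)

set.seed(1)            # arbitrary seed, only for reproducibility
choices  <- 1:100      # universe of alternatives
set_size <- 30L        # hypothetical name for the desired choice-set size

people <- data.table(ID = 1:10)
people[, current_choice := sample(choices, .N, replace = TRUE)]  # choice at t_1
people[, chosen := sample(choices, .N, replace = TRUE)]          # choice at t_2
people[1L, chosen := current_choice]  # force one "repeat" case for illustration

# One group per person: mandatory alternatives first, then a random top-up
choice_sets <- people[, {
  must <- unique(c(current_choice, chosen))   # length 1 if the choice was repeated
  pool <- setdiff(choices, must)              # everything still eligible
  # index with sample.int() so a length-one pool could never be misread as 1:pool
  .(choice_set = c(must, pool[sample.int(length(pool), set_size - length(must))]))
}, by = .(ID, current_choice, chosen)]
choice_sets[, choice_id := seq_len(.N), by = ID]

# Sanity checks: 30 rows per person and no duplicates within a person
choice_sets[, .(n = .N, dups = anyDuplicated(choice_set)), by = ID]
```

Because nothing larger than one person's pool is ever materialised, this avoids the people-by-choices grid that ran out of memory with expand.grid, although I have not timed it on 10000 people and 50000 choices.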

Upvotes: 1
