How to create a sampled choice set for a discrete-choice model using data.table?

Question

I want to estimate a discrete choice model. I have a dataset with people, their current choices at t_1, their choices at t_2, and all possible choices. Because the universe of possible choices is too large, I need to sample so that each person has 30 alternatives in their choice set. The sampling must be without replacement, so that no individual has duplicate options in the choice set. Both the actual choice at t_2 and the choice at t_1 need to be part of the choice set. Right now I'm trying something like this, with fictional data:

library(data.table)
#Create the fictional data up to the current choice.
choices<-c(1:100) #vector of possible choices   
people<-data.frame(ID=1:10)
setDT(people,key="ID")
people[,"current_choice":=sample(choices,1),by="ID"] #what the person uses now
people[,"chosen":=sample(choices,1),by="ID"] #what the person actually picked at t_2



#expand the dataset to be 30 rows per person and create a choice ID
people<-people[rep(1:.N,30),]
setDT(people,key="ID")    
people[,"choice_id":=seq_len(.N), by="ID"]

#The current choice at t_1 needs to be in the choice set
people[choice_id==1,"choice_set":=current_choice,by="ID"]

#The actual choice needs to be in the choice set
people[choice_id==2&current_choice!=chosen,"choice_set":= chosen,by="ID"]

#I want the remaining choices to be sampled from the vector of choices, but here is where I'm stuck
people[is.na(choice_set),"choice_set":=sample(choices,1),by="ID"]

That last line doesn't prevent duplicate choices within each individual, including duplicating the current and the chosen alternatives.

I have thought about using expand.grid to create all combinations of current choices and potential choices, assigning a random uniform number to them, assigning an even larger number for the rows that have either the current choice or the actual choice, sorting, and then keeping the top 30 rows. The problem is that I run out of memory with the actual 10000 people and 50000 choices.

How should I approach this?
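
One possible approach, given as a sketch rather than a definitive answer: build each person's choice set in a single grouped step, so the full people-by-choices cross product is never materialised. The forced alternatives (chosen and current) go in first, and the remainder is drawn without replacement from the choices that are left. The names n_alt, forced, pool and choice_sets are made up for this illustration, and the seed is only there to make the toy data reproducible.

```r
library(data.table)

set.seed(1)                 # assumption: fixed seed, only for reproducibility
choices <- 1:100            # vector of possible choices
n_alt   <- 30L              # hypothetical name: alternatives per person

# Fictional data: one row per person
people <- data.table(ID = 1:10)
people[, current_choice := sample(choices, .N, replace = TRUE)]  # choice at t_1
people[, chosen         := sample(choices, .N, replace = TRUE)]  # choice at t_2

# One grouped pass: forced alternatives first, then a sample of the rest
choice_sets <- people[, {
  forced <- unique(c(chosen, current_choice))   # 1 or 2 values per person
  pool   <- setdiff(choices, forced)            # everything still eligible
  # index with sample.int() so a length-1 pool is not misread by sample()
  .(choice_set = c(forced, pool[sample.int(length(pool), n_alt - length(forced))]))
}, by = .(ID, current_choice, chosen)]

choice_sets[, choice_id := seq_len(.N), by = ID]
```

With 10000 people and 30 alternatives each, the result has 300000 rows. The only large object touched per person is the vector of choices itself.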

EDIT: After the first answer by Matt, I still run into issues with repeated alternatives in the choice set. I have been trying to solve them with:

library(data.table)
#Create the fictional data up to the current choice.
choices<-c(1:100) #vector of possible choices   
people<-data.frame(ID=1:10)
setDT(people,key="ID")
people[,current_choice:=sample(choices,1),by= .(ID)] #what the person uses now
people[,chosen:= sample(choices,1),by= .(ID)] #what the person actually picked at t_2

#expand the dataset to be 30 rows per person and create a choice ID
people<-people[rep(1:.N,30),]
setDT(people,key="ID")    
people[,choice_id:=seq_len(.N), by=.(ID)]

#The chosen alternative has to be in the choice set
people[choice_id==1L,choice_set:=chosen,by=.(ID) ]
people

#The current chosen alternative has to be in the choice set
people[current_choice!=chosen&choice_id==2L,choice_set:=current_choice,by=.(ID) ]
people

people[is.na(choice_set), choice_set := sample(setdiff(choices,unique(choice_set)), .N), by = .(ID)]

The problem then is that I introduce a missing value (NA) for those individuals who picked their current choice from t_1 again at t_2.
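
A possible way to patch the fill step, offered as a sketch under assumptions. When i is combined with by, data.table evaluates j on the filtered rows only, so inside the is.na(choice_set) subset the column choice_set contains nothing but NA, and setdiff() excludes nothing. Excluding chosen and current_choice explicitly avoids that, and filling .N rows per person also covers the row left empty when current_choice equals chosen. The name pool is hypothetical, and the line that forces person 1 to repeat their choice is there only to exercise that case.

```r
library(data.table)

set.seed(42)                # assumption: fixed seed, only for reproducibility
choices <- 1:100
people <- data.table(ID = 1:10)
people[, current_choice := sample(choices, .N, replace = TRUE)]
people[, chosen         := sample(choices, .N, replace = TRUE)]
people[1L, current_choice := chosen]   # illustration only: a person who repeats

# Expand to 30 rows per person and number the rows
people <- people[rep(seq_len(.N), each = 30L)]
people[, choice_id := seq_len(.N), by = ID]

# Forced alternatives go in rows 1 and 2 (row 2 stays NA for repeaters)
people[choice_id == 1L, choice_set := chosen]
people[choice_id == 2L & current_choice != chosen, choice_set := current_choice]

# Fill every remaining NA, excluding both forced alternatives explicitly
people[is.na(choice_set), choice_set := {
  pool <- setdiff(choices, c(chosen[1L], current_choice[1L]))
  pool[sample.int(length(pool), .N)]   # .N is 28 or 29 here, drawn without replacement
}, by = ID]
```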

