Reputation: 99
I'm currently working on a large data set (ca. 30k rows), And I'm working on creating a hedonic
regression. The next step would be to create weekly dummy variables.
Now my data has assigned weekly number depending on the day the data was measured. There are 50 different weeks (1-52, 2 missing unaccounted). These weekly numbers are repeated until the change after about 10 rows,
however they are also recurring, as new product categories are measures. There are 132 are available in the dataset and one category contains between 100 - 300 rows
.
This is an example of the dataset
UPC Weeks
1111112016 1
1111112016 1
1111112016 2
1111112016 2
1111112016 3
1111112016 3
1111112440 1
1111112440 1
1111112440 2
1111112440 2
1111112440 3
1111112440 3
Now to create dummy variables, I created 50 columns, each having about 30k rows to represent the dataset. I would like to assign 1 to the row of the dummy week whenever dummy week (hence the column name) and real week (row of the orig. dataset) are equal.
Example Dummy (DW = Dummy Week):
DW1 DW2
NA NA
NA NA
NA NA
I tried the following:
for (i in 1:seq(Soap$WEEK)){
if Soap$WEEK[i] == seq(from=1, by=1, to=52){
for (j in names(x)){
x$DW[[j]] = 1
else {
x$DW[[j]] = 0
}}}}
I know it is wrong, however I'm unable to resolve my problem. I would appreciate any help in this matter.
Upvotes: 3
Views: 2376
Reputation: 24198
We can use model.matrix()
from the stats
package to dummify your data. First, we'll need to convert Weeks
to a factor
column.
df$Weeks <- as.factor(df$Weeks)
Now we can run model.matrix()
:
model.matrix(~ Weeks + UPC + 0, data = df)
# Weeks1 Weeks2 Weeks3 UPC
#1 1 0 0 1111112016
#2 1 0 0 1111112016
#3 0 1 0 1111112016
#4 0 1 0 1111112016
#5 0 0 1 1111112016
#6 0 0 1 1111112016
#7 1 0 0 1111112440
#8 1 0 0 1111112440
#9 0 1 0 1111112440
#10 0 1 0 1111112440
#11 0 0 1 1111112440
#12 0 0 1 1111112440
You can also just use model.matrix(~ . + 0 , data = df)
, as numeric columns will be automatically passed over. The + 0
in the formula avoids replacing the first level by the Intercept
. To see the difference try to run it without 0
.
Alternatively, you can also use dummyVars
from the caret
package. Here, no Intercept
is the default behaviour:
library(caret)
dm <- dummyVars(" ~ .", data = df)
data.frame(predict(dm, newdata = df))
# UPC Weeks.1 Weeks.2 Weeks.3
#1 1111112016 1 0 0
#2 1111112016 1 0 0
#3 1111112016 0 1 0
#4 1111112016 0 1 0
#5 1111112016 0 0 1
#6 1111112016 0 0 1
#7 1111112440 1 0 0
#8 1111112440 1 0 0
#9 1111112440 0 1 0
#10 1111112440 0 1 0
#11 1111112440 0 0 1
#12 1111112440 0 0 1
Upvotes: 6
Reputation: 4187
You can solve this by using sapply
and comparing the values of the Weeks
column with th numeric part of the dummy column names which you can extract with substr
.
On your example dataset:
# create the dummy columns and fill them with NA's
dat[, paste0('DW', 1:3)] <- NA
# compare the values in 'Weeks' with the numeric part of the column names
dat[, 3:5] <- sapply(names(dat)[3:5], function(x) as.integer(substr(x,3,3) == dat$Weeks))
the result:
> dat
UPC Weeks DW1 DW2 DW3
1 1111112016 1 1 0 0
2 1111112016 1 1 0 0
3 1111112016 2 0 1 0
4 1111112016 2 0 1 0
5 1111112016 3 0 0 1
6 1111112016 3 0 0 1
7 1111112440 1 1 0 0
8 1111112440 1 1 0 0
9 1111112440 2 0 1 0
10 1111112440 2 0 1 0
11 1111112440 3 0 0 1
12 1111112440 3 0 0 1
Upvotes: 0