Inputting Data and Splitting Rows in R with Non-Standard Columns

Question

I am having trouble inputting comma-delimited .txt data into R with the following format:

stock1,time1,price1,time2,price2,time3,price3
stock1,time4,price4
stock2,time1,price1,time2,price2
stock2,time3,price3

As seen above, the number of columns in each row is not standard. I'd like to create a data frame with three columns (for stock, time, and price):

stock1 time1 price1
stock1 time2 price2
stock1 time3 price3
stock1 time4 price4
stock2 time1 price1
stock2 time2 price2
stock2 time3 price3

How can I split up each row so that I have the desired data frame?

I hope this is clear, thanks!

bgoldst · Accepted Answer

I got it to work using reshape(), but it's ugly:

input <- 'stock1,time1,price1,time2,price2,time3,price3
stock1,time4,price4
stock2,time1,price1,time2,price2
stock2,time3,price3
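'
## replace text = input with the file name when reading from disk
df <- read.csv(text = input, header = FALSE, stringsAsFactors = FALSE)
## stack the time columns (2, 4, 6) and the price columns (3, 5, 7) into long format
long <- reshape(df, idvar = 'V1', direction = 'long',
                varying = lapply(2:3, seq, ncol(df), by = 2L),
                new.row.names = seq_len(prod(dim(df))))
## drop the index column added by reshape(), rename, and remove the empty padding rows
subset(setNames(long[-2L], c('stock', 'time', 'price')), time != '')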
##    stock  time  price
## 1 stock1 time1 price1
## 2 stock1 time4 price4
## 3 stock2 time1 price1
## 4 stock2 time3 price3
## 5 stock1 time2 price2
## 7 stock2 time2 price2
## 9 stock1 time3 price3
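
The rows above come out grouped by column pair rather than by stock, whereas the question lists them stock by stock. A minimal sketch of one way to get that order is to sort the result afterwards. The names long and res are only illustrative, and sorting on the time column as text is an assumption that happens to work for the placeholder values time1 to time4; real timestamps would probably need to be parsed first.

```r
input <- 'stock1,time1,price1,time2,price2,time3,price3
stock1,time4,price4
stock2,time1,price1,time2,price2
stock2,time3,price3
'
df <- read.csv(text = input, header = FALSE, stringsAsFactors = FALSE)

# Same reshape as in the answer, split into steps.
long <- reshape(df, idvar = 'V1', direction = 'long',
                varying = lapply(2:3, seq, ncol(df), by = 2L),
                new.row.names = seq_len(prod(dim(df))))
res <- subset(setNames(long[-2L], c('stock', 'time', 'price')), time != '')

# Assumption: the time values sort correctly as plain strings.
res <- res[order(res$stock, res$time), ]
rownames(res) <- NULL  # renumber the rows 1..n after sorting
print(res)
```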

Notes:

When going from wide to long format, the varying argument must specify the groups of wide columns that are to be stacked into long columns in the result frame. If there is more than one such group, the argument value must be a list in which each component specifies one group of wide columns to stack. I use a concise trick with lapply() to run seq() once for each of the time and price groups, passing it the starting column as the first argument, from (2L for time and 3L for price), and then ncol(df) as to and 2L as by, which produces the required list value.
The new.row.names argument is necessary because the input data has duplicates in the idvar (stock) column, and unfortunately, although reshape()'s algorithm works with duplicate ids when transforming from wide to long format, it snatches defeat from the jaws of victory by trying to set row names on the result frame, which fails because of the duplicates. I first tried specifying NULL as the argument value, hoping that would prevent setting row names entirely, but it had no effect; thus, we need to compute an actual vector of unique row names to keep it happy, and that vector must be sufficiently long to cover the result frame. I think seq_len(prod(dim(df))) is a reasonable solution that guarantees sufficiency: the result frame has nrow(df) rows for each time/price pair, and the number of pairs is always smaller than ncol(df).
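
To make the two notes concrete, here is a small sketch of what the lapply() call and the row-name vector evaluate to. It hardcodes the dimensions of the sample data (4 rows, 7 columns) as an assumption instead of reading the file, and the variable names are only illustrative.

```r
# Assumption: a 4-row, 7-column frame, matching the sample input in the answer.
n_row <- 4L
n_col <- 7L

# lapply() calls seq(2L, n_col, by = 2L) and then seq(3L, n_col, by = 2L).
varying <- lapply(2:3, seq, n_col, by = 2L)
time_cols  <- varying[[1]]  # columns 2, 4, 6
price_cols <- varying[[2]]  # columns 3, 5, 7

# Rows reshape() produces before the empty padding rows are filtered out:
# one copy of every input row per time/price pair.
rows_needed <- n_row * length(time_cols)

# Number of row names supplied through new.row.names.
rows_supplied <- length(seq_len(n_row * n_col))

print(varying)
print(c(needed = rows_needed, supplied = rows_supplied))
```

With these dimensions, 12 row names are needed and 28 are supplied, so the vector is longer than required but never too short.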
