Reputation: 3
I am having trouble inputting comma-delimited .txt data into R with the following format:
stock1,time1,price1,time2,price2,time3,price3
stock1,time4,price4
stock2,time1,price1,time2,price2
stock2,time3,price3
As seen above, the number of columns varies from row to row. I'd like to create a data frame with three columns (for stock, time, and price):
stock1 time1 price1
stock1 time2 price2
stock1 time3 price3
stock1 time4 price4
stock2 time1 price1
stock2 time2 price2
stock2 time3 price3
How can I split up each row so that I have the desired data frame?
I hope this is clear, thanks!
Upvotes: 0
Views: 43
Reputation: 35314
I got it to work using reshape(), but it's ugly:
input <- 'stock1,time1,price1,time2,price2,time3,price3\nstock1,time4,price4\nstock2,time1,price1,time2,price2\nstock2,time3,price3\n';
df <- read.csv(text=input,header=F,stringsAsFactors=F); ## replace text=input with file name
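long <- reshape(df, idvar='V1', direction='long',
    varying=lapply(2:3, seq, ncol(df), by=2L),
    new.row.names=seq_len(prod(dim(df))));
## drop the generated time-index column, rename the columns, and discard the empty padding rows
subset(setNames(long[-2L], c('stock','time','price')), time!='');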
##    stock  time  price
## 1 stock1 time1 price1
## 2 stock1 time4 price4
## 3 stock2 time1 price1
## 4 stock2 time3 price3
## 5 stock1 time2 price2
## 7 stock2 time2 price2
## 9 stock1 time3 price3
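To make the varying and new.row.names arguments in the call above easier to follow, here is a minimal sketch that evaluates them on their own, using the same sample input. The names n_long and n_names are introduced only for this illustration and are not part of the answer's code.

```r
input <- 'stock1,time1,price1,time2,price2,time3,price3\nstock1,time4,price4\nstock2,time1,price1,time2,price2\nstock2,time3,price3\n'
df <- read.csv(text = input, header = FALSE, stringsAsFactors = FALSE)

# One index vector per wide column set:
# the time columns (2, 4, 6) first, then the price columns (3, 5, 7).
varying <- lapply(2:3, seq, ncol(df), by = 2L)
print(varying)

# Rows in the stacked result before filtering:
# one per input row for each time/price pair (hypothetical helper name).
n_long <- nrow(df) * length(varying[[1]])

# Length of the vector passed as new.row.names (hypothetical helper name).
n_names <- prod(dim(df))

# The row-name vector is at least as long as the stacked result.
print(n_names >= n_long)
```

Because every row contributes the stock column plus at least one time/price pair, prod(dim(df)) should always be at least as large as the number of stacked rows, which is why it works as a safe upper bound here.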
Notes:
The varying argument must specify the set of wide column sets that must be stacked into long columns in the result frame. If there are multiple such wide column sets, the argument value must be a list with each component being a specification of a wide column set to stack. I use a concise trick with lapply() to run seq() once for each of the time and price wide column sets, passing it the starting column as the first argument from (2L for time and 3L for price), and then ncol(df) as to and 2L as by, which produces the required list value.

The new.row.names argument is necessary because the input data has duplicates in the idvar (stock) column. Unfortunately, although reshape()'s algorithm works with duplicate ids when transforming from wide to long format, it snatches defeat from the jaws of victory by trying to set row names on the result frame, which fails because of the duplicates. I first tried specifying NULL as the argument value, hoping that would prevent setting row names entirely, but it had no effect; thus, we need to compute an actual vector of unique row names to keep it happy, and that vector must be sufficiently long to cover the result frame. I think seq_len(prod(dim(df))) is a reasonable solution that guarantees sufficiency.
Upvotes: 0