Reputation: 51
I am trying to reorganize some raw data into a more condense form. Currently the data looks like the below output from the R code. I would like the final output to have columns for time, ID, and all possible desired prices. Then, I want each ID to have only one row for each time with the quantity number put in at the different desired prices(so how many an ID wants at a particular price during this time). So for example, a particular ID might have a quantity of 1 at 100 and quantity of 2 at 101. If it is a buy, then the value should be negative and if it is a sell then positive. For example, -1 for buy at 100 and 2 for sell at 101.
I originally tried doing it through a double for loop with the first loop being time and then the second loop being the ID. Then I was able to look at the quantity column and desired price for an ID and put them into a vector. Afterwards, I combined all the vectors together for that time and then repeated this. When I tried to use this in practice, it was not feasible because the code was too slow as there are hundreds of IDs and thousands of times. Can someone help me do this in a faster and cleaner way?
set.seed(1)
time <- rep(seq(1,5), , each = 15)
id <- sample(342:450,75,replace = TRUE)
price <- sample(99:103,75,replace = TRUE)
Desire.Price <- sample(97:105,75,replace = TRUE)
quantity <- sample(1:4,75,replace = TRUE)
data <- data.frame(time = time, id = id,price = price, Desire.Price = Desire.Price,quantity = quantity)
data$buysell <- 0
data$buysell <- ifelse( data$Desire.Price <= data$price, "BUY","SELL")
I expect the final data set would look something like this.
Final.df <- data.frame(time=NA,id=NA,"97" = NA,"98"=NA ,"99"=NA,"100"=NA,"101"=NA,"102"=NA,"103"=NA
,"104"=NA,"105"=NA)
It would basically condense the original raw data to have all the information for a particular ID in a row during each time period.
Edit: If an ID did not get sampled in that time (for example ID 342 is not in time 1) they should have a row of NA in that time period( So ID 342 would have a row of NA in time 1). I edited the code that generates the samples to have more ids to reflect this( So that they can't all possibility be sampled in every time period).
Upvotes: 1
Views: 123
Reputation: 66490
Here's a tidyverse approach. First, make quantity signed based on BUY/SELL, then sum quantity for each id / time / Desire.Price, then spread those into wide format with a column for each Desire.Price.
library(dplyr); library(tidyr)
data %>%
mutate(quantity_signed = if_else(buysell == "BUY", -quantity, quantity)) %>%
count(id, time, Desire.Price, wt = quantity_signed) %>%
complete(id, time) %>% # EDIT to bring in all times for all id's
spread(Desire.Price, n) %>% View("output")
Upvotes: 3
Reputation: 141
I think this approach is simple comparatively.
# Code
library(reshape2)
#Turning BUY quantity values negative.
data[which(data$buysell=="BUY"),]$quantity <- -(data[which(data$buysell=="BUY"),]$quantity)
#Using dcast function to achieve desired columns.
final.df <- dcast(data,time + id~Desire.Price ,fun=sum,value.var='quantity')
Upvotes: 1