Tex
Tex

Reputation: 950

Use zoo read and split a data frame over a column

I have a table containing observations on scores of restaurants(identified by ID). The variable mean is the mean rating of reviews received in a week-long window centered on each day (i.e. from 3 days before till 3 days later), and the variable count is the number of reviews received in the same window (see the code below for a dput of a randomly-generated sample of my data frame).

I am interested in looking at those restaurants that contain big spikes in either variable (like all of a sudden their mean rating goes up by a lot, or drops suddenly). For those restaurants, I would like to investigate what's going on by plotting the distribution (I have lots of restaurants so I can't do it manually and I have to restrict my domain for semi-manual inspection).

Also, since my data is day-by-day, I would like it to be less granular. In particolar, I want to average all the ratings or counts for a given month in a single value.

I think zoo should help me do it nicely: given the data frame in the example, I think I can convert it to a zoo time series which is aggregate the way I want and split the way I want by using:

z <- read.zoo(df, split = "restaurantID", 
       format = "%m/%d/%Y", index.column = 2, FUN = as.yearmon, aggregate = mean)

however, splitting on restaurantID does not yield the expected result. What I get instead is lots of NAs:

     mean.1006054 count.1006054 mean.1006639 count.1006639 mean.1006704 count.1006704 mean.1007177 count.1007177
Lug 2004           NA            NA           NA            NA           NA            NA           NA            NA
Ago 2004           NA            NA           NA            NA           NA            NA           NA            NA
Nov 2004           NA            NA           NA            NA           NA            NA           NA            NA
Gen 2005           NA            NA           NA            NA           NA            NA           NA            NA
Feb 2005           NA            NA           NA            NA           NA            NA           NA            NA
Mar 2005           NA            NA           NA            NA           NA            NA           NA            NA
         mean.1007296 count.1007296 mean.1007606 count.1007606 mean.1007850 count.1007850 mean.1008272 count.1008272
Lug 2004           NA            NA           NA            NA           NA            NA           NA            NA
Ago 2004           NA            NA           NA            NA           NA            NA           NA            NA
Nov 2004           NA            NA           NA            NA           NA            NA           NA            NA
Gen 2005           NA            NA           NA            NA           NA            NA           NA            NA
Feb 2005           NA            NA           NA            NA           NA            NA           NA            NA
Mar 2005           NA            NA           NA            NA           NA            NA           NA            NA

Note that it works if I don't split it on the restaurantID column.

df$website <- NULL
> z <- read.zoo(df, format = "%m/%d/%Y", index.column = 2, FUN = as.yearmon, aggregate = mean)
> head(z)
         restaurantID     mean count
Lug 2004      1418680 3.500000     1
Ago 2004      1370457 5.000000     1
Nov 2004      1324645 4.333333     1
Gen 2005      1425933 1.920000     1
Feb 2005      1315289 3.000000     1
Mar 2005      1400577 2.687500     1

Also, plot.zoo(z) works but of course the produced graph has no meaning for me.

My questions are:

1) How can I filter the restaurants that have the higher "month-month" spikes in either column?

2) How can I split on restaurantID and plot the time series of only such restaurants?

DATA HERE (wouldn't fit SO's word limit)

Upvotes: 2

Views: 899

Answers (2)

Lucas Fortini
Lucas Fortini

Reputation: 2460

The robfilter r package was developed to filter time series data to pick out outliers based on robust statistics methods for time series analysis. You can use the adore.filter function to fit a pattern to the data and then pick the outliers that deviate far from the signal.

Upvotes: 0

Gary Weissman
Gary Weissman

Reputation: 3627

Try:

# helper function to calculate change per time interval in a sequence
difflist <- function(v) {rr <- 0; for (i in 2:length(v)) {rr <- c(rr, v[i] - v[i-1])}; return(rr) }

# make center as dates
df$center <- as.Date(df$center,format='%m/%d/%Y')

# sort data frame in time order
df <- df[order(df$restaurantID, df$center),]

# now calculate the change in each column
deltas <- ddply(df, .(restaurantID), function(x) {cbind(center = x$center, delta_mean = difflist(x$mean), delta_count = difflist(x$count)) } )

# filter out only the big spikes
deltas_big <- subset(deltas, delta_mean > 2 | delta_count > 3)

# arrange the data
delta_melt <- melt(deltas_big,id.vars=c('restaurantID','center'))

# now plot by time
ggplot(delta_melt, aes(x=center,y=value,color=variable)) + geom_point()

Upvotes: 0

Related Questions