Reputation: 950

Use zoo read and split a data frame over a column

I have a table containing observations on scores of restaurants(identified by ID). The variable mean is the mean rating of reviews received in a week-long window centered on each day (i.e. from 3 days before till 3 days later), and the variable count is the number of reviews received in the same window (see the code below for a dput of a randomly-generated sample of my data frame).

I am interested in looking at those restaurants that contain big spikes in either variable (like all of a sudden their mean rating goes up by a lot, or drops suddenly). For those restaurants, I would like to investigate what's going on by plotting the distribution (I have lots of restaurants so I can't do it manually and I have to restrict my domain for semi-manual inspection).

Also, since my data is day-by-day, I would like it to be less granular. In particolar, I want to average all the ratings or counts for a given month in a single value.

I think zoo should help me do it nicely: given the data frame in the example, I think I can convert it to a zoo time series which is aggregate the way I want and split the way I want by using:

z <- read.zoo(df, split = "restaurantID", 
       format = "%m/%d/%Y", index.column = 2, FUN = as.yearmon, aggregate = mean)

however, splitting on restaurantID does not yield the expected result. What I get instead is lots of NAs:

     mean.1006054 count.1006054 mean.1006639 count.1006639 mean.1006704 count.1006704 mean.1007177 count.1007177
Lug 2004           NA            NA           NA            NA           NA            NA           NA            NA
Ago 2004           NA            NA           NA            NA           NA            NA           NA            NA
Nov 2004           NA            NA           NA            NA           NA            NA           NA            NA
Gen 2005           NA            NA           NA            NA           NA            NA           NA            NA
Feb 2005           NA            NA           NA            NA           NA            NA           NA            NA
Mar 2005           NA            NA           NA            NA           NA            NA           NA            NA
         mean.1007296 count.1007296 mean.1007606 count.1007606 mean.1007850 count.1007850 mean.1008272 count.1008272
Lug 2004           NA            NA           NA            NA           NA            NA           NA            NA
Ago 2004           NA            NA           NA            NA           NA            NA           NA            NA
Nov 2004           NA            NA           NA            NA           NA            NA           NA            NA
Gen 2005           NA            NA           NA            NA           NA            NA           NA            NA
Feb 2005           NA            NA           NA            NA           NA            NA           NA            NA
Mar 2005           NA            NA           NA            NA           NA            NA           NA            NA

Note that it works if I don't split it on the restaurantID column.

df$website <- NULL
> z <- read.zoo(df, format = "%m/%d/%Y", index.column = 2, FUN = as.yearmon, aggregate = mean)
> head(z)
         restaurantID     mean count
Lug 2004      1418680 3.500000     1
Ago 2004      1370457 5.000000     1
Nov 2004      1324645 4.333333     1
Gen 2005      1425933 1.920000     1
Feb 2005      1315289 3.000000     1
Mar 2005      1400577 2.687500     1

Also, plot.zoo(z) works but of course the produced graph has no meaning for me.

My questions are:

1) How can I filter the restaurants that have the higher "month-month" spikes in either column?

2) How can I split on restaurantID and plot the time series of only such restaurants?

DATA HERE (wouldn't fit SO's word limit)

Upvotes: 2

Use zoo read and split a data frame over a column

Answers (2)

Related Questions