s_scolary

Reputation: 1399

Create prediction grid from dplyr piping

I'm hoping someone has a solution for using some form of expand.grid inside a dplyr pipeline. I am doing some modeling where I have a few different groups (Type below), and the groups cover different ranges of x and y. Once I fit a GAM to the data, I want to plot the predictions, but I only want to predict over the range that each group occupies, not over the whole range of the data set.

I already have a working example, posted below, but I'm wondering whether there is a way to do this without the loop.

Cheers

library(ggplot2)
library(dplyr)

# Create some data
df  = data.frame(Type = rep(c("A","B"), each = 100),
                 x = c(rnorm(100, 0, 1), rnorm(100, 2, 1)),
                 y = c(rnorm(100, 0, 1), rnorm(100, 2, 1)))

# and if you want to check out the data
ggplot(df,aes(x,y,col=Type)) + geom_point() + stat_ellipse()

# OK so I have no issue extracting the minimum and maximum values 
# for each type
df_summ = df %>%
  group_by(Type) %>%
  summarize(xmin = min(x),
            xmax = max(x),
            ymin = min(y),
            ymax = max(y))
df_summ

# and I can create a loop and use the expand.grid function to get my 
# desired output
test = NULL
for(ii in c("A","B")){
  df1 = df_summ[df_summ$Type == ii,]
  x = seq(df1$xmin, df1$xmax, length.out = 10)
  y = seq(df1$ymin, df1$ymax, length.out = 10)
  coords = expand.grid(x = x, y = y)
  coords$Type = ii
  test = rbind(test, coords)
}

ggplot(test, aes(x,y,col = Type)) + geom_point()

What I would really like is to bypass the loop and get the same output directly from the pipeline. I've tried several combinations using the do() function, to no avail; the one below is just one of many failed attempts:

df %>%
  group_by(Type) %>%
  summarize(xmin = min(x),
            xmax = max(x),
            ymin = min(y),
            ymax = max(y)) %>%
  do(data.frame(x = seq(xmin, xmax, length.out = 10),
                y = seq(ymin, ymax, length.out = 10)))

# this last line returns an error
# Error in is.finite(from) : 
#   default method not implemented for type 'closure'

Upvotes: 3

Views: 827

Answers (2)

bschneidr

Reputation: 6277

Using the data_grid function from the modelr package, here's one way to do it:

library(dplyr)
library(modelr)

df %>%
  group_by(Type) %>%
  data_grid(x, y) %>%
  ggplot(aes(x, y, color = Type)) + geom_point()

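Within each group, this approach builds every combination of the x values and y values observed in that group. Each x-y pair in the resulting data frame therefore uses only values that actually appear in the data. With the example data (100 distinct x values and 100 distinct y values per group), that gives 10,000 rows per group rather than an evenly spaced 10 x 10 grid.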
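If an evenly spaced grid over each group's range is wanted instead, as in the question's loop, modelr's seq_range() can be passed to data_grid(). This is a minimal sketch, assuming the grouped data_grid() call behaves per group in your installed versions of modelr and tidyr; the set.seed() call and the name grid are illustrative and not part of the original question.

```r
library(dplyr)
library(modelr)

set.seed(42)  # illustrative seed so the example is reproducible
df <- data.frame(Type = rep(c("A", "B"), each = 100),
                 x = c(rnorm(100, 0, 1), rnorm(100, 2, 1)),
                 y = c(rnorm(100, 0, 1), rnorm(100, 2, 1)))

# seq_range(x, 10) gives 10 evenly spaced values from min(x) to max(x).
# Because df is grouped, the ranges are computed separately for each Type.
grid <- df %>%
  group_by(Type) %>%
  data_grid(x = seq_range(x, 10), y = seq_range(y, 10)) %>%
  ungroup()
```

The result should hold a 10 x 10 grid for each Type, which can be plotted with the same ggplot() call as above.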

Upvotes: 1

MrFlick

Reputation: 206546

Your do() attempt was almost right. The trick is to re-group after summarize(), which drops the grouping here. You also need to pull the values from the data in the chain with .$, because do() does not look up bare column names. Try this:

test <- df %>%
  group_by(Type) %>%
  summarize(xmin = min(x),
            xmax = max(x),
            ymin = min(y),
            ymax = max(y)) %>%
  group_by(Type) %>%
  do(expand.grid(x = seq(.$xmin, .$xmax, length.out = 10),
                 y = seq(.$ymin, .$ymax, length.out = 10)))
ggplot(test, aes(x,y,col = Type)) + geom_point()
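
The summarize() step can also be skipped by computing the ranges inside the per-group call. The sketch below uses group_modify(), which newer dplyr releases provide as a successor to do(). It assumes a dplyr version that includes group_modify(); the seed and the name grid are illustrative only.

```r
library(dplyr)

set.seed(42)  # illustrative seed so the example is reproducible
df <- data.frame(Type = rep(c("A", "B"), each = 100),
                 x = c(rnorm(100, 0, 1), rnorm(100, 2, 1)),
                 y = c(rnorm(100, 0, 1), rnorm(100, 2, 1)))

# .x is the subset of rows for one Type (without the Type column).
# The function returns a 10 x 10 grid spanning that group's x and y ranges,
# and group_modify() adds the Type column back when combining the groups.
grid <- df %>%
  group_by(Type) %>%
  group_modify(~ expand.grid(x = seq(min(.x$x), max(.x$x), length.out = 10),
                             y = seq(min(.x$y), max(.x$y), length.out = 10),
                             KEEP.OUT.ATTRS = FALSE)) %>%
  ungroup()
```

The result has 100 rows per Type and can be passed to ggplot() in the same way as test.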

Upvotes: 2
