Lotte Victor

Reputation: 89

using tidyverse to simultaneously filter, summarize and put result in new column of the same data frame at specified position

I'm trying to add the result of a summary statistic to a tibble at a specified position in another column of the same tibble, after filtering the data for the subset I want to summarize. The tibble contains results of a spatially explicit simulation in a landscape of grid cells. It has columns specifying the row and the column of each cell in the landscape, and a column for the results. What I want to do is take a target grid cell, e.g. row = 2, col = 2, and calculate the variance over the target cell and its eight surrounding cells. The result should be stored in a new column of the data frame, in the row of the target grid cell. Filtering the data to select only the nine cells I'm interested in works fine, but storing the result in the new column at the specific location does not. I need a general solution, since I want to loop through all the grid cells (given by row and col) and also through several tibbles containing similar data for different landscapes. I'm putting example code here; my real dataset is much larger.

data= tibble(row=c(1,1,1, 2,2,2, 3,3,3), col=c(1,2,3, 1,2,3, 1,2,3), x=c(0.5, 0.5, 0.5, 0.4, 0.4, 0.4, 0.3, 0.3, 0.3), cluster_var=0)
> data
# A tibble: 9 x 4
    row   col     x cluster_var
  <dbl> <dbl> <dbl>       <dbl>
1     1     1   0.5           0
2     1     2   0.5           0
3     1     3   0.5           0
4     2     1   0.4           0
5     2     2   0.4           0
6     2     3   0.4           0
7     3     1   0.3           0
8     3     2   0.3           0
9     3     3   0.3           0

Let's say this is the tibble containing my results. Now I want to select the target grid cell and its eight neighbouring cells, e.g. row=2, col=2, and calculate the variance of x for those nine cells, so I did this:

i_row=2
i_col=2

  data%>%filter(row==(i_row-1) | row == (i_row+1) | row==i_row) %>% 
  filter(col==(i_col-1) | col==(i_col+1) | col==i_col) %>% 
  summarise(var(x))
# A tibble: 1 x 1
  `var(x)`
     <dbl>
1   0.0075
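As a side note, the two chained filters can be collapsed into a single condition on the distance from the target cell. This is only a minimal sketch, assuming dplyr is loaded; `neighbourhood_var` and the column name `v` are names made up for the illustration:

```r
library(dplyr)

data <- tibble(
  row = rep(1:3, each = 3),
  col = rep(1:3, times = 3),
  x   = c(0.5, 0.5, 0.5, 0.4, 0.4, 0.4, 0.3, 0.3, 0.3)
)

i_row <- 2
i_col <- 2

# Keep the cells whose row and column are each at most one step
# away from the target cell, then take the variance of x
neighbourhood_var <- data %>%
  filter(abs(row - i_row) <= 1, abs(col - i_col) <= 1) %>%
  summarise(v = var(x)) %>%
  pull(v)

neighbourhood_var
```

`pull()` returns the value as a plain number rather than a 1 x 1 tibble, which makes it easier to assign somewhere else afterwards.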

Now I would like to store it in data$cluster_var in the row where row=2 and col=2, so the resulting tibble would be:

> data
# A tibble: 9 x 4
    row   col     x cluster_var
  <dbl> <dbl> <dbl>       <dbl>
1     1     1   0.5           0
2     1     2   0.5           0
3     1     3   0.5           0
4     2     1   0.4           0
5     2     2   0.4           0.0075
6     2     3   0.4           0
7     3     1   0.3           0
8     3     2   0.3           0
9     3     3   0.3           0

Of course, I would need to loop through all possible values of row and col to fill in the whole cluster_var column, and the real dataset is really big, so I can't do it manually. I tried with mutate, but it did not work as I wanted.

data%>%
  mutate(., cluster_var[row==i_row, col==i_col] = 
  filter(row==(i_row-1) | row == (i_row+1) | row==i_row) %>% 
  filter(col==(i_col-1) | col==(i_col+1) | col==i_col) %>% 
  summarise(var(x)))
Error: unexpected '=' in "data%>%
  mutate(., cluster_var[row==i_row, col==i_col] ="
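The error is a parse error: in a function call, the left-hand side of `=` has to be an argument name, so `cluster_var[...] = ...` is rejected before mutate even runs. One possible way around it for a single target cell is to compute the value first and then overwrite only the matching row. A minimal sketch, assuming dplyr; the intermediate name `v` is made up for the illustration:

```r
library(dplyr)

data <- tibble(
  row = rep(1:3, each = 3),
  col = rep(1:3, times = 3),
  x   = c(0.5, 0.5, 0.5, 0.4, 0.4, 0.4, 0.3, 0.3, 0.3),
  cluster_var = 0
)

i_row <- 2
i_col <- 2

# Step 1: variance over the 3x3 neighbourhood of the target cell
v <- data %>%
  filter(abs(row - i_row) <= 1, abs(col - i_col) <= 1) %>%
  summarise(v = var(x)) %>%
  pull(v)

# Step 2: replace cluster_var only where row and col match the target,
# leaving every other row unchanged
data <- data %>%
  mutate(cluster_var = if_else(row == i_row & col == i_col, v, cluster_var))
```

The base R equivalent of step 2 would be `data$cluster_var[data$row == i_row & data$col == i_col] <- v`.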

At the moment, I'm at a loss and would be grateful for help!

Edit: some more information about my data. The nine entries I'm interested in are not consecutive. When the target cell has row=2 and col=2, the neighbouring cells I'm interested in are (as [row, col]): [1,1], [1,2], [1,3], [2,1], [2,3], [3,1], [3,2], [3,3]. In the example data they are in consecutive rows, but my real data has 64 rows and 64 columns: the value of row is 1 in the first 64 rows of the tibble while col increases from 1 to 64, then row=2 and col is again 1:64, and so on, for 4096 rows in total. So the row numbers of the entries I want to summarize are not linked to the values in row or col.
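To make the layout concrete, here is a small sketch that rebuilds only the row and col columns of a 64 x 64 landscape in the order described and looks up where the neighbourhood of cell [2,2] ends up. The names `grid` and `neighbour_rows` are made up for the illustration:

```r
library(dplyr)

# row is constant for 64 consecutive entries while col runs from 1 to 64
grid <- tibble(
  row = rep(1:64, each = 64),
  col = rep(1:64, times = 64)
)

# Positions in the tibble of the target cell [2,2] and its eight neighbours
neighbour_rows <- which(abs(grid$row - 2) <= 1 & abs(grid$col - 2) <= 1)
neighbour_rows
# 1 2 3 65 66 67 129 130 131
```

The nine entries form three runs that are 64 positions apart, so any approach based on a window of consecutive row numbers will pick up the wrong cells.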

Upvotes: 0

Views: 673

Answers (2)

Lotte Victor

Reputation: 89

I found a solution that should work for most purposes. It's not solely tidyverse, but it does the job. The following code does what I want:

data=tibble(row=c(1,1,1, 2,2,2, 3,3,3), col=c(1,2,3, 1,2,3, 1,2,3), x=c(0.5, 0.5, 0.5, 0.4, 0.4, 0.4, 0.3, 0.3, 0.3))
cluster_var=numeric(nrow(data))

for(i in 1:max(data$row)){
  for(j in 1:max(data$col)){
    i_row=i
    i_col=j
    position=which(data$row==i_row & data$col==i_col)
    cluster_var_temp= as.numeric(data%>%
                                   filter(row==(i_row-1) | row == (i_row+1) | row==i_row) %>% 
                                   filter(col==(i_col-1) | col==(i_col+1) | col==i_col) %>% 
                                   summarise(var(x)))
    cluster_var[position]=cluster_var_temp

  }

} 

data=cbind(data, cluster_var)

> data
  row col   x cluster_var
1   1   1 0.5 0.003333333
2   1   2 0.5 0.003000000
3   1   3 0.5 0.003333333
4   2   1 0.4 0.008000000
5   2   2 0.4 0.007500000
6   2   3 0.4 0.008000000
7   3   1 0.3 0.003333333
8   3   2 0.3 0.003000000
9   3   3 0.3 0.003333333

Thanks everyone for the help! @Nirbhay Singh, you put me in the right direction. And maybe this helps people who search for this or something similar in the future.
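For the part of the question about several tibbles, the same calculation could be wrapped in a function and applied to a list of landscapes. This is a rough sketch rather than a tested solution for the real data, assuming dplyr and purrr; `add_cluster_var` and `landscapes` are hypothetical names, and the second landscape is just the example data with x doubled:

```r
library(dplyr)
library(purrr)

# Hypothetical helper: for every cell, variance of x over the cell
# and all existing neighbours within one row and one column
add_cluster_var <- function(df) {
  df %>%
    mutate(cluster_var = map2_dbl(row, col, function(r, k) {
      var(df$x[abs(df$row - r) <= 1 & abs(df$col - k) <= 1])
    }))
}

data <- tibble(
  row = rep(1:3, each = 3),
  col = rep(1:3, times = 3),
  x   = c(0.5, 0.5, 0.5, 0.4, 0.4, 0.4, 0.3, 0.3, 0.3)
)

result <- add_cluster_var(data)

# Several landscapes stored in a named list
landscapes <- list(a = data, b = mutate(data, x = x * 2))
results <- map(landscapes, add_cluster_var)
```

Each cell is compared against every other cell, so the work grows with the square of the number of cells; for 4096 cells that should still be manageable.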

Upvotes: 0

rj-nirbhay

Reputation: 689

As I understand it, you want to calculate the variance of nine values, including the target cell's value. The solution can be approached using the row index of the data frame and a unique key to identify the target cell. Below is a solution using a for loop and dplyr:

df= tibble(row=c(1,1,1, 2,2,2, 3,3,3), col=c(1,2,3, 1,2,3, 1,2,3), x=c(0.5, 0.5, 0.5, 0.4, 0.4, 0.4, 0.3, 0.3, 0.3), cluster_var=0)
l<-c() # empty vector which will be used for storing the variance values
df$RowNumber<- row.names(df) # getting index of row
df$key<-paste0(df$row,",",df$col) # generating key

keyList<- unique(df$key) # all unique key values; the loop runs over these

for(i in 1:length(keyList)){

  #cat("Running For:",i,'\n')
  rowIndx <- df %>% 
        filter(key==keyList[i]) %>% 
        select(RowNumber) %>%
        as.numeric()

  # row indices of a 9-row window centred on the target row
  # (assumes the neighbouring cells sit in consecutive rows of df)
  filterValues <- (rowIndx-4):(rowIndx+4)

 l[i]<- df %>% 
    filter(RowNumber %in% filterValues) %>% 
     summarise(.,cluster_var =  var(x))

}

df$cluster_var<- unlist(l) # adding calculated variance to data frame

This solution might not be the optimal one.
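An alternative that does not depend on the order of the rows is to key the neighbourhood on (row, col) directly and let a join do the lookup. A minimal sketch, assuming tidyr and dplyr 1.1.0 or later (for `cross_join`); the names `offsets`, `d_row`, `d_col`, `n_row`, `n_col` and `cluster` are made up for the illustration:

```r
library(dplyr)
library(tidyr)

data <- tibble(
  row = rep(1:3, each = 3),
  col = rep(1:3, times = 3),
  x   = c(0.5, 0.5, 0.5, 0.4, 0.4, 0.4, 0.3, 0.3, 0.3)
)

# The nine (d_row, d_col) offsets of a 3x3 neighbourhood, centre included
offsets <- expand_grid(d_row = -1:1, d_col = -1:1)

cluster <- data %>%
  select(row, col) %>%
  cross_join(offsets) %>%                       # needs dplyr >= 1.1.0
  mutate(n_row = row + d_row, n_col = col + d_col) %>%
  # look up x for each neighbour; offsets that fall outside
  # the landscape find no match and are dropped
  inner_join(data, by = c("n_row" = "row", "n_col" = "col")) %>%
  group_by(row, col) %>%
  summarise(cluster_var = var(x), .groups = "drop")

result <- left_join(data, cluster, by = c("row", "col"))
```

The joins are vectorised, so this avoids an explicit loop over cells, at the cost of building an intermediate table with nine rows per cell.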

Upvotes: 1
