Reputation: 89
I'm trying to add the result of a summary statistic to a tibble, at a specified position in another column of the same tibble, after filtering my data for the subset I want to summarize.
The tibble contains results of a spatially explicit simulation in a landscape of grid cells. It has one column each for the row and the column of the grid cell in the landscape, and a column for the results.
What I want to do is take a target grid cell, e.g. row = 2, col = 2, and calculate the variance of the target cell and its eight surrounding cells. The result should be stored in a new column of the tibble, in the row of the target grid cell. Filtering the data to select only the nine cells I'm interested in works fine, but storing the result in the new column at the specific location does not. I need a general solution, since I want to loop through all the grid cells (given by row and col) and also through several tibbles that contain similar data for different landscapes.
I'm putting example code here; my real dataset is much larger.
data= tibble(row=c(1,1,1, 2,2,2, 3,3,3), col=c(1,2,3, 1,2,3, 1,2,3), x=c(0.5, 0.5, 0.5, 0.4, 0.4, 0.4, 0.3, 0.3, 0.3), cluster_var=0)
> data
# A tibble: 9 x 4
row col x cluster_var
<dbl> <dbl> <dbl> <dbl>
1 1 1 0.5 0
2 1 2 0.5 0
3 1 3 0.5 0
4 2 1 0.4 0
5 2 2 0.4 0
6 2 3 0.4 0
7 3 1 0.3 0
8 3 2 0.3 0
9 3 3 0.3 0
Let's say this is the tibble containing my results. Now I want to select a target grid cell and its eight neighboring cells, e.g. row = 2, col = 2, and calculate the variance of x for those nine cells, so I did this:
i_row=2
i_col=2
data%>%filter(row==(i_row-1) | row == (i_row+1) | row==i_row) %>%
filter(col==(i_col-1) | col==(i_col+1) | col==i_col) %>%
summarise(var(x))
# A tibble: 1 x 1
`var(x)`
<dbl>
1 0.0075
Now I would like to store it in data$cluster_var, in the row where row = 2 and col = 2, so the resulting tibble would be:
> data
# A tibble: 9 x 4
row col x cluster_var
<dbl> <dbl> <dbl> <dbl>
1 1 1 0.5 0
2 1 2 0.5 0
3 1 3 0.5 0
4 2 1 0.4 0
5 2 2 0.4 0.0075
6 2 3 0.4 0
7 3 1 0.3 0
8 3 2 0.3 0
9 3 3 0.3 0
Of course, I would need to loop through all possible values of row and col to fill in the whole cluster_var column, and the real dataset is really big, so I can't do it manually.
I tried with mutate, but it did not work as I wanted.
data%>%
mutate(., cluster_var[row==i_row, col==i_col] =
filter(row==(i_row-1) | row == (i_row+1) | row==i_row) %>%
filter(col==(i_col-1) | col==(i_col+1) | col==i_col) %>%
summarise(var(x)))
Error: unexpected '=' in "data%>%
mutate(., cluster_var[row==i_row, col==i_col] ="
At the moment, I'm at a loss and would be grateful for help!
Edit: some more information about my data:
The nine entries I'm interested in are not consecutive. When the target grid cell has row = 2 and col = 2, the neighboring cells I'm interested in are [1,1], [1,2], [1,3], [2,1], [2,3], [3,1], [3,2] and [3,3]. In the example data they happen to sit in consecutive rows of the tibble, but my real data has 64 rows and 64 columns: row is 1 in the first 64 rows of the tibble while col runs from 1 to 64, then row = 2 and col again runs from 1 to 64, and so on, for 4096 rows in total. So the row number of the entries I want to summarize is not linked to the values in row or col.
Upvotes: 0
Views: 673
Reputation: 89
I found a solution that should work for most purposes. It's not purely tidyverse, but it does the job. The following code does what I want:
library(dplyr)

data=tibble(row=c(1,1,1, 2,2,2, 3,3,3), col=c(1,2,3, 1,2,3, 1,2,3), x=c(0.5, 0.5, 0.5, 0.4, 0.4, 0.4, 0.3, 0.3, 0.3))
cluster_var=numeric(nrow(data)) # one variance per row of data
for(i in 1:max(data$row)){
  for(j in 1:max(data$col)){
    i_row=i
    i_col=j
    # row of data that holds the target grid cell
    position=which(data$row==i_row & data$col==i_col)
    # variance of x over the target cell and its neighbors
    cluster_var_temp= as.numeric(data%>%
      filter(row==(i_row-1) | row == (i_row+1) | row==i_row) %>%
      filter(col==(i_col-1) | col==(i_col+1) | col==i_col) %>%
      summarise(var(x)))
    cluster_var[position]=cluster_var_temp
  }
}
data=cbind(data, cluster_var)
> data
row col x cluster_var
1 1 1 0.5 0.003333333
2 1 2 0.5 0.003000000
3 1 3 0.5 0.003333333
4 2 1 0.4 0.008000000
5 2 2 0.4 0.007500000
6 2 3 0.4 0.008000000
7 3 1 0.3 0.003333333
8 3 2 0.3 0.003000000
9 3 3 0.3 0.003333333
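The same calculation can probably be written without the explicit double loop. The following is only a sketch under my own assumptions: `neighbour_var` is a made-up helper name, and the neighborhood is selected with `abs()` on the row and col values, which should pick the same cells as the three-way `|` filters above.

```r
library(dplyr)

data <- tibble(
  row = rep(1:3, each = 3),
  col = rep(1:3, times = 3),
  x   = c(0.5, 0.5, 0.5, 0.4, 0.4, 0.4, 0.3, 0.3, 0.3)
)

# Hypothetical helper: variance of x over the cell (r0, c0) and all
# cells whose row and col differ from it by at most 1.
neighbour_var <- function(d, r0, c0) {
  var(d$x[abs(d$row - r0) <= 1 & abs(d$col - c0) <= 1])
}

# Apply the helper once per grid cell and store the result in the
# row of data that belongs to that cell.
result <- data %>%
  mutate(cluster_var = mapply(function(r0, c0) neighbour_var(data, r0, c0),
                              row, col))
```

Because the selection depends only on the values in row and col, it should not matter in which order the grid cells appear in the tibble.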
Thanks everyone for the help! @Nirbhay Singh, you put me in the right direction. And maybe this helps people who search for this or something similar in the future.
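For the single-cell case from the question, `mutate()` only accepts `name = value` pairs, so an indexing expression such as `cluster_var[...]` cannot stand on the left of `=`. A minimal sketch of one way around this, assuming `if_else()` is acceptable for picking the target row (the variable name `v` is my own):

```r
library(dplyr)

data <- tibble(
  row = rep(1:3, each = 3),
  col = rep(1:3, times = 3),
  x   = c(0.5, 0.5, 0.5, 0.4, 0.4, 0.4, 0.3, 0.3, 0.3),
  cluster_var = 0
)
i_row <- 2
i_col <- 2

# Variance of x over the target cell and its neighbors.
v <- data %>%
  filter(abs(row - i_row) <= 1, abs(col - i_col) <= 1) %>%
  summarise(v = var(x)) %>%
  pull(v)

# Overwrite cluster_var only in the row of the target cell;
# every other row keeps its current value.
data <- data %>%
  mutate(cluster_var = if_else(row == i_row & col == i_col, v, cluster_var))
```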
Upvotes: 0
Reputation: 689
As I understand it, you want to calculate the variance of nine values, including the target cell value. This can be approached using the row index of the data frame and a unique key to find the target cell. Below is a solution using a for loop and dplyr:
library(dplyr)

df= tibble(row=c(1,1,1, 2,2,2, 3,3,3), col=c(1,2,3, 1,2,3, 1,2,3), x=c(0.5, 0.5, 0.5, 0.4, 0.4, 0.4, 0.3, 0.3, 0.3), cluster_var=0)
l<-numeric(nrow(df)) # vector which will be used for storing the variance values
df$RowNumber<- seq_len(nrow(df)) # numeric index of each row
df$key<-paste0(df$row,",",df$col) # generating key
keyList<- unique(df$key) # all unique values of key; the loop runs over these
for(i in seq_along(keyList)){
  #cat("Running For:",i,'\n')
  rowIndx <- df %>%
    filter(key==keyList[i]) %>%
    pull(RowNumber)
  # row indices of a window of nine consecutive rows centred on the target row
  filterValues <- (rowIndx-4):(rowIndx+4)
  l[i]<- df %>%
    filter(RowNumber %in% filterValues) %>%
    summarise(cluster_var = var(x)) %>%
    pull(cluster_var)
}
df$cluster_var<- l # adding calculated variance to data frame
This solution might not be the optimal one.
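If the neighboring cells are not in consecutive rows of the data frame, one option is to index by the row and col values instead of the row number. The following is a rough sketch, assuming every grid cell occurs exactly once and that row and col are integers starting at 1; the matrix `m` and the argument names `r0` and `c0` are my own.

```r
library(dplyr)

df <- tibble(
  row = rep(1:3, each = 3),
  col = rep(1:3, times = 3),
  x   = c(0.5, 0.5, 0.5, 0.4, 0.4, 0.4, 0.3, 0.3, 0.3)
)

# Put x into a matrix so that m[r, c] is the value of grid cell (r, c),
# regardless of the order of the rows in df.
m <- matrix(NA_real_, nrow = max(df$row), ncol = max(df$col))
m[cbind(df$row, df$col)] <- df$x

# For each cell, take the block of up to 3 x 3 cells around it,
# clamped at the landscape border, and compute the variance.
df$cluster_var <- mapply(function(r0, c0) {
  rows <- max(1, r0 - 1):min(nrow(m), r0 + 1)
  cols <- max(1, c0 - 1):min(ncol(m), c0 + 1)
  var(as.vector(m[rows, cols]))
}, df$row, df$col)
```

Border cells have fewer than eight neighbors, so their variance is computed over four or six values only.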
Upvotes: 1