Reputation: 548
I don't understand spatial data at all. I have been studying it, but I'm missing something.
What I have: a data.frame enterprises
with the columns id, parent_subsidiary and city_cod.
What I need: the mean and the max distance from the parent's city to the subsidiary cities.
Ex:
id | mean_dist | max_dist
1111 | 25km | 50km
232 | 110km | 180km
333 | 0km | 0km
What I did:
library("tidyverse")
library("sf")
# library("brazilmaps") not working anymore
library("geobr")
parent <- enterprises %>% filter(parent_subsidiary==1)
subsidiary <- enterprises %>% filter(parent_subsidiary==2)
# Cities - polygons
m_city_br <- read_municipality(code_muni="all", year=2019)
# or shp_city<- st_read("/BR_Municipios_2019.shp")
# data.frame with the column geom
map_parent <- left_join(parent, m_city_br, by=c("city_cod"="code_muni"))
map_subsidiary <- left_join(subsidiary, m_city_br, by=c("city_cod"="code_muni"))
st_distance(map_parent$geom[1],map_subsidiary$geom[2]) %>% units::set_units(km)
# it took a long time and the result is different from google.maps
# is it ok?!
# To do it by id -- I also got stuck here
distance_p_s <- data.frame(id=character(),subsidiary=numeric(),mean_dist=numeric(),max_dist=numeric())
id_v <- as.vector(parent$id)
for (i in seq_along(id_v)){
  test_p <- map_parent %>% filter(id==id_v[i])
  test_s <- map_subsidiary %>% filter(id==id_v[i])
  total <- 0
  max_d <- 0  # named max_d so it does not mask base::max
  l <- nrow(test_s)
  for (j in seq_len(l)){  # seq_len() skips the loop when a parent has no subsidiaries
    value <- as.numeric(round(st_distance(test_p$geom[1],test_s$geom[j]) %>% units::set_units(km),2))
    total <- total + value
    if (value>max_d) max_d <- value
  }
  mean_dist <- if (l>0) total/l else 0  # 0 when a parent has no subsidiaries
  done <- data.frame(id=id_v[i],subsidiary=l,mean_dist=round(mean_dist,2),max_dist=max_d)
  distance_p_s <- rbind(distance_p_s,done)
}
Is it right? Can I calculate the centroid of the cities and then calculate the distance?
I noticed that the distance from code_muni==4111407 to code_muni==4110102 is 0, even though they are different cities (Imbituva, PR, Brasil and Ivaí, PR, Brasil). Why?
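A note on that zero: st_distance between two polygons returns the shortest distance between the two shapes, so two municipalities that share a border are 0 apart, which is presumably what happens for these two cities. A minimal sketch with two hypothetical squares (stand-ins, not the real geobr geometries, and with no CRS, so the numbers are plain planar units rather than km):

```r
library(sf)

# Two hypothetical unit squares that share the edge x = 1.
# They stand in for neighbouring municipalities; they are not geobr data.
sq_a <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))
sq_b <- st_polygon(list(rbind(c(1, 0), c(2, 0), c(2, 1), c(1, 1), c(1, 0))))
polys <- st_sfc(sq_a, sq_b)  # no CRS: plain planar coordinates

# Polygon-to-polygon distance is the shortest gap between the shapes,
# which is zero when they touch.
d_poly <- st_distance(polys[1], polys[2])

# Centroid-to-centroid distance is measured between the two middle points.
cents <- st_centroid(polys)
d_cent <- st_distance(cents[1], cents[2])

print(d_poly)
print(d_cent)
```

So measuring between centroids (st_centroid) instead of between polygons should give a non-zero distance for neighbouring cities.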
Data example: structure(list(id = c("1111", "1111", "1111", "1111", "232", "232", "232", "232", "3123", "3123", "4455", "4455", "686", "333", "333", "14112", "14112", "14112", "3633", "3633"), parent_subsidiary = c("1","2", "2", "2", "1", "2", "2", "2", "1", "2", "1", "2", "1", "2", "1", "1", "2", "2", "1", "2"), city_cod = c(4305801L,4202404L, 4314803L, 4314902L, 4318705L, 1303403L, 4304507L, 4314100L, 2408102L, 3144409L, 5208707L, 4205407L, 5210000L, 3203908L, 3518800L, 3118601L, 4217303L, 3118601L, 5003702L, 5205109L)), row.names = c(NA, 20L), class = "data.frame")
PS: these are Brazilian cities, see https://github.com/ipeaGIT/geobr/tree/master/r-package
Upvotes: 0
Views: 77
Reputation: 548
I ended up doing something like this:
distance_p_s <- data.frame(id=character(),
                           qtd_subsidiary=numeric(),
                           dist_min=numeric(),
                           dist_media=numeric(),
                           dist_max=numeric())
# named id_v so that filter(id==id_v[i]) compares the id column with this vector
id_v <- as.vector(mparentid$id)
for (i in seq_along(id_v)){
  message("Filtering id: ",id_v[i]," (",i," of ",length(id_v),")")
  teste_m <- mparentid %>% filter(id==id_v[i]) %>% st_as_sf()
  teste_f <- msubsidiaryid %>% filter(id==id_v[i]) %>% st_as_sf()
  # distances are measured between city centroids rather than polygons
  teste_f <- st_centroid(teste_f)
  teste_m <- st_centroid(teste_m)
  teste_f <- st_transform(teste_f, 4674)
  teste_m <- st_transform(teste_m, 4674)
  total <- 0
  dist_min <- 0  # named dist_min/dist_max so they do not mask base::min/max
  dist_max <- 0
  l <- nrow(teste_f)
  for (j in seq_len(l)){
    message("Processing id: ",id_v[i]," (",i," of ",length(id_v),"), subsidiary: ",j," of ",l)
    value <- as.numeric(round(st_distance(teste_m$geom[1],teste_f$geom[j]) %>% units::set_units(km),2))
    total <- total + value
    if (value>dist_max) dist_max <- value
    if (j==1 || value<dist_min) dist_min <- value
  }
  dist_med <- total/l
  done <- data.frame(id=id_v[i],qtd_subsidiary=l,dist_min=dist_min,dist_media=round(dist_med,2),dist_max=dist_max)
  distance_p_s <- rbind(distance_p_s,done)
  message("Done id: ",id_v[i]," (",i," of ",length(id_v),"), subsidiaries: ",l)
}
Probably this is not the best way, but it solved my problem for now.
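For comparison, here is a rough vectorised sketch of the same idea without the explicit loops. It is only a sketch under assumptions: the toy enterprises table and its lon/lat columns are made up and stand in for the centroids of the real geobr municipalities, and idx, dist_km and result are names introduced here for illustration.

```r
library(sf)
library(dplyr)

# Hypothetical data: lon/lat stand in for the centroids of the real cities.
enterprises <- data.frame(
  id = c("A", "A", "A", "B"),
  parent_subsidiary = c("1", "2", "2", "1"),
  lon = c(-51, -51, -50, -51),
  lat = c(-30, -29, -30, -30)
)
pts <- st_as_sf(enterprises, coords = c("lon", "lat"), crs = 4674)

parents <- filter(pts, parent_subsidiary == "1")
subs <- filter(pts, parent_subsidiary == "2")

# For every subsidiary row, find the row of its parent (matched on id).
idx <- match(subs$id, parents$id)

# by_element = TRUE gives one distance per subsidiary/parent pair
# instead of a full distance matrix.
d <- st_distance(st_geometry(subs), st_geometry(parents)[idx], by_element = TRUE)
subs$dist_km <- as.numeric(units::set_units(d, km))

# One row per id with the number of subsidiaries, mean and max distance.
result <- subs %>%
  st_drop_geometry() %>%
  group_by(id) %>%
  summarise(subsidiaries = n(),
            mean_dist = mean(dist_km),
            max_dist = max(dist_km))
print(result)
```

In this sketch a parent without subsidiaries (id "B") simply drops out of result; it would have to be joined back in if a row of zeros is wanted for it.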
Upvotes: 0
Reputation: 1388
Great problem. I looked at it for a little while, then came back and looked some more after thinking about it. I did not calculate the mean; I only determined the distances from each parent to its subsidiaries.
I bound the cities data to the data frame, then mutated the new df to add the centroid of each city's geometry.
I split the df by id, which resulted in a list of 8 df's, each containing one parent with its related subsidiaries (1:4, 1:3, 1:4, 1:2, ...).
A loop with a function cleaned up the 8 df's and calculated the distance from each parent to each of its subsidiaries.
I checked the distances in the first df of the list against distances from a website, and they were nearly identical.
The output is shown at [link]
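Since the code itself is only linked, the split-and-loop steps described above could look roughly like the sketch below. This is not the code behind the link: the toy points, the lon/lat columns and the helper name dist_to_parent are assumptions made up for illustration.

```r
library(sf)

# Hypothetical points standing in for the city centroids.
pts <- st_as_sf(
  data.frame(
    id = c("A", "A", "A", "B"),
    parent_subsidiary = c("1", "2", "2", "1"),
    lon = c(-51, -51, -50, -51),
    lat = c(-30, -29, -30, -30)
  ),
  coords = c("lon", "lat"), crs = 4674
)

# One sf data frame per id, each holding a parent and its subsidiaries.
by_id <- split(pts, pts$id)

# Hypothetical helper: distances in km from the parent to each subsidiary.
dist_to_parent <- function(df) {
  parent <- df[df$parent_subsidiary == "1", ]
  subs <- df[df$parent_subsidiary == "2", ]
  if (nrow(parent) == 0 || nrow(subs) == 0) return(numeric(0))
  d <- st_distance(parent[1, ], subs)  # 1 x n matrix with units
  as.numeric(units::set_units(d, km))
}

dists <- lapply(by_id, dist_to_parent)
print(dists)
```

The mean and max per id would then follow from sapply(dists, mean) and sapply(dists, max), keeping in mind that an id without subsidiaries yields an empty vector.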
Upvotes: 1