Reputation: 3502
I have this data.frame
DATA
df <- data.frame(id=c(rep("site1", 3), rep("site2", 8), rep("site3", 9), rep("site4", 15)),
major_rock = c("greywacke", "mudstone", "gravel", "greywacke", "gravel", "mudstone", "gravel", "mudstone", "mudstone",
"conglomerate", "gravel", "mudstone", "greywacke","conglomerate", "gravel", "gravel", "greywacke","gravel",
"greywacke", "gravel", "mudstone", "greywacke", "gravel", "gravel", "gravel", "conglomerate", "greywacke",
"coquina", "gravel", "gravel", "greywacke", "gravel", "mudstone","mudstone", "gravel"),
minor_rock = c("sandstone mudstone basalt chert limestone", "limestone", "sand silt clay", "sandstone mudstone basalt chert limestone",
"sand silt clay", "sandstone conglomerate coquina tephra", NA, "limestone", "mudstone sandstone coquina limestone",
"sandstone mudstone limestone", "sand loess silt", "sandstone conglomerate coquina tephra", "sandstone mudstone basalt chert limestone",
"sandstone mudstone limestone", "sand loess silt", "loess silt sand", "sandstone mudstone conglomerate chert limestone basalt",
"sand silt clay", "sandstone mudstone conglomerate", "loess sand silt", "sandstone conglomerate coquina tephra", "sandstone mudstone basalt chert limestone",
"sand loess silt", "sand silt clay", "loess silt sand", "sandstone mudstone limestone", "sandstone mudstone conglomerate chert limestone basalt",
"limestone", "loess sand silt", NA, "sandstone mudstone conglomerate", "sandstone siltstone mudstone limestone silt lignite", "limestone",
"mudstone sandstone coquina limestone", "mudstone tephra loess"),
area_ha = c(1066.68, 7.59, 3.41, 4434.76, 393.16, 361.69, 306.75, 124.93, 95.84, 9.3, 8.45, 4565.89, 2600.44, 2198.52,
2131.71, 2050.09, 1640.47, 657.09, 296.73, 178.12, 10403.53, 8389.2, 8304.08, 3853.36, 2476.36, 2451.25,
1640.47, 1023.02, 532.94, 385.68, 296.73, 132.45, 124.93, 109.12, 4.87))
In it, there are 4 sites, 2 of them are independent (site1
and site3
; they don't include any site upstream) and 2 are dependent (site2
and site4
; they include upstream site(s))
I want to create a new data.frame, let's call it df_indep
. In which, I want all sites to be independent which means subtracting any upstream site(s) from the dependent sites as below
site1 and site3 will remain the same as they are independent
site2 (independent) = site2 - site1
site4 (independent) = site4 -(site2+site3)
Below is df
only for major_rock
and minor_rock
combinations that have area_percent
larger than 15% (before subtracting upstream sites; site2
and site3
)
library(dplyr)
head(df %>% group_by(id) %>%
mutate(area_percent = area_ha/sum(area_ha)*100) %>%
filter(area_percent>5),15)
# id major_rock minor_rock area_ha area_percent
# <fctr> <fctr> <fctr> <dbl> <dbl>
#1 site1 greywacke sandstone mudstone basalt chert limestone 1066.68 98.979289
#2 site2 greywacke sandstone mudstone basalt chert limestone 4434.76 77.329604
#3 site2 gravel sand silt clay 393.16 6.855592
#4 site2 mudstone sandstone conglomerate coquina tephra 361.69 6.306845
#5 site2 gravel NA 306.75 5.348848
#6 site3 mudstone sandstone conglomerate coquina tephra 4565.89 27.978879
#7 site3 greywacke sandstone mudstone basalt chert limestone 2600.44 15.934986
#8 site3 conglomerate sandstone mudstone limestone 2198.52 13.472099
#9 site3 gravel sand loess silt 2131.71 13.062701
#10 site3 gravel loess silt sand 2050.09 12.562550
#11 site3 greywacke sandstone mudstone conglomerate chert limestone basalt 1640.47 10.052479
#12 site4 mudstone sandstone conglomerate coquina tephra 10403.53 25.925869
#13 site4 greywacke sandstone mudstone basalt chert limestone 8389.20 20.906106
#14 site4 gravel sand loess silt 8304.08 20.693984
#15 site4 gravel sand silt clay 3853.36 9.602674
and here is the
FINAL RESULT
I want after subtracting upstream sites
# id major_rock minor_rock area_ha area_percent
#1 site1 greywacke sandstone mudstone basalt chert limestone 1066.68 98.979289
#2 site2 greywacke sandstone mudstone basalt chert limestone 3368.08 72.319849
#3 site2 gravel sand silt clay 389.75 8.368762
#4 site2 mudstone sandstone conglomerate coquina tephra 361.69 7.766254
#5 site2 gravel NA 306.75 6.586576
#6 site3 mudstone sandstone conglomerate coquina tephra 4565.89 27.978879
#7 site3 greywacke sandstone mudstone basalt chert limestone 2600.44 15.934986
#8 site3 conglomerate sandstone mudstone limestone 2198.52 13.472099
#9 site3 gravel sand loess silt 2131.71 13.062701
#10 site3 gravel loess silt sand 2050.09 12.562550
#11 site3 greywacke sandstone mudstone conglomerate chert limestone basalt 1640.47 10.052479
#12 site4 mudstone sandstone conglomerate coquina tephra 5475.95 30.297305
#13 site4 greywacke sandstone mudstone basalt chert limestone 1354.00 7.491403
#14 site4 gravel sand loess silt 6163.92 34.103701
#15 site4 gravel sand silt clay 2803.11 15.509031
I will appreciate any suggestions on how to do this in R.
UPDATE
This is a map showing all 4 sites
The figure below shows site4
(accumulative as in df
) and site1
(independent) as I want in the final output after subtracting site2
and site3
The figure below shows the same for site2 (accumulative) and indepenendent
Regarding @rbierman's question of how sites dependcies are encoded, please check below.
# id dependent dep_site1 dep_site2 dep_site3
#1 site1 no no no no
#2 site1 no no no no
#3 site1 no no no no
#4 site2 yes yes no no
#5 site2 yes yes no no
#6 site2 yes yes no no
#7 site2 yes yes no no
#8 site2 yes yes no no
#9 site2 yes yes no no
#10 site2 yes yes no no
#11 site2 yes yes no no
#12 site3 no no no no
#13 site3 no no no no
#14 site3 no no no no
#15 site3 no no no no
#16 site3 no no no no
#17 site3 no no no no
#18 site3 no no no no
#19 site3 no no no no
#20 site3 no no no no
#21 site4 yes yes yes yes
#22 site4 yes yes yes yes
#23 site4 yes yes yes yes
#24 site4 yes yes yes yes
#25 site4 yes yes yes yes
#26 site4 yes yes yes yes
#27 site4 yes yes yes yes
#28 site4 yes yes yes yes
#29 site4 yes yes yes yes
#30 site4 yes yes yes yes
#31 site4 yes yes yes yes
#32 site4 yes yes yes yes
#33 site4 yes yes yes yes
#34 site4 yes yes yes yes
#35 site4 yes yes yes yes
Upvotes: 3
Views: 125
Reputation: 146224
This isn't too bad, just a little renaming and joining.
First we'll want dependencies in a nice two-column format. You can use reshape2::melt
or tidyr::gather
on the wide dependencies you posted to make them long:
deps = data.frame(
id = c("site2", "site4", "site4"),
dependency = c("site1", "site2", "site3"),
stringsAsFactors = FALSE
)
# id dependency
# 1 site2 site1
# 2 site4 site2
# 3 site4 site3
Using dplyr
for joins, we'll also want character
rather than factor
columns just in case the levels aren't all the same.
library(dplyr)
df = mutate_at(df, .cols = c("id", "major_rock", "minor_rock"), .funs = funs(as.character))
First we create a "dependency with measure" data frame that has a clearly dependent name for the area and id (edit) and then we aggregate it to the id
level, summing the dependent areas:
dep_w_measure = df %>%
select(dependency = id, major_rock, minor_rock, dep_area = area_ha) %>%
inner_join(deps) %>%
group_by(id, major_rock, minor_rock) %>%
summarize(dep_area = sum(dep_area))
Then we join that to the original data, and subtract off the dependent area (where present):
result = left_join(df, dep_w_measure, by = c("major_rock", "minor_rock", "id")) %>%
mutate(area_ind = area_ha - coalesce(dep_area, 0))
head(result)
# id major_rock minor_rock area_ha dep_area area_ind
# 1 site1 greywacke sandstone mudstone basalt chert limestone 1066.68 NA 1066.68
# 2 site1 mudstone limestone 7.59 NA 7.59
# 3 site1 gravel sand silt clay 3.41 NA 3.41
# 4 site2 greywacke sandstone mudstone basalt chert limestone 4434.76 1066.68 3368.08
# 5 site2 gravel sand silt clay 393.16 3.41 389.75
# 6 site2 mudstone sandstone conglomerate coquina tephra 361.69 NA 361.69
I left the dep_area
and area_ha
columns in to "show my work", you can clean it up as needed. The independent area area_ind
column matches the area_ha
in your desired output.
Upvotes: 1