Reputation: 143
Apologies in advance if this has already been asked elsewhere.
I have multiple data frames (with 25 columns and > 1000 rows) like this :
> head(Amsterdam_C02 <- Amsterdam %>% filter(Chemicals == "CO2"))
Sample_ID Locality.Name Chemicals
1 VKB19xxxxxx Amsterdam CO2
2 VKB19xxxxxx Amsterdam CO2
3 VKB1xxxxxxx Amsterdam CO2
4 VKB1xxxxxxx Amsterdam CO2
5 VKB1xxxxxxx Amsterdam CO2
6 VKB1xxxxxxx Amsterdam CO2
End.Date Less.Than Activity.Value Measuring.Unit
1 2019-01-31 < 1.0714000 g/m³
2 2019-02-18 3.4609000 g/m³
3 2019-02-28 < 0.7020623 g/m³
4 2019-04-25 4.5563282 g/m³
5 2019-05-20 1.6000000 g/m³
6 2019-05-22 < 0.6000000 g/m³
I wish to obtain the mean, max, min and sd of Activity.Value, classified by month, and taken into account the values only when Less.Than is not "<" (the "<" means the value is below the detection limit and will be not kept for statistics). This example displays one or two value(s) per month, but there are hundreds. So, R must return the 4 statistics of Activity.value each month (in a vector or whatever). If there no value above the detection limit for one specific month, then R must return "-" for the 4 statistics.
In addition, I would like that R returns the mean of all the values during the whole year (not classified per month) that have not been into account because Less.Than is "<".
I made different attempts but nothing worked correctly and I prefer to ask for your help.
If Less.Than == "<"
???
To filter per month I have already tried %>% filter(grepl("2019-01")
12 times but if possible, I prefer to avoid doing it manually because I have other data frames with similar analysis to perform on it.
Upvotes: 0
Views: 90
Reputation: 4522
Unfortunately the number of data you have is very limited. I have taken the second of the data from above - as the first half don't need to be considered for your question.
The column Less.Than is changed with mutate
so that there are all NA's introduced where the < is missing. As your data frame is very small, I have added one new data row. The result for sd
shows NA because there are not enough data, as you can see by the number of data n
in the summary.
Then filter all rows which have NA in End.Date, group by month and use summarise
from dplyr
.
class(df)
. You may also have a look here.
Then I have made both choices. The one is filtering all the NA's in Less.Than. These are the ones which do not have a "<". And the second does the opposite.
All grouped by month. Please, keep in mind that I have added one row to the data to get at least one time a sd.
library(tidyverse)
df <- tribble(
~End.Date, ~Less.Than, ~Activity.Value,
'2019-01-31', '<' , 1.0714000,
'2019-02-18', '' , 3.4609000,
'2019-02-28', '<' , 0.7020623,
'2019-04-25', '' , 4.5563282,
'2019-05-20', '' , 1.6000000,
'2019-05-22', '<' , 0.6000000,
'2019-05-22', '<' , 0.7000000
)
df$End.Date <- as.Date(df$End.Date)
df
#> # A tibble: 7 x 3
#> End.Date Less.Than Activity.Value
#> <date> <chr> <dbl>
#> 1 2019-01-31 "<" 1.07
#> 2 2019-02-18 "" 3.46
#> 3 2019-02-28 "<" 0.702
#> 4 2019-04-25 "" 4.56
#> 5 2019-05-20 "" 1.6
#> 6 2019-05-22 "<" 0.6
#> 7 2019-05-22 "<" 0.7
# here you can see that the df is a data.frame
class(df)
#> [1] "tbl_df" "tbl" "data.frame"
df %>%
mutate(Less.Than = ifelse(Less.Than != '<', NA, Less.Than)) %>%
# what follows filters the rows which contain NA
dplyr::filter(is.na(Less.Than)) %>%
group_by(months(End.Date)) %>%
summarise(
sum = sum(Activity.Value),
min = min(Activity.Value),
sd = sd(Activity.Value),
n = n())
#> # A tibble: 3 x 5
#> `months(End.Date)` sum min sd n
#> <chr> <dbl> <dbl> <dbl> <int>
#> 1 April 4.56 4.56 NA 1
#> 2 Februar 3.46 3.46 NA 1
#> 3 Mai 1.6 1.6 NA 1
df %>%
mutate(Less.Than = ifelse(Less.Than != '<', NA, Less.Than)) %>%
# what follows filters the rows which DO NOT contain NA
# or in your words these rows possess a "<"
dplyr::filter(!is.na(Less.Than)) %>%
group_by(months(End.Date)) %>%
summarise(
sum = sum(Activity.Value),
min = min(Activity.Value),
sd = sd(Activity.Value),
n = n())
#> # A tibble: 3 x 5
#> `months(End.Date)` sum min sd n
#> <chr> <dbl> <dbl> <dbl> <int>
#> 1 Februar 0.702 0.702 NA 1
#> 2 Januar 1.07 1.07 NA 1
#> 3 Mai 1.30 0.6 0.0707 2
Created on 2020-06-15 by the reprex package (v0.3.0)
Upvotes: 1
Reputation: 21400
You can use aggregate
:
Reproducible data:
df <- data.frame(
Date = c("2019-01-31", "2019-02-18", "2019-02-28", "2019-04-25", "2019-05-20", "2019-05-02"),
Less.than = c("", "<", "", "<", "", ""),
Activity.level = c(1.0714000, 3.4609000, 0.7020623, 0.7020623, 4.5563282, 1.6000000))
Solution to obtain mean:
aggregate(df$Activity.level[!df$Less.than=="<"], by = list(sub("-(\\d+)-", "-\\1-", df$Date[!df$Less.than=="<"])), mean)
Group.1 x
1 2019-01-31 1.0714000
2 2019-02-28 0.7020623
3 2019-05-02 1.6000000
4 2019-05-20 4.5563282
For the other statistics, replace mean
accordingly.
Upvotes: 0