Sylvain

Reputation: 143

Statistics after conditional filtering

Apologies in advance if this has already been asked elsewhere.

I have multiple data frames (25 columns, > 1000 rows) like this:

> head(Amsterdam_CO2 <- Amsterdam %>% filter(Chemicals == "CO2"))
    Sample_ID Locality.Name Chemicals   End.Date Less.Than Activity.Value Measuring.Unit
1 VKB19xxxxxx     Amsterdam       CO2 2019-01-31         <      1.0714000           g/m³
2 VKB19xxxxxx     Amsterdam       CO2 2019-02-18                3.4609000           g/m³
3 VKB1xxxxxxx     Amsterdam       CO2 2019-02-28         <      0.7020623           g/m³
4 VKB1xxxxxxx     Amsterdam       CO2 2019-04-25                4.5563282           g/m³
5 VKB1xxxxxxx     Amsterdam       CO2 2019-05-20                1.6000000           g/m³
6 VKB1xxxxxxx     Amsterdam       CO2 2019-05-22         <      0.6000000           g/m³

I wish to obtain the mean, max, min and sd of Activity.Value, grouped by month, taking values into account only when Less.Than is not "<" (the "<" means the value is below the detection limit and should not be used for the statistics). This example displays one or two values per month, but there are hundreds. So R must return the 4 statistics of Activity.Value for each month (in a vector or whatever). If there is no value above the detection limit for a given month, then R must return "-" for the 4 statistics.

In addition, I would like R to return the mean, over the whole year (not grouped by month), of all the values that were not taken into account because Less.Than is "<".

I made different attempts but nothing worked correctly, so I prefer to ask for your help.

How do I exclude the rows where Less.Than == "<"?
To filter per month I have already tried %>% filter(grepl("2019-01", End.Date)) 12 times, but if possible I prefer to avoid doing that manually, because I have other data frames to run the same analysis on.
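The twelve repeated filters can be avoided by deriving a month key from the date once and grouping on it. A minimal sketch on a hypothetical mini data frame (column names follow the question; the data are made up):

```r
library(dplyr)

# Hypothetical mini data frame standing in for one city's data
dat <- data.frame(
  End.Date       = as.Date(c("2019-01-31", "2019-02-18", "2019-02-28")),
  Less.Than      = c("<", "", ""),
  Activity.Value = c(1.07, 3.46, 0.70)
)

# One grouped pass instead of twelve filter() calls
per_month <- dat %>%
  group_by(Month = format(End.Date, "%Y-%m")) %>%
  summarise(n = n())
```

The same pipeline then works unchanged for every data frame, whatever months it contains.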

Upvotes: 0

Views: 90

Answers (2)

MarBlo

Reputation: 4522

Unfortunately the amount of data you have shown is very limited. I have taken only the second half of the columns from above, as the first half is not needed for your question.

The column Less.Than is changed with mutate so that NA is introduced wherever the < is missing. Because your data frame is very small, I have added one extra row. The result for sd shows NA where there are not enough data points, as you can see from the number of observations n in the summary.

Then filter for the rows which have NA in Less.Than, group by month and use summarise from dplyr.


Edit: the df is a data.frame, as you can see from class(df). You may also have a look here. I have now shown both choices: the first filters for all the NA's in Less.Than, i.e. the rows which do not have a "<"; the second does the opposite. Both are grouped by month. Please keep in mind that I have added one row to the data so that sd is non-NA at least once.

library(tidyverse)

df <- tribble(
  ~End.Date,  ~Less.Than, ~Activity.Value, 
   '2019-01-31',  '<' ,     1.0714000,                                                      
   '2019-02-18',   '' ,     3.4609000,                                                     
   '2019-02-28',  '<' ,     0.7020623,                                                      
   '2019-04-25',  '' ,     4.5563282,                                                      
   '2019-05-20',  '' ,     1.6000000,                                                       
   '2019-05-22',  '<' ,     0.6000000,
   '2019-05-22',  '<' ,     0.7000000
)
df$End.Date <- as.Date(df$End.Date)

df
#> # A tibble: 7 x 3
#>   End.Date   Less.Than Activity.Value
#>   <date>     <chr>              <dbl>
#> 1 2019-01-31 "<"                1.07 
#> 2 2019-02-18 ""                 3.46 
#> 3 2019-02-28 "<"                0.702
#> 4 2019-04-25 ""                 4.56 
#> 5 2019-05-20 ""                 1.6  
#> 6 2019-05-22 "<"                0.6  
#> 7 2019-05-22 "<"                0.7
# here you can see that the df is a data.frame
class(df)
#> [1] "tbl_df"     "tbl"        "data.frame"

df %>% 
  mutate(Less.Than = ifelse(Less.Than != '<', NA, Less.Than)) %>% 
  # what follows filters the rows which contain NA
  dplyr::filter(is.na(Less.Than)) %>% 
  group_by(months(End.Date)) %>% 
  summarise(
    sum = sum(Activity.Value),
    min = min(Activity.Value),
    sd = sd(Activity.Value),
    n = n())
#> # A tibble: 3 x 5
#>   `months(End.Date)`   sum   min    sd     n
#>   <chr>              <dbl> <dbl> <dbl> <int>
#> 1 April               4.56  4.56    NA     1
#> 2 Februar             3.46  3.46    NA     1
#> 3 Mai                 1.6   1.6     NA     1

df %>% 
  mutate(Less.Than = ifelse(Less.Than != '<', NA, Less.Than)) %>% 
  # what follows filters the rows which DO NOT contain NA
  # or in your words these rows possess a "<"
  dplyr::filter(!is.na(Less.Than)) %>% 
  group_by(months(End.Date)) %>% 
  summarise(
    sum = sum(Activity.Value),
    min = min(Activity.Value),
    sd = sd(Activity.Value),
    n = n())
#> # A tibble: 3 x 5
#>   `months(End.Date)`   sum   min      sd     n
#>   <chr>              <dbl> <dbl>   <dbl> <int>
#> 1 Februar            0.702 0.702 NA          1
#> 2 Januar             1.07  1.07  NA          1
#> 3 Mai                1.30  0.6    0.0707     2

Created on 2020-06-15 by the reprex package (v0.3.0)
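To get exactly the four statistics asked for (mean, max, min, sd) plus the year-long mean of the excluded "<" values, one possible sketch on the same data (grouping with format(End.Date, "%Y-%m"), which is locale-independent and sorts chronologically, unlike months(); months with no usable value simply don't appear and could be filled in with "-" afterwards):

```r
library(dplyr)

df <- data.frame(
  End.Date       = as.Date(c("2019-01-31", "2019-02-18", "2019-02-28",
                             "2019-04-25", "2019-05-20", "2019-05-22",
                             "2019-05-22")),
  Less.Than      = c("<", "", "<", "", "", "<", "<"),
  Activity.Value = c(1.0714000, 3.4609000, 0.7020623,
                     4.5563282, 1.6000000, 0.6000000, 0.7000000)
)

# Monthly statistics of the values above the detection limit
monthly <- df %>%
  filter(Less.Than != "<") %>%
  group_by(Month = format(End.Date, "%Y-%m")) %>%
  summarise(mean = mean(Activity.Value),
            max  = max(Activity.Value),
            min  = min(Activity.Value),
            sd   = sd(Activity.Value),
            n    = n())

# Yearly mean of the values excluded because Less.Than == "<"
below_limit_mean <- df %>%
  filter(Less.Than == "<") %>%
  summarise(mean = mean(Activity.Value))
```

With this toy data each month keeps only one value, so sd is NA; with hundreds of rows per month, as in the real data, it will be computed normally.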

Upvotes: 1

Chris Ruehlemann

Reputation: 21400

You can use aggregate:

Reproducible data:

df <- data.frame(
  Date = c("2019-01-31", "2019-02-18", "2019-02-28", "2019-04-25", "2019-05-20", "2019-05-02"),
  Less.than = c("", "<", "", "<", "", ""),
  Activity.level = c(1.0714000, 3.4609000, 0.7020623, 0.7020623, 4.5563282, 1.6000000)) 

Solution to obtain mean:

aggregate(df$Activity.level[!df$Less.than=="<"], by = list(substr(df$Date[!df$Less.than=="<"], 1, 7)), mean)
  Group.1         x
1 2019-01 1.0714000
2 2019-02 0.7020623
3 2019-05 3.0781641

For the other statistics, replace mean accordingly.
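All four statistics can also come out of a single aggregate call, by passing a function that returns a named vector (a sketch reusing the reproducible data above; sd is NA for months with only one value):

```r
# Reproducible data from the answer above
df <- data.frame(
  Date = c("2019-01-31", "2019-02-18", "2019-02-28",
           "2019-04-25", "2019-05-20", "2019-05-02"),
  Less.than = c("", "<", "", "<", "", ""),
  Activity.level = c(1.0714000, 3.4609000, 0.7020623,
                     0.7020623, 4.5563282, 1.6000000)
)

keep <- df$Less.than != "<"                 # values above the detection limit

# One aggregate call; grouping key is the year-month part of the date
stats <- aggregate(
  df$Activity.level[keep],
  by  = list(Month = substr(df$Date[keep], 1, 7)),
  FUN = function(x) c(mean = mean(x), max = max(x),
                      min = min(x), sd = sd(x))
)
# 2019-05 keeps two values, so its mean is (4.5563282 + 1.6) / 2
```

The result column x is a matrix whose columns are mean, max, min and sd.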

Upvotes: 0
