Jessm
Jessm

Reputation: 21

How to filter each column of data in a dataframe by a list of respective outliers

I have a dataset of timestamps with respective values (these are not real, just an example):

Name time NOx ozone PM25 temp rain wind methane benzene xylene
1    80  40    62    3   32   79   36      53       8     91
2    13  72    40   73    1   77   36      85      66     25
3    87  83    93    2   51   28   96      28      66     37
4     6  53     3   96   42   57   75      77      77     50
5    38  61    85   36    2    4   43      17      20     59
6    54  15    75   72   81  100   72      62      86     40
7    36   5    70   44   99    1   62      49      48     66
8    73  97     7   87    7    4   59      97      49     36
9    30  89    58   28   91   72   94      85      68     25
10   12  51    78   17   82   67   52      13      67     60

And I have a separate table of some statistics I performed using "summarise", including the threshold for outliers for each parameter (all columns except "Name" and "time"):

 NOx ozone PM25 temp rain wind methane benzene xylene
53    73   67   26   59   52      45      70     75

I want to create a new table that shows the Name, Time, parameter, and value where any specific value was greater than its outlier threshold. I tried this by making both of these dataframes to the "longer" format using pivot_longer and now my main dataframe looks like this:

Name time parameter value
 1    80 NOx          40
 2    80 ozone        62
 3    80 PM25          3
 4    80 temp         32
 5    80 rain         79
 6    80 wind         36
 7    80 methane      53

And my outliers dataframe looks like this:

 parameter value
 NOx          53
 ozone        73
 PM25         67
 temp         26
 rain         59
 wind         52
 methane      45
 benzene      70
 xylene       75

I want my new table of outliers to look like the above "longer" dataframe with the respective "Name" and "time" with each parameter and value. I tried doing this using "filter":

new_a <- A_long %>%
   group_by(parameter) %>%
   filter(A_long$value > B_long$value)
Error: Result must have length 50, not 450

I did receive an error message trying this but I also don't think this code I tried will give me the table I want. Any advice?

Upvotes: 1

Views: 57

Answers (1)

Martin Gal
Martin Gal

Reputation: 16998

Suppose your original dataframe is named df and your outliers are stored in df_outl.

library(dplyr)
library(tidyr)

df %>%
  pivot_longer(cols=-c("Name", "time"), names_to="parameter") %>%
  left_join(df_outl, by="parameter") %>%
  filter(value.x >= value.y) %>%
  select(Name, time, parameter, value:=value.x, -value.y)

which returns

# A tibble: 43 x 4
    Name  time parameter value
   <dbl> <dbl> <chr>     <dbl>
 1     1    80 temp         32
 2     1    80 rain         79
 3     1    80 methane      53
 4     1    80 xylene       91
 5     2    13 NOx          72
 6     2    13 PM25         73
 7     2    13 rain         77
 8     2    13 methane      85
 9     3    87 NOx          83
10     3    87 ozone        93

Upvotes: 1

Related Questions