Miguel 2488

Reputation: 1440

Removing outliers by filtering values in R

I have a data frame like this:

            ds        y
1   2015-12-31 35.59050
2   2016-01-01 28.75111
3   2016-01-04 25.53158
4   2016-01-06 17.75369
5   2016-01-07 29.01500
6   2016-01-08 29.22663
7   2016-01-09 29.05249
8   2016-01-10 27.54387
9   2016-01-11 28.05674
10  2016-01-12 29.00901
11  2016-01-13 31.66441
12  2016-01-14 29.18520
13  2016-01-15 29.79364
14  2016-01-16 30.07852

I'm trying to create a loop that removes the rows whose values in the 'y' column are above 34 or below 26, because that is where my outliers are:

for (i in grupo$y){if (i < 26) {grupo$y[i] = NA}}

I tried this to remove the rows below 26. I don't get any errors, but those rows are not removed.

Any suggestions on how to remove those outliers?

Thanks in advance

Upvotes: 0

Views: 3475

Answers (2)

camille

Reputation: 16862

Here are a base R solution and a tidyverse solution. One of R's strengths is that it works on whole vectors by default, so a problem like this one often doesn't need a for loop. The issue with your loop is that it assigns NA to values. That doesn't remove those rows; it only replaces their values with NA.
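To make that difference concrete, here is a minimal sketch that rebuilds the data from the question (the values are copied from the post; `with_na` and `cleaned` are names introduced only for this illustration). It also helps to note that in the loop from the question, `i` takes the values of `y` (for example 25.53158) rather than row positions, so `grupo$y[i]` does not address the intended row.

```r
# Rebuild the example data from the question.
grupo <- data.frame(
  ds = as.Date(c("2015-12-31", "2016-01-01", "2016-01-04", "2016-01-06",
                 "2016-01-07", "2016-01-08", "2016-01-09", "2016-01-10",
                 "2016-01-11", "2016-01-12", "2016-01-13", "2016-01-14",
                 "2016-01-15", "2016-01-16")),
  y = c(35.59050, 28.75111, 25.53158, 17.75369, 29.01500, 29.22663,
        29.05249, 27.54387, 28.05674, 29.00901, 31.66441, 29.18520,
        29.79364, 30.07852)
)

# Vectorized NA assignment: the values change, but every row is still there.
with_na <- grupo
with_na$y[with_na$y < 26 | with_na$y > 34] <- NA
nrow(with_na)          # 14
sum(is.na(with_na$y))  # 3

# Dropping the rows takes a separate step, e.g. keeping only non-NA rows.
cleaned <- with_na[!is.na(with_na$y), ]
nrow(cleaned)          # 11
```

The two approaches below skip the NA step entirely and filter the rows in one go.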

In base R, you can use subset to get the rows or columns of a data frame that meet certain criteria:

subset(grupo, y >= 26 & y <= 34)
#> # A tibble: 11 x 2
#>    ds             y
#>    <date>     <dbl>
#>  1 2016-01-01  28.8
#>  2 2016-01-07  29.0
#>  3 2016-01-08  29.2
#>  4 2016-01-09  29.1
#>  5 2016-01-10  27.5
#>  6 2016-01-11  28.1
#>  7 2016-01-12  29.0
#>  8 2016-01-13  31.7
#>  9 2016-01-14  29.2
#> 10 2016-01-15  29.8
#> 11 2016-01-16  30.1
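For use inside functions or scripts, the same filter can be written with bracket indexing; the help page for `subset` describes it as a convenience function intended for interactive use. The following is only a sketch on the question's data, and `keep`, `by_subset`, and `by_bracket` are names made up for the comparison.

```r
# Rebuild the example data from the question.
grupo <- data.frame(
  ds = as.Date(c("2015-12-31", "2016-01-01", "2016-01-04", "2016-01-06",
                 "2016-01-07", "2016-01-08", "2016-01-09", "2016-01-10",
                 "2016-01-11", "2016-01-12", "2016-01-13", "2016-01-14",
                 "2016-01-15", "2016-01-16")),
  y = c(35.59050, 28.75111, 25.53158, 17.75369, 29.01500, 29.22663,
        29.05249, 27.54387, 28.05674, 29.00901, 31.66441, 29.18520,
        29.79364, 30.07852)
)

by_subset <- subset(grupo, y >= 26 & y <= 34)

# Logical vector marking the rows to keep; which() turns it into row
# positions and silently skips any NA in the condition, as subset() does.
keep <- grupo$y >= 26 & grupo$y <= 34
by_bracket <- grupo[which(keep), ]

identical(by_subset$y, by_bracket$y)  # TRUE
```

Indexing with the bare logical vector (`grupo[keep, ]`) gives the same rows here, but if `y` contained missing values it would return all-NA rows for them, which is why the sketch wraps the condition in `which()`.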

Or using dplyr functions, you can filter your data similarly, and make use of dplyr::between. between(y, 26, 34) is a shorthand for y >= 26 & y <= 34.

library(dplyr)

grupo %>%
  filter(between(y, 26, 34))
#> # A tibble: 11 x 2
#>    ds             y
#>    <date>     <dbl>
#>  1 2016-01-01  28.8
#>  2 2016-01-07  29.0
#>  3 2016-01-08  29.2
#>  4 2016-01-09  29.1
#>  5 2016-01-10  27.5
#>  6 2016-01-11  28.1
#>  7 2016-01-12  29.0
#>  8 2016-01-13  31.7
#>  9 2016-01-14  29.2
#> 10 2016-01-15  29.8
#> 11 2016-01-16  30.1
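One detail worth checking is that `between` includes both endpoints. A quick sketch on a made-up vector (the values in `y` below are hypothetical and chosen to sit on and around the limits):

```r
library(dplyr)

# Hypothetical values just outside, exactly on, and inside the limits.
y <- c(25.9, 26, 30, 34, 34.1)

between(y, 26, 34)
#> [1] FALSE  TRUE  TRUE  TRUE FALSE

# Same result as spelling out both comparisons.
y >= 26 & y <= 34
#> [1] FALSE  TRUE  TRUE  TRUE FALSE
```

If the limits themselves should count as outliers, the explicit form with `>` and `<` is the one to use.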

Upvotes: 3

Lennyy

Reputation: 6132

With dplyr you could do:

library(dplyr)
# keep only the rows where y lies within [26, 34]
grupo %>%
  filter(y >= 26 & y <= 34)

           ds        y
1  2016-01-01 28.75111
2  2016-01-07 29.01500
3  2016-01-08 29.22663
4  2016-01-09 29.05249
5  2016-01-10 27.54387
6  2016-01-11 28.05674
7  2016-01-12 29.00901
8  2016-01-13 31.66441
9  2016-01-14 29.18520
10 2016-01-15 29.79364
11 2016-01-16 30.07852

Upvotes: 2
