Tymo
Tymo

Reputation: 259

Remove values from dataframe with loop r

I have next question. Its rather simple but I cant figure out where I do wrong. I have one year of observations, but the problem is that in the end of the year after 31.12.16 i have shift to 01.01.17 and then repetitions from 2016 year. This can be clearly seen if you look at lines 8572-8574. Therefore, I want to remove those values(everything after 2017 with 2017). My overall database consists of many years to analyse. Therefore manual removing is not the option.

           dato horiginal hour mday mon year wday wk
8569 31.12.2016        64   20   31  11  116    6 53
8570 31.12.2016        70   21   31  11  116    6 53
8571 31.12.2016        71   22   31  11  116    6 53
8572 31.12.2016        71   23   31  11  116    6 53
8573 01.01.2017        78    0    1   0  117    0 53
8574 30.10.2016        46   12   30   9  116    0 44
8575 30.10.2016        38   13   30   9  116    0 44
8576 30.10.2016        35   14   30   9  116    0 44
8577 30.10.2016        36   15   30   9  116    0 44
8578 30.10.2016        46   12   30   9  116    0 44
8579 30.10.2016        38   13   30   9  116    0 44
'data.frame':   8629 obs. of  8 variables:
 $ dato     : chr  "01.01.2016" "01.01.2016" "01.01.2016" "01.01.2016" ...
 $ horiginal: num  76 79 78 74 75 71 74 72 71 77 ...
 $ hour     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ mday     : int  1 1 1 1 1 1 1 1 1 1 ...
 $ mon      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ year     : int  116 116 116 116 116 116 116 116 116 116 ...
 $ wday     : int  5 5 5 5 5 5 5 5 5 5 ...
 $ wk       : num  1 1 1 1 1 1 1 1 1 1 ...

The last column "wk" is meant week and it the only increasing column in this dataframe (started from 1). My thought process is to make a loop and remove everything if observation is lower than previous values.

I know that looping here probably is not very efficient way, but nothing comes to my mind so far... so, here is my code:

for (i in 1:NROW(newdf_heat$wk)) {
  total.coefs = data.frame()
  if (newdf_heat$wk[i+1]< newdf_heat$wk[i]) {
   total.coefs = newdf_heat[-c(i:nrow(newdf_heat)),]}
  }

Im doing something wrong because nothing happens. I would appreciate any help.

Upvotes: 1

Views: 271

Answers (2)

kath
kath

Reputation: 7724

Has you have seveal year, you could the following:

# Extract the cumulative maximum year
my_df$max_year <- cummax(my_df$year)

# Filter those rows which are at least as big as the max_year
my_df[my_df$year >= my_df$max_year, ]

#            dato horiginal hour mday mon year wday wk max_year
# 8569 31.12.2016        64   20   31  11  116    6 53      116
# 8570 31.12.2016        70   21   31  11  116    6 53      116
# 8571 31.12.2016        71   22   31  11  116    6 53      116
# 8572 31.12.2016        71   23   31  11  116    6 53      116
# 8573 01.01.2017        78    0    1   0  117    0 53      117

If wk is always increasing you could also use this instead of year, or extract the year as Ronak Shah did it in his answer.

Data

my_df <- 
  structure(list(dato = c("31.12.2016", "31.12.2016", "31.12.2016", "31.12.2016", "01.01.2017", "30.10.2016", "30.10.2016", "30.10.2016", "30.10.2016", "30.10.2016", "30.10.2016"), 
                 horiginal = c(64L, 70L, 71L, 71L, 78L, 46L, 38L, 35L, 36L, 46L, 38L), 
                 hour = c(20L, 21L, 22L, 23L, 0L, 12L, 13L, 14L, 15L, 12L, 13L), 
                 mday = c(31L, 31L, 31L, 31L, 1L, 30L, 30L, 30L, 30L, 30L, 30L), 
                 mon = c(11L, 11L, 11L, 11L, 0L, 9L, 9L, 9L, 9L, 9L, 9L), 
                 year = c(116L, 116L, 116L, 116L, 117L, 116L, 116L, 116L, 116L, 116L, 116L), 
                 wday = c(6L, 6L, 6L, 6L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
                 wk = c(53L, 53L, 53L, 53L, 53L, 44L, 44L, 44L, 44L, 44L, 44L)), 
            class = "data.frame", row.names = c("8569", "8570", "8571", "8572", "8573", "8574", "8575", "8576", "8577", "8578", "8579"))

Upvotes: 3

Ronak Shah
Ronak Shah

Reputation: 388817

We can convert the dato column to Date, extract the year, find out the index of row where "2017" occurs for the first time and remove rows from that index.

df[1:which.max(format(as.Date(df$dato, "%d.%m.%Y"), "%Y") == "2017") - 1,]

#           dato horiginal hour mday mon year wday wk
#8569 31.12.2016        64   20   31  11  116    6 53
#8570 31.12.2016        70   21   31  11  116    6 53
#8571 31.12.2016        71   22   31  11  116    6 53
#8572 31.12.2016        71   23   31  11  116    6 53

Or if the year to remove is not "2017" always and the dates are always increasing we can use diff to find out the first index where there is drop in the date and remove everything after that.

df[1:which.max(diff(as.Date(df$dato, "%d.%m.%Y")) < 0) - 1, ]

Also as mentioned in the post wk column is always increasing hence, the above diff logic can be applied for wk column as well which would save us from converting dato to Date

df[1:which.max(diff(df$wk) < 0) - 1,]

Upvotes: 2

Related Questions