Toolbox
Toolbox

Reputation: 2491

Clean column from duplicates being in blocks

Question: How can I clean data from "duplicates" within blocks. I use the term [blocks] to illustrate that 2 values (in same column) are equal and are positioned either above of below.

In column [c1] I have the values [2] and [3].

Value [2] should never have value [2] under.

Value [3] should never have value [3] under.

I cannot use a standard duplication removal function, because there will be duplicates in the column. It is not possible to delete the rows manual since they will be in amount of thousands.

If possible it would be good to solve without loading any R packages.

My R-file:

##########
# Test xts
##########
dates <- as.POSIXct(c
                    (
                      "2013-07-24 09:01:00", 
                      "2013-07-24 09:02:00", 
                      "2013-07-24 09:03:00",
                      "2013-07-24 09:04:00",
                      "2013-07-24 09:05:00",
                      "2013-07-24 09:06:00",
                      "2013-07-24 09:07:00"
                      )
                    )
c1 <- c(2,3,2,2,3,3,2)              # Data in c1.
# c2 <- c(0,3,2,2,3,0,2)            # Data in c2.
data <- data.frame(c1)              # Create a dataframe.
xts9 <- xts(x=data, order.by=dates) # Create xts based on dataframe.

The result of running the R-file:

                    c1
2013-07-24 09:01:00  2
2013-07-24 09:02:00  3
2013-07-24 09:03:00  2
2013-07-24 09:04:00  2
2013-07-24 09:05:00  3
2013-07-24 09:06:00  3
2013-07-24 09:07:00  2

Comments of which lines should be deleted:

                    c1
2013-07-24 09:01:00  2
2013-07-24 09:02:00  3
2013-07-24 09:03:00  2
2013-07-24 09:04:00  2 # To be remove due to having a 2 above.
2013-07-24 09:05:00  3
2013-07-24 09:06:00  3 # To be remove due to having a 2 above.
2013-07-24 09:07:00  2

Upvotes: 1

Views: 43

Answers (1)

Ronak Shah
Ronak Shah

Reputation: 389175

We can use rleid function from data.table and then use duplicated to remove the repeating rows.

library(data.table)
xts9[!duplicated(rleid(xts9)), ]

#                    c1
#2013-07-24 09:01:00  2
#2013-07-24 09:02:00  3
#2013-07-24 09:03:00  2
#2013-07-24 09:05:00  3
#2013-07-24 09:07:00  2

If you want to do this in base R, we can use rle instead using the same logic

x <- rle(rowSums(xts9))
xts9[!duplicated(rep(seq_along(x$values), x$lengths)), ]

#                    c1
#2013-07-24 09:01:00  2
#2013-07-24 09:02:00  3
#2013-07-24 09:03:00  2
#2013-07-24 09:05:00  3
#2013-07-24 09:07:00  2

Upvotes: 1

Related Questions