Reputation: 101
I have a long time series where I need to identify and flag sequences of repeated values. Here's some data:
DATETIME WDIR
1 40360.04 22
2 40360.08 23
3 40360.12 126
4 40360.17 126
5 40360.21 126
6 40360.25 126
7 40360.29 25
8 40360.33 26
9 40360.38 132
10 40360.42 132
11 40360.46 132
12 40360.50 30
13 40360.54 132
14 40360.58 35
So if I need to note when a value is repeated three or more times, I have a sequence of four '126' and a sequence of three '132' that need to be flagged.
I'm very new to R. I expect I use cbind to create a new column in this array with a "T" in the corresponding rows, but how to populate the column correctly is a mystery. Any pointers please? Thanks a bunch.
Upvotes: 10
Views: 6633
Reputation: 9380
Two options for you.
Assuming the data is loaded:
dat <- read.table(textConnection("
DATETIME WDIR
40360.04 22
40360.08 23
40360.12 126
40360.17 126
40360.21 126
40360.25 126
40360.29 25
40360.33 26
40360.38 132
40360.42 132
40360.46 132
40360.50 30
40360.54 132
40360.58 35"), header=TRUE)
Option 1: Sorting
dat <- dat[order(dat$WDIR),] # sort so the repeated counts line up with the correct rows in the next step
dat$count <- rep(table(dat$WDIR), table(dat$WDIR)) # total occurrences of each value, repeated once per row
dat$flag <- dat$count >= 3 # TRUE where the value occurs three or more times
dat <- dat[order(dat$DATETIME),] # sort back to original order
dat
Option 2: One-liner
dat$flag <- dat$WDIR %in% names(which(table(dat$WDIR) >= 3))
dat
Since you're a new user, I thought Option 1 might be the easier step-by-step approach, although the rep(table(), table()) step may not be intuitive at first.
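To unpack the rep(table(), table()) step, here is a minimal sketch on the WDIR values from the question (the vector name wdir and the other variable names are just for illustration). It also shows ave() as one way to get the same per-row counts without sorting first. One caveat worth knowing: table() counts every occurrence of a value wherever it appears, so these counts reflect overall frequency, not consecutive runs.

```r
wdir <- c(22, 23, 126, 126, 126, 126, 25, 26, 132, 132, 132, 30, 132, 35)

# table() tallies each distinct value; its entries are in sorted value order
tab <- table(wdir)

# Repeating each count by itself yields one count per row of the *sorted* data
sorted_counts <- rep(as.vector(tab), times = as.vector(tab))

# ave() computes the same group sizes but keeps the original row order
counts <- ave(wdir, wdir, FUN = length)

counts[13]  # 4: the isolated 132 in row 13 is counted with the run of three
```

So with this approach row 13 ends up flagged along with rows 9 to 11, which may or may not be what you want.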
Upvotes: 1
Reputation: 173667
As Ramnath says, you can use rle.
rle(dat$WDIR)
Run Length Encoding
lengths: int [1:9] 1 1 4 1 1 3 1 1 1
values : int [1:9] 22 23 126 25 26 132 30 132 35
rle returns an object with two components, lengths and values. We can use the lengths component to build a new column that identifies which values are repeated three or more times in a row.
tmp <- rle(dat$WDIR)
rep(tmp$lengths >= 3,times = tmp$lengths)
[1] FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
This will be our new column.
newCol <- rep(tmp$lengths >= 3,times = tmp$lengths)
cbind(dat,newCol)
DATETIME WDIR newCol
1 40360.04 22 FALSE
2 40360.08 23 FALSE
3 40360.12 126 TRUE
4 40360.17 126 TRUE
5 40360.21 126 TRUE
6 40360.25 126 TRUE
7 40360.29 25 FALSE
8 40360.33 26 FALSE
9 40360.38 132 TRUE
10 40360.42 132 TRUE
11 40360.46 132 TRUE
12 40360.50 30 FALSE
13 40360.54 132 FALSE
14 40360.58 35 FALSE
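If the threshold is likely to change, the rle/rep steps above can be wrapped in a small helper. This is only a sketch: flag_runs and its min_len argument are names made up here, not part of base R.

```r
# Hypothetical helper (not a base R function): TRUE for every element
# that belongs to a run of at least min_len identical consecutive values
flag_runs <- function(x, min_len = 3) {
  runs <- rle(x)
  # Expand the per-run test back out to one value per original element
  rep(runs$lengths >= min_len, times = runs$lengths)
}

wdir <- c(22, 23, 126, 126, 126, 126, 25, 26, 132, 132, 132, 30, 132, 35)
flagged <- flag_runs(wdir)
which(flagged)  # 3 4 5 6 9 10 11
```

With a data frame this would be used as dat$newCol <- flag_runs(dat$WDIR).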
Upvotes: 11
Reputation: 55715
Use rle to do the job!! It is an amazing function that calculates the number of successive repetitions of numbers in a sequence. Here is some example code showing how you can use rle to flag the miscreants in your data. This will return all rows from the data frame that belong to a run where WDIR is repeated 3 or more times successively.
runs <- rle(mydf$WDIR)
# expand the per-run test to one TRUE/FALSE per row, so an isolated repeat of a value is not picked up
subset(mydf, rep(runs$lengths >= 3, times = runs$lengths))
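One thing to watch for when filtering on run values: a value that forms a long run in one place can also turn up on its own elsewhere (132 does this in the sample data). Here is a sketch, assuming the question's data sits in a data frame called mydf, that compares matching on values with expanding the run lengths back to row positions. The names by_value and by_position are just for illustration.

```r
mydf <- data.frame(
  DATETIME = c(40360.04, 40360.08, 40360.12, 40360.17, 40360.21, 40360.25,
               40360.29, 40360.33, 40360.38, 40360.42, 40360.46, 40360.50,
               40360.54, 40360.58),
  WDIR = c(22, 23, 126, 126, 126, 126, 25, 26, 132, 132, 132, 30, 132, 35)
)

runs <- rle(mydf$WDIR)

# Matches any row whose value has a long run somewhere in the series
by_value <- mydf$WDIR %in% runs$values[runs$lengths >= 3]

# Matches only the rows that actually sit inside a long run
by_position <- rep(runs$lengths >= 3, times = runs$lengths)

# Rows where the two approaches disagree: row 13, the lone 132
mydf[by_value != by_position, ]
```

Which one you want depends on whether the stray 132 in row 13 should count as part of the problem or not.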
Upvotes: 8