Elizabeth

Reputation: 101

Identifying sequences of repeated numbers in R

I have a long time series where I need to identify and flag sequences of repeated values. Here's some data:

   DATETIME WDIR
1  40360.04   22
2  40360.08   23
3  40360.12  126
4  40360.17  126
5  40360.21  126
6  40360.25  126
7  40360.29   25
8  40360.33   26
9  40360.38  132
10 40360.42  132
11 40360.46  132
12 40360.50   30
13 40360.54  132
14 40360.58   35

So if I need to note when a value is repeated three or more times, I have a sequence of four '126' and a sequence of three '132' that need to be flagged.

I'm very new to R. I expect I use cbind to create a new column in this array with a "T" in the corresponding rows, but how to populate the column correctly is a mystery. Any pointers please? Thanks a bunch.

Upvotes: 10

Views: 6633

Answers (3)

nzcoops

Reputation: 9380

Two options for you.

Assuming the data is loaded:

dat <- read.table(textConnection("
DATETIME WDIR
40360.04   22
40360.08   23
40360.12  126
40360.17  126
40360.21  126
40360.25  126
40360.29   25
40360.33   26
40360.38  132
40360.42  132
40360.46  132
40360.50   30
40360.54  132
40360.58   35"), header=TRUE)

Option 1: Sorting

dat <- dat[order(dat$WDIR),] # sort by WDIR so the repeated counts line up with the rows in the next step
dat$count <- rep(table(dat$WDIR),table(dat$WDIR)) # total count of each value, repeated once per occurrence
dat$flag <- dat$count >= 3 # TRUE where the value occurs three or more times in total
dat <- dat[order(dat$DATETIME),] # sort back to original order
dat

Option 2: One-liner

dat$flag <- dat$WDIR %in% names(which(table(dat$WDIR) >= 3))
dat

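Since you're new to R, I thought Option 1 might be easier to follow step by step, although rep(table(), table()) may not be intuitive at first. Note that both options count every occurrence of a value in the column, consecutive or not, so the isolated 132 in row 13 is flagged as well.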
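
To make the rep(table(), table()) step less opaque, here is a minimal sketch of the same per-value count computed with ave(), which keeps the original row order so no sorting is needed. The names wdir and counts are illustrative and not part of the answer above.

```r
# Illustrative data: the WDIR column from the question.
wdir <- c(22, 23, 126, 126, 126, 126, 25, 26, 132, 132, 132, 30, 132, 35)

# For each element, count how often its value occurs anywhere in the vector.
# The result is aligned with wdir, one count per element.
counts <- ave(wdir, wdir, FUN = length)

# Flag values that occur three or more times in total (not necessarily in a row).
flag <- counts >= 3
data.frame(wdir, counts, flag)
```

Because the count is per value rather than per run, this is only a substitute for run detection when a value never reappears outside its run.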

Upvotes: 1

joran

Reputation: 173667

As Ramnath says, you can use rle.

rle(dat$WDIR)
Run Length Encoding
  lengths: int [1:9] 1 1 4 1 1 3 1 1 1
  values : int [1:9] 22 23 126 25 26 132 30 132 35

rle returns an object with two components, lengths and values. We can use the lengths component to build a new column that identifies which values are repeated three or more times in a row.

tmp <- rle(dat$WDIR)
rep(tmp$lengths >= 3,times = tmp$lengths)
[1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE

This will be our new column.

newCol <- rep(tmp$lengths >= 3,times = tmp$lengths)
cbind(dat,newCol)
   DATETIME WDIR newCol
1  40360.04   22  FALSE
2  40360.08   23  FALSE
3  40360.12  126   TRUE
4  40360.17  126   TRUE
5  40360.21  126   TRUE
6  40360.25  126   TRUE
7  40360.29   25  FALSE
8  40360.33   26  FALSE
9  40360.38  132   TRUE
10 40360.42  132   TRUE
11 40360.46  132   TRUE
12 40360.50   30  FALSE
13 40360.54  132  FALSE
14 40360.58   35  FALSE
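
If the same check is needed for several columns or thresholds, the steps above can be wrapped in a small function. This is only a sketch: flag_runs and its min_len argument are hypothetical names introduced here, not something from base R.

```r
# Hypothetical helper: TRUE for every element that belongs to a run of
# at least `min_len` identical consecutive values, FALSE otherwise.
flag_runs <- function(x, min_len = 3) {
  runs <- rle(x)
  # One TRUE/FALSE per run, expanded back to one per element.
  rep(runs$lengths >= min_len, times = runs$lengths)
}

# The WDIR column from the question.
wdir <- c(22, 23, 126, 126, 126, 126, 25, 26, 132, 132, 132, 30, 132, 35)
flags <- flag_runs(wdir)
which(flags)  # positions that sit inside a run of three or more
```

With a data frame, the result can be assigned directly, for example dat$newCol <- flag_runs(dat$WDIR).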

Upvotes: 11

Ramnath

Reputation: 55715

Use rle to do the job! It is a handy function that computes the lengths of runs of identical consecutive values in a sequence. Here is some example code showing how you can use rle to flag the miscreants in your data. It returns all rows of the data frame whose WDIR value is repeated three or more times in succession.

runs <- rle(mydf$WDIR)
# expand the per-run test to one TRUE/FALSE per row, then keep the TRUE rows
mydf[rep(runs$lengths >= 3, times = runs$lengths), ]
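
Matching on a run's value and belonging to a run are not the same filter, and the question's data happens to show the difference. The following is a small check under the assumption that mydf holds the question's WDIR column (DATETIME is left out for brevity); by_value and by_run are illustrative names.

```r
# Assumption: mydf mirrors the WDIR column from the question.
mydf <- data.frame(
  WDIR = c(22, 23, 126, 126, 126, 126, 25, 26, 132, 132, 132, 30, 132, 35)
)
runs <- rle(mydf$WDIR)

# Filter by value: keeps every row whose value occurs in some run of length >= 3.
by_value <- which(mydf$WDIR %in% runs$values[runs$lengths >= 3])

# Filter by run membership: keeps only rows that sit inside such a run.
by_run <- which(rep(runs$lengths >= 3, times = runs$lengths))

# Rows selected by value but not part of a long run.
setdiff(by_value, by_run)
```

Here the only difference is the isolated 132 in row 13, which shares its value with the earlier run of three but is not part of it.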

Upvotes: 8
