LetEpsilonBeLessThanZero
LetEpsilonBeLessThanZero

Reputation: 2403

How can I manipulate data in this specific way?

Apologies for the vague title, but part of the issue is that I don't know the jargon to describe what I'm trying to do here.

I have data that looks like this:

STATE   COUNT   DATE
LA       20     20160401
ME       83     20160601
MD        4     20160701

And what I want to do is be able to pick a date (let's say 20160301) and manipulate my data into the following:

STATE   COUNT   DATE      NEWER_THAN
LA       20     20160401  20160301
LA       20     20160401  20160401  
ME       83     20160601  20160301
ME       83     20160601  20160401
ME       83     20160601  20160501
ME       83     20160601  20160601
MD        4     20160701  20160301
MD        4     20160701  20160401
MD        4     20160701  20160501
MD        4     20160701  20160601
MD        4     20160701  20160701

So LA had a date of 20160401 in the first table, so it ends up with 2 rows in the second table since March and April are on or after the date I picked of 20160301.

Likewise, ME had a date of 201601 in the first table, so it ends up with 4 rows in the second table since March, April, May, and June are on or after the date I picked of 20160301.

And my question is whether or not anyone can get me started on how I might achieve this type of data manipulation in R.

Upvotes: 0

Views: 64

Answers (1)

You-leee
You-leee

Reputation: 560

If you convert your "DATE" field into an actual Date object, you can get the difference between 2 Dates by substracting them. And from that difference you know, how many times you have to repeat each row. Following your eaxmple:

dataset <- data.frame(STATE = c("LA", "ME", "MD"), 
                  COUNT = c(20, 83, 4), 
                  DATE = c("20160401", "20160601", "20160701"), 
                  stringsAsFactors = F)

format <- "%Y%m%d"
dataset$DATE <- strptime(dataset$DATE, format)

newDate <- strptime("20160301", format)

n.times <- ceiling(difftime(dataset$DATE, newDate, units = "days")/30)
dataset[rep(seq_len(nrow(dataset)), n.times), ]

Will result in the following:

    STATE COUNT       DATE
     LA    20 2016-04-01
     LA    20 2016-04-01
     ME    83 2016-06-01
     ME    83 2016-06-01
     ME    83 2016-06-01
     ME    83 2016-06-01
     MD     4 2016-07-01
     MD     4 2016-07-01
     MD     4 2016-07-01
     MD     4 2016-07-01
     MD     4 2016-07-01

Upvotes: 1

Related Questions