Gull
Gull

Reputation: 117

Rearrangment of time series data

I am not good at "R" and not sure how to rearrange and subset time series data. Sorry, if this question sounds stupid. I have a time series data of sea tide with four values per day (with missing values as well). Two values for high tide and two values for low tide. The time and date are given in the same column but in different rows. Now, I want to subset the data only for daytime (from 7:00 AM to 7:00 PM) not for night. Then I want to have data arranged in three columns only i) Date, ii) Time and iii) Tide. For Tide, I only need minimum and maximum values. Here is an example of the data and the desired arrangement of data. For each date, data is arranged in three rows similar to the example.

1/1/2011    Low High    Low NA
Time    2:58 AM 9:38 AM 5:19 PM NA
Tide    1.2 m   2.2 m   0.6 m   NA
1/2/2011    High    Low High    Low
Time    2:07 AM 4:22 AM 10:19 AM    6:07 PM
Tide    1.4 m   1.3 m   2.3 m   0.4 m

Date    Time    Tide
1/1/2011    17:19   0.6
1/1/2011    9:38    2.2
1/2/2011    2:07    1.4
1/2/2011    18:07   0.4

Upvotes: 1

Views: 124

Answers (2)

G. Grothendieck
G. Grothendieck

Reputation: 269644

The input, DF is assumed to be as in the Note below.

g, the grouping vector, has one element per row of DF and is equal to c(1, 1, 1, 2, 2, 2, ...). Alternate ways to compute g would be n <- nrow(DF); g <- gl(n, 3, n) or n <- nrow(DF); g <- rep(1:3, n, n).

We then use by to split DF into groups and apply the indicated anonymous function to each group as defined by g.

The anonymous function combines the date and the times in the current group to create the date/times dt making use of the fact that the common date is x[1,1] and the times prior to being cleaned up are in x[2,-1]. Using dt and the tides in x[2, -1] (prior to being cleaned up) it computes each of the three columns arranging them into a data frame. Then there is a commented out line which removes NA values. If you want this uncomment it. Subset the data frame obtained so far to the 7am to 7pm time period and further take the two rows consisting of the min and max tide. We sort that by time.

Finally do.call("rbind", ...) puts the groups together into one overall data frame.

No packages are used.

g <- cumsum(grepl("\\d", DF$V1))
Long <- do.call("rbind", by(DF, g, function(x) {
  dt <- as.POSIXct(paste(x[1,1], as.matrix(x[2, -1])), format = "%m/%d/%Y %I:%M %p")
  X <- data.frame(Date = as.Date(dt),
                  Time = format(dt, "%H:%M"),
                  Tide = as.numeric(sub("m", "", as.matrix(x[3, -1]))),
                  stringsAsFactors = FALSE)
  # X <- na.omit(X)
  X <- subset(X, Time >= "07:00" & Time <= "19:00")
  X <- X[c(which.min(X$Tide), which.max(X$Tide)), ]
  X[order(X$Time), ]
}))

giving the following -- note that the third row in the question's output is not between 7am and 7pm so the output here necessarily differs.

> Long
          Date  Time Tide
1.2 2011-01-01 09:38  2.2
1.3 2011-01-01 17:19  0.6
2.3 2011-01-02 10:19  2.3
2.4 2011-01-02 18:07  0.4

Note: The input DF is assumed to be as follows in reproducible form:

Lines <- "1/1/2011,Low,High,Low,NA
Time,2:58 AM,9:38 AM,5:19 PM,NA
Tide,1.2 m,2.2 m,0.6 m,NA
1/2/2011,High,Low,High,Low
Time,2:07 AM,4:22 AM,10:19 AM,6:07 PM
Tide,1.4 m,1.3 m,2.3 m,0.4 m"
DF <- read.table(text = Lines, sep = ",", as.is = TRUE)

Upvotes: 1

Nicol&#225;s Velasquez
Nicol&#225;s Velasquez

Reputation: 5898

If the list is not too long, this endeavour would be simpler to do in a spreadsheet simply by mapping cells and filtering. But one way to do it in R with zoo and tidyverse is the following:

Assuming that the original dataframes have their columns named as C1:C5

        C1      C2      C3       C4      C5
     <chr>   <chr>   <chr>    <chr>   <chr>
1 1/1/2010     Low    High      Low    <NA>
2     Time 2:58 AM 9:38 AM  5:19 PM    <NA>
3     Tide     1.2     2.2      0.6    <NA>
4 1/2/2011    High     Low     High     Low
5     Time 2:07 AM 4:22 AM 10:19 AM 6:07 PM
6     Tide     1.4     1.3      2.3     0.4

DF <- DF %>% 
        mutate(Date = as.Date(gsub("Tide|Time","", C1), format = "%d/%m/%Y"))
DF <- DF %>% 
        mutate(Date = na.locf(DF$Date, na.rm = TRUE),
                        C1 = gsub("[[:digit:]]|\\/", "", C1),
                        Type = if_else(nchar(C1) == 0, "TideType", C1)) %>% 
              select(Date, Type, C2:C5) %>% 
              gather(oColumn, Value, -c(Date, Type)) %>%
              spread(key = Type, value = Value) %>% 
              select(Date, Time, Tide) %>%
              filter(complete.cases(.))
DF <- DF %>% 
         mutate(Time = ymd_hm(paste(DF$Date, DF$Time, sep = " ")),
                Tide = as.numeric(Tide))
DF <- DF %>% 
         mutate(DayNight = (DF$Time) %within% 
          interval(as.POSIXlt(DF$Date) + (7*60*60), as.POSIXlt(DF$Date) + (19*60*60))) %>% 
        filter(DayNight == TRUE) %>%
        select(-DayNight) ) %>%
        group_by(Date) %>%
        filter(Tide == max(Tide) | min(Tide))

DF
Source: local data frame [4 x 3]
Groups: Date [2]

        Date                Time  Tide
      <date>              <dttm> <dbl>
1 2010-01-01 2010-01-01 09:38:00   2.2
2 2010-01-01 2010-01-01 17:19:00   0.6
3 2011-02-01 2011-02-01 10:19:00   2.3
4 2011-02-01 2011-02-01 18:07:00   0.4

Note that "Date" is a Date type of Object and "Time" is a Posixct type of Date-Time Object. You might want to convert "Time" into a vector of minutes.

Upvotes: 0

Related Questions