Select rows by conditions in R

Question

I'm trying to get a value from data with several conditions. The data I got is 1 file in 1 month and the data that I got was not in sequential time. The data looks like this

                                        measure       value
1                            Station identifier        WAML
2                                Station number       97072
3                              Observation time 150101/0000
...
27       Mean mixed layer potential temperature      298.68
28                Mean mixed layer mixing ratio       16.77
29                1000 hPa to 500 hPa thickness     5773.00
30  Precipitable water [mm] for entire sounding       55.86
31                           Station identifier        WAML
32                               Station number       97072
33                             Observation time 150109/1200
...
57       Mean mixed layer potential temperature      300.78
58                Mean mixed layer mixing ratio       16.29
59                1000 hPa to 500 hPa thickness     5784.00
60  Precipitable water [mm] for entire sounding       52.46
61                           Station identifier        WAML
62                               Station number       97072
63                             Observation time 150110/0000
...
87       Mean mixed layer potential temperature      297.48
88                Mean mixed layer mixing ratio       16.55
89                1000 hPa to 500 hPa thickness     5760.00
90                           Station identifier        WAML
91                               Station number       97072
92                             Observation time 150110/1200
...

The data and I want to filter by "Observation time" and "Precipitable water [mm] for entire sounding", so I can get the value. But there are cases at one time observation there is no Precipitable water data and only Observation time with other parameters.

I tried using :

df1 <-  dplyr::filter(obs.tpw, grepl(paste(c("Observation time", "Precipitable water [mm] for entire sounding"), collapse = "&"), paste(measure, value, sep = "_")))

But there is no data in there,

How I can get the data only Observation time and Precipitable water parameter only with value and then arrange them in a sequential time. Observation time value is 'data'/'time', 150101 is (year)(month)(day)/(hour)(minute). The data that I got was not sorted by date and hour. For example the first observation time is 150101/0000, then the second time is 150109/1200, the second time should be 150101/1200 because in one day there are 2 times the observations (0000 and 1200)

the final data that i want to look like this:

                                       measure       value
1                             Observation time 150101/0000
2  Precipitable water [mm] for entire sounding       55.86
3                             Observation time 150101/1200
4  Precipitable water [mm] for entire sounding       52.46
5                             Observation time 150102/0000
6  Precipitable water [mm] for entire sounding       61.15
7                             Observation time 150102/1200
8  Precipitable water [mm] for entire sounding       55.93
9                             Observation time 150103/0000
10 Precipitable water [mm] for entire sounding       52.25
11                            Observation time 150103/1200
12 Precipitable water [mm] for entire sounding       61.48
13                            Observation time 150104/0000
14 Precipitable water [mm] for entire sounding          NA
15                            Observation time 150104/1200
16 Precipitable water [mm] for entire sounding       61.92
17                            Observation time 150105/0000
18 Precipitable water [mm] for entire sounding          NA
19                            Observation time 150105/1200
20 Precipitable water [mm] for entire sounding       57.42

graggsd · Accepted Answer

I have made the following assumptions that were not clear in your question above (I will modify my answer as needed should these be incorrect):

Unique observations are denoted by a combination of Station identifier, Station number, and Observation time
Every observation contains these three identifiers, and they always occur in the same order directly before data related to that observation
I don't know anything about the time-date format used in Observation time, but I am guessing it is something along the lines of 'date'/'time', where 'date' is an integer sequence that refers to the number of days after a certain reference date.

First, try to include a reproducible dataset in these questions or a link to the publicly available data:

# Create Reproducible Dataset ---------------------------------------------
measure <- c("Station identifier", 
             "Station number", 
             "Observation time", "Mean mixed layer potential temperature", 
             "Mean mixed layer mixing ratio", "1000 hPa to 500 hPa thickness",
             "Precipitable water [mm] for entire sounding", "Station identifier", 
             "Station number", "Observation time", 
             "Mean mixed layer potential temperature",
             "Mean mixed layer mixing ratio", "1000 hPa to 500 hPa thickness", 
             "Precipitable water [mm] for entire sounding", "Station identifier", 
             "Station number", "Observation time", 
             "Mean mixed layer potential temperature", 
             "Mean mixed layer mixing ratio", 
             "1000 hPa to 500 hPa thickness", "Station identifier", 
             "Station number", "Observation time")
value <- c("WAML", "97072", "150101/0000", "298.68", "16.77", "5773.00", "55.86", 
           "WAML", "97072", "150109/1200", "300.78", "16.29", "5784.00", "52.46", 
           "WAML", "97072", "150110/0000", "297.48", "16.55", "5760.00", "WAML", 
           "97072", "150110/1200")
df <- data.frame(measure = measure, value = value, stringsAsFactors = FALSE)

Now to your question:

# Solution ----------------------------------------------------------------

# Create index of rows where `measure == "Station identifier"`
idx <- which(df$measure == "Station identifier")

df %>% 
    # Create Unique Identifier for each station
    dplyr::mutate(station_id = cut(1:nrow(df), 
                                   c(idx, nrow(df)),
                                   right = FALSE, 
                                   include.lowest = TRUE)) %>% 
    dplyr::filter(measure %in% c("Observation time", 
                                 "Precipitable water [mm] for entire sounding")) %>% 
    # Turn each value in measure to a new column
    tidyr::pivot_wider(names_from = "measure", values_from = "value", ) %>% 
    # Inelegant way of sorting by date and time
    dplyr::mutate(ot =  as.numeric(sub("\/", ".", `Observation time`))) %>% 
    dplyr::arrange(ot) %>% 
    dplyr::select(-ot) %>% 
    tidyr::drop_na()

Finally, I would like to point out that though you will likely be able to parse and analyze this data just fine using the tidyverse brand of packages, if your field of study requires frequent use of geospatial, spatiotemporal, or atmospheric data, it seems there is already a large collection of R packages that are specially built for this purpose. I have absolutely no experience in this regard, but from my brief search, the spacetime package on CRAN seems promising as it may be able to handle data in this format. Another resource that may be useful is the following primer by Edzer Pebesma.

I hope this is useful.

Select rows by conditions in R

Answers (1)

Related Questions