Reputation: 307
I'm trying to get a value from data with several conditions. The data I got is 1 file in 1 month and the data that I got was not in sequential time. The data looks like this
measure value
1 Station identifier WAML
2 Station number 97072
3 Observation time 150101/0000
...
27 Mean mixed layer potential temperature 298.68
28 Mean mixed layer mixing ratio 16.77
29 1000 hPa to 500 hPa thickness 5773.00
30 Precipitable water [mm] for entire sounding 55.86
31 Station identifier WAML
32 Station number 97072
33 Observation time 150109/1200
...
57 Mean mixed layer potential temperature 300.78
58 Mean mixed layer mixing ratio 16.29
59 1000 hPa to 500 hPa thickness 5784.00
60 Precipitable water [mm] for entire sounding 52.46
61 Station identifier WAML
62 Station number 97072
63 Observation time 150110/0000
...
87 Mean mixed layer potential temperature 297.48
88 Mean mixed layer mixing ratio 16.55
89 1000 hPa to 500 hPa thickness 5760.00
90 Station identifier WAML
91 Station number 97072
92 Observation time 150110/1200
...
The data and I want to filter by "Observation time" and "Precipitable water [mm] for entire sounding", so I can get the value. But there are cases at one time observation there is no Precipitable water data and only Observation time with other parameters.
I tried using :
df1 <- dplyr::filter(obs.tpw, grepl(paste(c("Observation time", "Precipitable water [mm] for entire sounding"), collapse = "&"), paste(measure, value, sep = "_")))
But there is no data in there,
How I can get the data only Observation time and Precipitable water parameter only with value and then arrange them in a sequential time. Observation time value is 'data'/'time'
, 150101 is (year)(month)(day)/(hour)(minute). The data that I got was not sorted by date and hour. For example the first observation time is 150101/0000, then the second time is 150109/1200, the second time should be 150101/1200 because in one day there are 2 times the observations (0000 and 1200)
the final data that i want to look like this:
measure value
1 Observation time 150101/0000
2 Precipitable water [mm] for entire sounding 55.86
3 Observation time 150101/1200
4 Precipitable water [mm] for entire sounding 52.46
5 Observation time 150102/0000
6 Precipitable water [mm] for entire sounding 61.15
7 Observation time 150102/1200
8 Precipitable water [mm] for entire sounding 55.93
9 Observation time 150103/0000
10 Precipitable water [mm] for entire sounding 52.25
11 Observation time 150103/1200
12 Precipitable water [mm] for entire sounding 61.48
13 Observation time 150104/0000
14 Precipitable water [mm] for entire sounding NA
15 Observation time 150104/1200
16 Precipitable water [mm] for entire sounding 61.92
17 Observation time 150105/0000
18 Precipitable water [mm] for entire sounding NA
19 Observation time 150105/1200
20 Precipitable water [mm] for entire sounding 57.42
Upvotes: 0
Views: 73
Reputation: 306
I have made the following assumptions that were not clear in your question above (I will modify my answer as needed should these be incorrect):
Station identifier
, Station number
, and Observation time
Observation time
, but I am guessing it is something along the lines of 'date'/'time'
, where 'date'
is an integer sequence that refers to the number of days after a certain reference date.First, try to include a reproducible dataset in these questions or a link to the publicly available data:
# Create Reproducible Dataset ---------------------------------------------
measure <- c("Station identifier",
"Station number",
"Observation time", "Mean mixed layer potential temperature",
"Mean mixed layer mixing ratio", "1000 hPa to 500 hPa thickness",
"Precipitable water [mm] for entire sounding", "Station identifier",
"Station number", "Observation time",
"Mean mixed layer potential temperature",
"Mean mixed layer mixing ratio", "1000 hPa to 500 hPa thickness",
"Precipitable water [mm] for entire sounding", "Station identifier",
"Station number", "Observation time",
"Mean mixed layer potential temperature",
"Mean mixed layer mixing ratio",
"1000 hPa to 500 hPa thickness", "Station identifier",
"Station number", "Observation time")
value <- c("WAML", "97072", "150101/0000", "298.68", "16.77", "5773.00", "55.86",
"WAML", "97072", "150109/1200", "300.78", "16.29", "5784.00", "52.46",
"WAML", "97072", "150110/0000", "297.48", "16.55", "5760.00", "WAML",
"97072", "150110/1200")
df <- data.frame(measure = measure, value = value, stringsAsFactors = FALSE)
Now to your question:
# Solution ----------------------------------------------------------------
# Create index of rows where `measure == "Station identifier"`
idx <- which(df$measure == "Station identifier")
df %>%
# Create Unique Identifier for each station
dplyr::mutate(station_id = cut(1:nrow(df),
c(idx, nrow(df)),
right = FALSE,
include.lowest = TRUE)) %>%
dplyr::filter(measure %in% c("Observation time",
"Precipitable water [mm] for entire sounding")) %>%
# Turn each value in measure to a new column
tidyr::pivot_wider(names_from = "measure", values_from = "value", ) %>%
# Inelegant way of sorting by date and time
dplyr::mutate(ot = as.numeric(sub("\\/", ".", `Observation time`))) %>%
dplyr::arrange(ot) %>%
dplyr::select(-ot) %>%
tidyr::drop_na()
Finally, I would like to point out that though you will likely be able to parse and analyze this data just fine using the tidyverse
brand of packages, if your field of study requires frequent use of geospatial, spatiotemporal, or atmospheric data, it seems there is already a large collection of R packages that are specially built for this purpose. I have absolutely no experience in this regard, but from my brief search, the spacetime package on CRAN seems promising as it may be able to handle data in this format. Another resource that may be useful is the following primer by Edzer Pebesma.
I hope this is useful.
Upvotes: 2