Reputation: 51
I'm trying to use R to extract the data table - and a couple of other bits of information - from the historical Met Office data, but despite having spent all evening on Stack Overflow I keep running into problems.
For example, here's the data for sunny (maybe??) Lowestoft:
Lowestoft / Lowestoft Monckton Ave from Sept 2007
Location 654300E 294600N 25m amsl to July 2007
& from Sept 2007 653000E 293800N, Lat 52.483 Lon 1.727, 18m amsl
Estimated data is marked with a * after the value.
Missing data (more than 2 days missing in month) is marked by ---.
Sunshine data taken from an automatic Kipp & Zonen sensor marked with a #, otherwise sunshine data taken from a Campbell Stokes recorder.
yyyy mm tmax tmin af rain sun
degC degC days mm hours
1914 1 5.2 0.7 --- 52.0 ---
1914 2 9.2 3.5 --- 28.0 ---
1914 3 --- --- --- --- ---
1914 4 12.9 5.3 --- 18.0 ---
...
2020 11 12.5* 6.1* 0* 31.9* 73.7* Provisional
2020 12 7.7* 2.9* 6* 105.8* 50.5* Provisional
2021 1 5.8* 1.2* 10* 78.6* 49.4* Provisional
2021 2 7.9* 2.4* 9* 48.6* 84.7* Provisional
The best I've managed so far is to use sed (outside of R) to remove the * and # flags, but importing the result using read.table("lowestoftdata.text", skip = 8, col.names = c("year","month","max_temp", "min_temp", "frost", "rainfall", "sunshine")) falls over when it hits the rows from 2020 onwards that are flagged as Provisional. It would also be really handy to extract the latitude and longitude values, which are usually on line 2 but can be on line 3 if, like Lowestoft, the station moved at some point, but my very limited regex knowledge (and a moving target) is letting me down.
My pseudocode approach would be to:
... but putting this into practice is proving challenging with my limited experience of handling anything other than a nicely formatted CSV file, so any advice on even where to start would be greatly appreciated.
Upvotes: 0
Views: 925
Reputation: 263301
Here's an approach to the request to "parse" the header text:
metadata <-
readLines(url("https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/lowestoftdata.txt"), n=9)
> metadata
[1] "Lowestoft / Lowestoft Monckton Ave from Sept 2007"
[2] "Location 654300E 294600N 25m amsl to July 2007 "
[3] "& from Sept 2007 653000E 293800N, Lat 52.483 Lon 1.727, 18m amsl"
[4] "Estimated data is marked with a * after the value."
[5] "Missing data (more than 2 days missing in month) is marked by ---."
[6] "Sunshine data taken from an automatic Kipp & Zonen sensor marked with a #, otherwise sunshine data taken from a Campbell Stokes recorder."
[7] " yyyy mm tmax tmin af rain sun"
[8] " degC degC days mm hours"
> sub( "Location (\\d+[EW]) (\\d+[NS])(.+$)", "\\1,\\2", metadata[2])
[1] "654300E,294600N"
I'll need to apply a "ruler" to the data to get the locations and widths for the read.fwf approach.
> paste( rep("123456789",6), 1:6, collapse="", sep="")
[1] "123456789112345678921234567893123456789412345678951234567896"
> metadata[9]
[1] " 1914 1 5.2 0.7 --- 52.0 ---"
Here are the results as character. You will need to do some further processing to get rid of the asterisks before using as.numeric; I illustrate it for one column below. You can probably do something to build nicer column names from metadata[7] (a sketch follows the output below).
widths=c(3,4,4,7,8,7,10,7)
dat=read.fwf( "https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/lowestoftdata.txt", widths = widths , skip=8, colClasses="character", header=FALSE)
Warning message:
In readLines(file, n = thisblock) :
incomplete final line found on 'https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/lowestoftdata.txt'
tail(dat)
#---------------------
V1 V2 V3 V4 V5 V6 V7 V8
1269 2020 9 19.6 * 11.5 * 0* 97.1* 168.6
1270 2020 10 14.2 * 9.0 * 0* 85.7* 58.8
1271 2020 11 12.5 * 6.1 * 0* 31.9* 73.7
1272 2020 12 7.7 * 2.9 * 6* 105.8* 50.5
1273 2021 1 5.8 * 1.2 * 1 0* 78.6* 49.4
1274 2021 2 7.9 * 2.4 * 9* 48.6* 84.7
#----------------
head(dat)
V1 V2 V3 V4 V5 V6 V7 V8
1 1914 1 5.2 0.7 --- 52.0 ---
2 1914 2 9.2 3.5 --- 28.0 ---
3 1914 3 --- --- --- --- ---
4 1914 4 12.9 5.3 --- 18.0 ---
5 1914 5 13.7 7.2 --- 38.0 ---
6 1914 6 16.2 10.4 --- 38.0 ---
summary(as.numeric(sub("[*]","", dat$V8)))
#--------------------
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   11.0    70.3   136.3   136.1   189.9   314.4     157 
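A sketch for tidier column names, built from the header line read earlier; "blank" is just a placeholder name for the empty leading column that these widths produce:
nms <- strsplit( trimws(metadata[7]), "\\s+" )[[1]]   ## "yyyy" "mm" "tmax" "tmin" "af" "rain" "sun"
names(dat) <- c( "blank", nms )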
There is also ?readr::read_fwf and it has some advantages. For one, it lets you specify the fwf using positions rather than widths. I find that easier, especially if you use my makeshift "ruler".
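A sketch of that route; the start/end positions below are just the cumulative widths from above with the blank leading column dropped, so check them against the ruler before relying on them:
library(readr)
dat2 <- read_fwf( "https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/lowestoftdata.txt",
                  col_positions = fwf_positions( start = c(4, 8, 12, 19, 27, 34, 44),
                                                 end   = c(7, 11, 18, 26, 33, 43, 50),
                                                 col_names = c("yyyy","mm","tmax","tmin","af","rain","sun") ),
                  skip = 8, col_types = cols(.default = col_character()) )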
Upvotes: 0
Reputation: 5419
Here is another go:
A bunch of different things were necessary to clean this up.
First of all, handling the two-line header (those are always a pain). There might be simpler solutions for this, but at some point you just need to get the job done.
I merged the two lines into one, and used those slightly longer texts as the column headers.
The cleanup step before the data read is somewhat cryptic, but it strips anything from the end of each line that isn't a number, a dash or a star. (That trims the text comments which otherwise mess up the field parsing with fread, which is blazingly fast.)
library(data.table)
library(purrr)     ## for pmap_chr(); purrr also re-exports the %>% pipe used below
library(readr)     ## for read_file()
library(stringr)   ## for str_match() and str_trim()

raw.text <- read_file("https://www.metoffice.gov.uk/pub/data/weather/uk/climate/stationdata/lowestoftdata.txt")

## latitude / longitude, wherever they sit in the header ("-?" allows westerly longitudes at other stations)
lat.long <- as.numeric( str_match( raw.text, "Lat (\\d+\\.\\d+) Lon (-?\\d+\\.\\d+)" )[,-1] )

## locate the two-line column header, from "yyyy" through "hours"
m <- regexpr( " +yyyy.*hours", raw.text )
headertext <- substr( raw.text, m, m+attr(m,"match.length")-1 )
header.lines <- strsplit( headertext, "\\r?\\n" )[[1]]
header.lines <- sub( "^\\s+", "", header.lines )
header.fields2 <- strsplit( header.lines, "\\s+" )
## the units line has nothing under yyyy and mm, so pad it before pasting the two lines together
header.fields2[[2]] <- c( "", "", header.fields2[[2]] )
header.fields <- pmap_chr( header.fields2, paste, collapse=" " ) %>% str_trim
## some cleanup:
text.to.read <- substring( raw.text, m+attr(m,"match.length") )
## This next line matches anything that is not a digit (\\d), not a dash (\\-) and not a star (\\*) up to the end of the line, $. It's the enclosing (?m: ... ) that makes $ match the end of each line rather than the end of the string as usual.
text.to.read2 <- gsub( "(?m:([^\\d\\-\\*]*)$)", "", text.to.read, perl=TRUE )
## by now a simple fread will do the rest for us
d <- fread( text=text.to.read2, fill=TRUE, header=FALSE, na.strings="---" )
setnames(d, header.fields)
d
d
Output:
      yyyy mm tmax degC tmin degC af days rain mm sun hours
   1: 1914  1       5.2       0.7    <NA>    52.0      <NA>
   2: 1914  2       9.2       3.5    <NA>    28.0      <NA>
   3: 1914  3      <NA>      <NA>    <NA>    <NA>      <NA>
   4: 1914  4      12.9       5.3    <NA>    18.0      <NA>
   5: 1914  5      13.7       7.2    <NA>    38.0      <NA>
  ---
1270: 2020 10     14.2*      9.0*      0*   85.7*     58.8*
1271: 2020 11     12.5*      6.1*      0*   31.9*     73.7*
1272: 2020 12      7.7*      2.9*      6*  105.8*     50.5*
1273: 2021  1      5.8*      1.2*     10*   78.6*     49.4*
1274: 2021  2      7.9*      2.4*      9*   48.6*     84.7*
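If the value columns are wanted as numbers, a further sketch (assuming the column names set with setnames() above) strips the estimate/sensor flags before converting:
value.cols <- setdiff( names(d), c("yyyy", "mm") )
d[ , (value.cols) := lapply( .SD, function(x) as.numeric( gsub("[*#]", "", x) ) ), .SDcols=value.cols ]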
Upvotes: 2