Reputation: 586
I am trying to "import" data from a non-data.frame shape text file with multiple precipitation rates reports. The reports are all equal, a sample of one is the following:
I D E A M - INSTITUTO DE HIDROLOGIA, METEOROLOGIA Y ESTUDIOS AMBIENTALES
INFORMATION SYSTEM
PRECIPITATION TOTAL VALUES (mms) NATIONAL ENVIRONMENTAL
DATE OF PROCESS : 2015/09/15 YEAR 1980 STATION ID : 11010010 VUELTA LA
LAT 0527 N TIPO EST PM STATE CHOCO INSTALLATION DATE 1943-ENE
LON 7632 W ENT 01 IDEAM CITY LLORO FECHA-SUSPENSION
ELE 100 m.s.n.m REGIONAL 01 ANTIOQUIA CORRIENTE ANDAGUEDA
DAY JAN * FEB * MAR * APR * MAY * JUN * JUL * AGO * SEP * OCT * NOV * DEC *
01 30.0 .0 .0 3.0 80.0 .0 3.0 .0 35.0 88.0 1.0
02 .0 1.0 .0 1.0 100.0 .0 .0 6.0 1.0 65.0 69.0
03 35.0 100.0 .0 10.0 .0 .0 .0 70.0 40.0 42.0 16.0
04 .0 .0 80.0 3.0 140.0 8.0 .0 135.0 20.0 48.0 15.0
05 .0 .0 .0 8.0 3.0 20.0 4.0 19.0 80.0 .0 20.0
06 .0 .0 100.0 138.0 .0 6.0 .0 4.0 20.0 .0 10.0
07 31.0 10.0 .0 30.0 15.0 50.0 6.0 .0 4.0 .0 .0
08 .0 44.0 .0 10.0 40.0 .0 .0 .0 7.0 .0 4.0
09 35.0 3.0 23.0 .0 20.0 140.0 .0 6.0 .0 32.0 16.0
10 .0 75.0 .0 .0 60.0 .0 .0 23.0 3.0 1.0 5.0
11 .0 17.0 .0 15.0 80.0 .0 .0 80.0 .0 .0 3.0
12 .0 75.0 .0 8.0 .0 63.0 10.0 .0 .0 17.0 10.0
13 .0 20.0 .0 60.0 .0 .0 .0 110.0 50.0 3.0 25.0
14 55.0 .0 26.0 12.0 .0 3.0 140.0 4.0 74.0 .0 38.0
15 .0 .0 3.0 7.0 10.0 .0 6.0 .0 35.0 12.0 27.0
16 .0 4.0 89.0 20.0 3.0 .0 .0 10.0 .0 .0 .0
17 45.0 .0 9.0 .0 30.0 .0 2.0 .0 60.0 103.0 .0
18 30.0 .0 .0 .0 21.0 .0 20.0 15.0 .0 .0 .0
19 .0 130.0 .0 10.0 12.0 8.0 .0 3.0 20.0 49.0 40.0
20 45.0 .0 25.0 190.0 .0 38.0 8.0 .0 8.0 3.0 1.0
21 1.0 .0 45.0 50.0 .0 35.0 .0 2.0 13.0 1.0 4.0
22 .0 .0 20.0 .0 .0 .0 .0 16.0 10.0 12.0 50.0
23 40.0 .0 40.0 16.0 .0 30.0 .0 13.0 2.0 106.0 10.0
24 .0 .0 45.0 60.0 .0 3.0 .0 25.0 .0 16.0 .0
25 .0 .0 .0 .0 18.0 10.0 .0 3.0 .0 50.0 20.0
26 10.0 .0 .0 .0 9.0 6.0 20.0 20.0 6.0 15.0 3.0
27 .0 135.0 60.0 40.0 80.0 15.0 .0 18.0 10.0 77.0 .0
28 10.0 .0 9.0 15.0 .0 .0 .0 6.0 72.0 102.0 .0
29 23.0 6.0 .0 .0 .0 .0 .0 23.0 .0 34.0 .0
30 .0 10.0 .0 20.0 3.0 .0 64.0 14.0 111.0 .0
31 .0 31.0 10.0 .0 .0 .0
*** ANNUAL VALUES ***
TOTAL 6954.0
No DE RAIN DAYS 210
MAX 24 Hrs 190.0
The text file includes one report after the other, all with the same header "I D E A M - INSTITUTO DE HIDROLOGIA, METEOROLOGIA Y ESTUDIOS AMBIENTALES"
. I have already "read" the text file using the readLines()
function and I was hoping to create a data frame with the information of each report, something like this:
DATE STATION_ID LAT LON ELE CITY STATE PRECIPITATION
01/JAN/1980 11010010 0527 N 7632 W 100 LLORO CHOCO 0
I have been trying split each report and then start to parse each line. Unfortunately is a slow process. I understand this page looks for delimited questions, but I am kind of stuck.
Thanks in advance.
Upvotes: 1
Views: 2156
Reputation: 460
Here's one way to do it.
readLines()
to read in the full page, 56 lines.substr()
cbind
that with the header information.14 + dayOfMonth
, horizontal offset can be a vector with 12 numbers, one for each month. Add that column to your page.If you rbind
each page as you go through, you will end up with a long (!) tidy dataset. [edit] You will also spend an eternity as memory is managed if your dataset is large. Instead you can create a list of dataframes and bind them all at the end. See this question and this question for more information.
Here is some code I came up with: you can test it on a short extract first.
library("lubridate")
raw2page <- function(rawdata) {
# Takes a vector of chars, one page of data, returns a tidy dataframe
# Template for the page header
yearbound <- c(5,60,63)
stationbound <- c(5,105,112)
latbound <- c(7,16,19)
longbound <- c(8,16,19)
deptobound <- c(7,81,101)
municipiobound <- c(8,81,101)
framebounds <- rbind(yearbound,stationbound,latbound,longbound,deptobound,municipiobound)
colnames(framebounds) <- c("line","start","end")
framebounds <- as.data.frame(framebounds)
framedata <- data.frame()
framedata <- as.data.frame(rbind(with(framebounds, substr(rawdata[line],start,end))))
colnames(framedata) <- c("year","station","latitude","longitude","depto","municipio")
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
framedata$depto <- trim(framedata$depto)
framedata$municipio <- trim(framedata$municipio)
# Make a column listing all dates of the year
st <- as.Date(paste(framedata[1]$year,"-01-01",sep=""))
en <- as.Date(paste(framedata[1]$year,"-12-31",sep=""))
date <- seq(as.Date(st),as.Date(en), by=1)
pagedata <- cbind(framedata,date)
# horizontal offsets for the last digit of each month (the last digit is aligned)
mboundaries<-c(25,34,43,52,61,70,79,88,97,106,115,124)
# now we can take the dates we generated before and use these coordinates to read the rainfall amount into a vector
rainfall <- as.numeric(substr(rawdata[14+mday(pagedata$date)],mboundaries[month(pagedata$date)]-6,mboundaries[month(pagedata$date)] ))
# and bind the vector to the page data to make a tidy data set
page <- cbind(pagedata,rainfall)
page
}
raw <- readLines("area1.txt") # read in all the data
# Get all the page header line numbers
headers <- as.data.frame(grep("HIDROLOGIA", raw))
colnames(headers) <- c("linenum")
listOfDataFrames <- vector(mode = "list", length = nrow(headers))
# page by page, append onto the list
output <- data.frame()
for (i in 1:nrow(headers)) {
start <- headers[i,]
end <- start + 56
listOfDataFrames[[i]] <- raw2page(raw[start:end])
}
library("plyr")
output <- rbind.fill(listOfDataFrames)
print(summary(output))
Upvotes: 3