topcat
topcat

Reputation: 586

Read an irregular text data file into R

I am trying to "import" data from a non-data.frame shape text file with multiple precipitation rates reports. The reports are all equal, a sample of one is the following:

  I D E A M  -  INSTITUTO DE HIDROLOGIA, METEOROLOGIA Y ESTUDIOS AMBIENTALES
                                                                                                          INFORMATION SYSTEM
                                  PRECIPITATION TOTAL VALUES (mms)                              NATIONAL ENVIRONMENTAL 

DATE OF PROCESS :  2015/09/15                    YEAR  1980                              STATION ID : 11010010  VUELTA LA

LAT    0527 N               TIPO EST    PM                   STATE      CHOCO                   INSTALLATION DATE   1943-ENE
LON   7632 W               ENT     01  IDEAM            CITY  LLORO                   FECHA-SUSPENSION
ELE   100 m.s.n.m         REGIONAL    01  ANTIOQUIA        CORRIENTE  ANDAGUEDA


      DAY       JAN *  FEB *  MAR *  APR *  MAY  *  JUN *  JUL *  AGO *  SEP *  OCT *  NOV *  DEC *


       01                 30.0       .0       .0      3.0     80.0       .0      3.0       .0     35.0     88.0      1.0
       02                   .0      1.0       .0      1.0    100.0       .0       .0      6.0      1.0     65.0     69.0
       03                 35.0    100.0       .0     10.0       .0       .0       .0     70.0     40.0     42.0     16.0
       04                   .0       .0     80.0      3.0    140.0      8.0       .0    135.0     20.0     48.0     15.0
       05                   .0       .0       .0      8.0      3.0     20.0      4.0     19.0     80.0       .0     20.0
       06                   .0       .0    100.0    138.0       .0      6.0       .0      4.0     20.0       .0     10.0
       07                 31.0     10.0       .0     30.0     15.0     50.0      6.0       .0      4.0       .0       .0
       08                   .0     44.0       .0     10.0     40.0       .0       .0       .0      7.0       .0      4.0
       09                 35.0      3.0     23.0       .0     20.0    140.0       .0      6.0       .0     32.0     16.0
       10                   .0     75.0       .0       .0     60.0       .0       .0     23.0      3.0      1.0      5.0
       11                   .0     17.0       .0     15.0     80.0       .0       .0     80.0       .0       .0      3.0
       12                   .0     75.0       .0      8.0       .0     63.0     10.0       .0       .0     17.0     10.0
       13                   .0     20.0       .0     60.0       .0       .0       .0    110.0     50.0      3.0     25.0
       14                 55.0       .0     26.0     12.0       .0      3.0    140.0      4.0     74.0       .0     38.0
       15                   .0       .0      3.0      7.0     10.0       .0      6.0       .0     35.0     12.0     27.0
       16                   .0      4.0     89.0     20.0      3.0       .0       .0     10.0       .0       .0       .0
       17                 45.0       .0      9.0       .0     30.0       .0      2.0       .0     60.0    103.0       .0
       18                 30.0       .0       .0       .0     21.0       .0     20.0     15.0       .0       .0       .0
       19                   .0    130.0       .0     10.0     12.0      8.0       .0      3.0     20.0     49.0     40.0
       20                 45.0       .0     25.0    190.0       .0     38.0      8.0       .0      8.0      3.0      1.0
       21                  1.0       .0     45.0     50.0       .0     35.0       .0      2.0     13.0      1.0      4.0
       22                   .0       .0     20.0       .0       .0       .0       .0     16.0     10.0     12.0     50.0
       23                 40.0       .0     40.0     16.0       .0     30.0       .0     13.0      2.0    106.0     10.0
       24                   .0       .0     45.0     60.0       .0      3.0       .0     25.0       .0     16.0       .0
       25                   .0       .0       .0       .0     18.0     10.0       .0      3.0       .0     50.0     20.0
       26                 10.0       .0       .0       .0      9.0      6.0     20.0     20.0      6.0     15.0      3.0
       27                   .0    135.0     60.0     40.0     80.0     15.0       .0     18.0     10.0     77.0       .0
       28                 10.0       .0      9.0     15.0       .0       .0       .0      6.0     72.0    102.0       .0
       29                 23.0      6.0       .0       .0       .0       .0       .0     23.0       .0     34.0       .0
       30                            .0     10.0       .0     20.0      3.0       .0     64.0     14.0    111.0       .0
       31                            .0              31.0              10.0       .0                .0                .0


                                  ***  ANNUAL VALUES  ***

                                 TOTAL                  6954.0
                                 No DE RAIN DAYS         210
                                 MAX 24 Hrs        190.0

The text file includes one report after the other, all with the same header "I D E A M - INSTITUTO DE HIDROLOGIA, METEOROLOGIA Y ESTUDIOS AMBIENTALES". I have already "read" the text file using the readLines()function and I was hoping to create a data frame with the information of each report, something like this:

DATE        STATION_ID  LAT    LON    ELE CITY STATE PRECIPITATION
01/JAN/1980 11010010    0527 N 7632 W 100 LLORO CHOCO 0

I have been trying split each report and then start to parse each line. Unfortunately is a slow process. I understand this page looks for delimited questions, but I am kind of stuck.

Thanks in advance.

Upvotes: 1

Views: 2156

Answers (1)

Derwin McGeary
Derwin McGeary

Reputation: 460

Here's one way to do it.

  1. Use readLines() to read in the full page, 56 lines.
  2. Determine the information from the header by knowing the line numbers and positions in the line for latitude, longitude, elevation, city, state, and year. Use substr()
  3. Using the year obtained there, write out all the dates of that year. cbind that with the header information.
  4. Use a function that takes the day of month and month number, and locates the corresponding precipitation on the page. Line number is 14 + dayOfMonth, horizontal offset can be a vector with 12 numbers, one for each month. Add that column to your page.

If you rbind each page as you go through, you will end up with a long (!) tidy dataset. [edit] You will also spend an eternity as memory is managed if your dataset is large. Instead you can create a list of dataframes and bind them all at the end. See this question and this question for more information.

Here is some code I came up with: you can test it on a short extract first.

library("lubridate")
raw2page <- function(rawdata) {
# Takes a vector of chars, one page of data, returns a tidy dataframe
# Template for the page header
yearbound <- c(5,60,63)
stationbound <- c(5,105,112)
latbound <- c(7,16,19)
longbound <- c(8,16,19)
deptobound <- c(7,81,101)
municipiobound <- c(8,81,101)

framebounds <- rbind(yearbound,stationbound,latbound,longbound,deptobound,municipiobound)
colnames(framebounds) <- c("line","start","end")
framebounds <- as.data.frame(framebounds)

framedata <- data.frame()
framedata <- as.data.frame(rbind(with(framebounds, substr(rawdata[line],start,end))))
colnames(framedata) <- c("year","station","latitude","longitude","depto","municipio")
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
framedata$depto <- trim(framedata$depto)
framedata$municipio <- trim(framedata$municipio)

# Make a column listing all dates of the year
st <- as.Date(paste(framedata[1]$year,"-01-01",sep=""))
en <- as.Date(paste(framedata[1]$year,"-12-31",sep=""))
date <- seq(as.Date(st),as.Date(en), by=1)
pagedata <- cbind(framedata,date)

# horizontal offsets for the last digit of each month (the last digit is aligned)
mboundaries<-c(25,34,43,52,61,70,79,88,97,106,115,124)
# now we can take the dates we generated before and use these coordinates to read the rainfall amount into a vector
rainfall <- as.numeric(substr(rawdata[14+mday(pagedata$date)],mboundaries[month(pagedata$date)]-6,mboundaries[month(pagedata$date)] ))
# and bind the vector to the page data to make a tidy data set 
page <- cbind(pagedata,rainfall)
page
}

raw <- readLines("area1.txt") # read in all the data

# Get all the page header line numbers
headers <- as.data.frame(grep("HIDROLOGIA", raw))
colnames(headers) <- c("linenum")

listOfDataFrames <- vector(mode = "list", length = nrow(headers))

# page by page, append onto the list
output <- data.frame()
for (i in 1:nrow(headers)) {
  start <- headers[i,]
  end <- start + 56
  listOfDataFrames[[i]] <- raw2page(raw[start:end])
      }
library("plyr")
output <- rbind.fill(listOfDataFrames)
print(summary(output))

Upvotes: 3

Related Questions