B.Wel
B.Wel

Reputation: 86

R: Merge >1000 .dat files

I am trying to merge >30,000 individual .dat files of climate data across the UK (with sequential names of met*****.dat) into one single data file that I can then extract annual values from. The data is formatted such that the first row in each file contains the lat long for that cell and then underneath that are the climate variables measured:

54.78   -7.06
 iGrid     x      y      year     month   SRad        Tmin       Tmax          vp        wind        rain        sunshine
19028      56     139    1971       1        2.22        1.25        6.43        7.29       12.33      133.47        1.72
19028      56     139    1971       2        3.94        1.26        6.74        7.36       11.11      106.64        1.85
19028      56     139    1971       3        7.30        0.95        7.56        7.05       12.77       57.26        2.57
19028      56     139    1971       4       12.09        2.29       10.07        7.71       11.02      112.81        4.24
19028      56     139    1971       5       15.51        4.85       13.07        9.20       13.94      103.08        5.02
19028      56     139    1971       6       16.18        6.36       14.09        9.98       12.44      115.03        4.76
19028      56     139    1971       7       15.59        9.25       17.56       12.83       10.45      107.18        4.73
19028      56     139    1971       8       11.92        9.20       15.98       12.70       11.39      136.40        3.35
19028      56     139    1971       9        9.69        7.85       16.31       12.72       12.24       41.76        4.11
19028      56     139    1971      10        5.43        6.96       12.88       10.56       16.55       99.72        2.90
19028      56     139    1971      11        2.55        2.78        8.36        8.39       17.98      195.24        1.56
19028      56     139    1971      12        1.51        3.29        8.02        8.68       16.74       54.63        0.89
19028      56     139    1972       1        1.96       -0.30        4.79        6.78       16.55      168.39        1.07
19028      56     139    1972       2        4.23        0.29        5.78        6.97       15.97      114.47        2.33
19028      56     139    1972       3        7.80        0.53        7.54        7.71       14.02      165.09        3.14
19028      56     139    1972       4       12.51        2.80       10.32        8.26       13.39      116.30        4.63
19028      56     139    1972       5       14.49        4.46       11.50        8.98       15.54      187.31        4.17
19028      56     139    1972       6       16.95        5.36       12.90        9.50       12.45      123.17        5.39
19028      56     139    1972       7       15.35        8.61       16.81       12.63        7.41      105.49        4.53
19028      56     139    1972       8       11.40        8.04       15.17       11.66       11.85       86.44        2.88
19028      56     139    1972       9        9.60        5.54       14.03       10.19        9.96       24.70        4.01
19028      56     139    1972      10        5.29        6.04       12.06       10.27       12.92       79.35        2.70
19028      56     139    1972      11        2.49        1.74        7.09        8.02       14.79      169.47        1.43
19028      56     139    1972      12        1.55        1.80        7.45        8.00       17.66      153.31        1.01
19028      56     139    1973       1        1.73        2.16        6.32        7.98       15.25      202.89        0.52
19028      56     139    1973       2        4.08        0.27        5.47        7.17       14.71      126.04        2.07
19028      56     139    1973       3        8.05        0.90        8.30        7.24       13.64       49.84        3.43
19028      56     139    1973       4       12.54        1.98        9.05        7.27       13.02       86.45        4.66
19028      56     139    1973       5       14.55        4.71       12.37        9.41       13.18      102.20        4.22
19028      56     139    1973       6       16.05        7.42       16.17       11.17       10.93       64.45        4.65
19028      56     139    1973       7       13.72        9.66       16.39       12.82        8.16      111.27        3.18
19028      56     139    1973       8       12.23        9.62       17.07       12.93       10.37       97.53        3.62
19028      56     139    1973       9        9.11        7.95       15.02       11.65       12.63      116.15        3.49
19028      56     139    1973      10        5.23        4.51       10.78        9.45       10.27       71.68        2.61
19028      56     139    1973      11        2.37        1.84        7.63        7.91       14.48      162.53        1.17
19028      56     139    1973      12        1.56        1.19        6.79        7.72       14.65      174.89        1.03

Reading previous answers (This and this) has helped me arrive at this code:

library(data.table)

filenames <- list.files(path="/mydirectory",
                    pattern = ".dat", full.names=T)

temp <- lapply(filenames, fread, sep=",")
data <- rbindlist(temp)

write.table(data,"metcombi.dat", sep=",")

Which spits out a huge .dat file (the input files occupy ~2.5GB) but at the end of each file it repeats the column headings and erases the lat long values.

1) How do I stop it from repeating the column headers for each file I'm merging? 2) How can I extract the lat long values from the first row in each file and move them to new columns?

Thank you!

Upvotes: 2

Views: 1084

Answers (2)

Parfait
Parfait

Reputation: 107567

Consider expanding your lapply() to do a quick read of first line, read in .dat file into dataframe, and then split lat/long first lines to columns:

filenames <- list.files(path="/mydirectory",
                        pattern=".dat", full.names=TRUE)

dfList <- lapply(filenames, function(f){
    # FIRST LINES
    con <- file(description=f, open="r")
    latlong <- readLines(con, warn=FALSE)[1]
    close(con)

    # DATA FRAME
    df <- read.table(f, skip=1, header=TRUE)
    df$lat <- strsplit(latlong, "\\s+")[[1]][1]
    df$lon <- strsplit(latlong, "\\s+")[[1]][2]

    return(df)
})

data <- rbindlist(dfList)

write.csv(data, "metcombi.dat")

Upvotes: 2

Keith Hughitt
Keith Hughitt

Reputation: 4960

From what I understand of the problem above, I would suggest parsing the files in two phases:

First, grab the first line from each file, e.g.:

headers <- data.frame()

for (filename in filenames) {
    fp = open(filename)
    headers <- rbind(headers, readLines(fp, n=1))
    close(fp)
}

Next, use your same code above to loop over the files and grab the main data from each of them, but skip the header line using something like header=TRUE or skip=1:

...
temp <- lapply(filenames, fread, sep=",", header=TRUE)
data <- rbindlist(temp)

The only thing left to do is to add the lat,long information back in from the first loop. I'm not sure what the format is, but this may be as simple as a cbind() call on the two data.frames.

Upvotes: 0

Related Questions