Reputation: 86
I am trying to merge >30,000 individual .dat files of climate data across the UK (with sequential names of met*****.dat) into one single data file that I can then extract annual values from. The data is formatted such that the first row in each file contains the lat long for that cell and then underneath that are the climate variables measured:
54.78 -7.06
iGrid x y year month SRad Tmin Tmax vp wind rain sunshine
19028 56 139 1971 1 2.22 1.25 6.43 7.29 12.33 133.47 1.72
19028 56 139 1971 2 3.94 1.26 6.74 7.36 11.11 106.64 1.85
19028 56 139 1971 3 7.30 0.95 7.56 7.05 12.77 57.26 2.57
19028 56 139 1971 4 12.09 2.29 10.07 7.71 11.02 112.81 4.24
19028 56 139 1971 5 15.51 4.85 13.07 9.20 13.94 103.08 5.02
19028 56 139 1971 6 16.18 6.36 14.09 9.98 12.44 115.03 4.76
19028 56 139 1971 7 15.59 9.25 17.56 12.83 10.45 107.18 4.73
19028 56 139 1971 8 11.92 9.20 15.98 12.70 11.39 136.40 3.35
19028 56 139 1971 9 9.69 7.85 16.31 12.72 12.24 41.76 4.11
19028 56 139 1971 10 5.43 6.96 12.88 10.56 16.55 99.72 2.90
19028 56 139 1971 11 2.55 2.78 8.36 8.39 17.98 195.24 1.56
19028 56 139 1971 12 1.51 3.29 8.02 8.68 16.74 54.63 0.89
19028 56 139 1972 1 1.96 -0.30 4.79 6.78 16.55 168.39 1.07
19028 56 139 1972 2 4.23 0.29 5.78 6.97 15.97 114.47 2.33
19028 56 139 1972 3 7.80 0.53 7.54 7.71 14.02 165.09 3.14
19028 56 139 1972 4 12.51 2.80 10.32 8.26 13.39 116.30 4.63
19028 56 139 1972 5 14.49 4.46 11.50 8.98 15.54 187.31 4.17
19028 56 139 1972 6 16.95 5.36 12.90 9.50 12.45 123.17 5.39
19028 56 139 1972 7 15.35 8.61 16.81 12.63 7.41 105.49 4.53
19028 56 139 1972 8 11.40 8.04 15.17 11.66 11.85 86.44 2.88
19028 56 139 1972 9 9.60 5.54 14.03 10.19 9.96 24.70 4.01
19028 56 139 1972 10 5.29 6.04 12.06 10.27 12.92 79.35 2.70
19028 56 139 1972 11 2.49 1.74 7.09 8.02 14.79 169.47 1.43
19028 56 139 1972 12 1.55 1.80 7.45 8.00 17.66 153.31 1.01
19028 56 139 1973 1 1.73 2.16 6.32 7.98 15.25 202.89 0.52
19028 56 139 1973 2 4.08 0.27 5.47 7.17 14.71 126.04 2.07
19028 56 139 1973 3 8.05 0.90 8.30 7.24 13.64 49.84 3.43
19028 56 139 1973 4 12.54 1.98 9.05 7.27 13.02 86.45 4.66
19028 56 139 1973 5 14.55 4.71 12.37 9.41 13.18 102.20 4.22
19028 56 139 1973 6 16.05 7.42 16.17 11.17 10.93 64.45 4.65
19028 56 139 1973 7 13.72 9.66 16.39 12.82 8.16 111.27 3.18
19028 56 139 1973 8 12.23 9.62 17.07 12.93 10.37 97.53 3.62
19028 56 139 1973 9 9.11 7.95 15.02 11.65 12.63 116.15 3.49
19028 56 139 1973 10 5.23 4.51 10.78 9.45 10.27 71.68 2.61
19028 56 139 1973 11 2.37 1.84 7.63 7.91 14.48 162.53 1.17
19028 56 139 1973 12 1.56 1.19 6.79 7.72 14.65 174.89 1.03
Reading previous answers (This and this) has helped me arrive at this code:
library(data.table)
filenames <- list.files(path="/mydirectory",
pattern = ".dat", full.names=T)
temp <- lapply(filenames, fread, sep=",")
data <- rbindlist(temp)
write.table(data,"metcombi.dat", sep=",")
Which spits out a huge .dat file (the input files occupy ~2.5GB) but at the end of each file it repeats the column headings and erases the lat long values.
1) How do I stop it from repeating the column headers for each file I'm merging? 2) How can I extract the lat long values from the first row in each file and move them to new columns?
Thank you!
Upvotes: 2
Views: 1084
Reputation: 107567
Consider expanding your lapply()
to do a quick read of first line, read in .dat file into dataframe, and then split lat/long first lines to columns:
filenames <- list.files(path="/mydirectory",
pattern=".dat", full.names=TRUE)
dfList <- lapply(filenames, function(f){
# FIRST LINES
con <- file(description=f, open="r")
latlong <- readLines(con, warn=FALSE)[1]
close(con)
# DATA FRAME
df <- read.table(f, skip=1, header=TRUE)
df$lat <- strsplit(latlong, "\\s+")[[1]][1]
df$lon <- strsplit(latlong, "\\s+")[[1]][2]
return(df)
})
data <- rbindlist(dfList)
write.csv(data, "metcombi.dat")
Upvotes: 2
Reputation: 4960
From what I understand of the problem above, I would suggest parsing the files in two phases:
First, grab the first line from each file, e.g.:
headers <- data.frame()
for (filename in filenames) {
fp = open(filename)
headers <- rbind(headers, readLines(fp, n=1))
close(fp)
}
Next, use your same code above to loop over the files and grab the main data from each of them, but skip the header line using something like header=TRUE
or skip=1
:
...
temp <- lapply(filenames, fread, sep=",", header=TRUE)
data <- rbindlist(temp)
The only thing left to do is to add the lat,long information back in from the first loop. I'm not sure what the format is, but this may be as simple as a cbind()
call on the two data.frames.
Upvotes: 0