Reputation: 123
I have a series of csv files that I am trying to analyse. The files have this structure.
"","x"
"1","<HTML>"
"2","<TITLE>Radiosonde Data</TITLE>"
"3","<LINK REL=""StyleSheet"" HREF=""/resources/select.css"" TYPE=""text/css"">"
"4","<BODY BGCOLOR=""white"">"
"5","<H2>15420 LRBS Bucuresti Observations at 00Z 01 Jan 1980</H2>"
"6","<PRE>"
"7","-----------------------------------------------------------------------------"
"8"," PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV"
"9"," hPa m C C % g/kg deg knot K K K "
"10","-----------------------------------------------------------------------------"
"11"," 1000.0 -55 "
"12"," 981.0 91 1.4 1.4 100 4.34 45 26 276.1 288.0 276.8"
"13"," 850.0 1257 3.6 3.6 100 5.86 155 28 289.9 306.9 290.9"
"14"," 700.0 2809 -5.1 -5.1 100 3.76 170 44 296.8 308.2 297.5"
"15"," 500.0 5350 -24.9 -25.7 93 0.95 180 36 302.6 305.8 302.8"
"16"," 400.0 6940 -36.0 -36.9 91 0.41 195 36 308.1 309.6 308.2"
"17"," 300.0 8870 -51.5 -52.8 86 0.10 200 46 312.6 313.0 312.7"
"18"," 250.0 10050 -60.3 -61.7 83 0.04 205 66 316.3 316.4 316.3"
"19"," 238.0 10345 -61.5 -62.9 83 0.03 200 80 319.0 319.1 319.0"
"20"," 200.0 11390 -65.7 -67.3 80 0.02 215 60 328.6 328.7 328.6"
"21"," 150.0 13170 -60.3 -61.9 81 0.06 225 56 366.0 366.3 366.0"
"22"," 100.0 15680 -62.1 -63.8 80 0.07 235 48 407.5 407.9 407.5""
"23","</PRE><H3>Station info</H3><PRE>"
"24"," Station ID: LRBS"
The aim is to extract the embedded tables (the data between lines 11-22 in the above example) as separate data frames with 11 columns. So far have I have written a script that reads all the text lines (using readLines) and then identifies the start and the end of each table. But in this way the data manipulation is not very easy. I am wondering what will be a more efficient way of extracting the tables.
Thank you very much for all your help.
Upvotes: 0
Views: 68
Reputation: 94277
Given a file called ick.csv
in that format:
Read into two columns - ignore all quotes:
txt = read.table("ick.csv",head=FALSE,quote='',sep=',',skip=1, as.is=TRUE)
Now where does the data start? Look for a load of minus signs in the second column:
headlines = grep("------",txt$V2)
There should be two of those. The header labels are on the line after the first one, and the data starts after the second one, and we want all rows up to two before the end. You may need to tweak this if your files have other stuff at the end:
txt = txt[c(headlines[1]+1, (headlines[2]+1):(nrow(txt)-2)),]
Convert the names into a vector:
names = scan(textConnection(gsub('"',"",txt[1,2])),what="")
Now paste the text into one big text string, and get rid of the quotes:
txt = paste0(txt$V2,sep="",collapse="\n")
txt = gsub('"','',txt)
Now treat that as a fixed-width data, and give it names:
d = read.fwf(textConnection(txt),widths=rep(7,11),skip=2)
names(d) = names
Tada!
> d
PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
1 981 91 1.4 1.4 100 4.34 45 26 276.1 288.0 276.8
2 850 1257 3.6 3.6 100 5.86 155 28 289.9 306.9 290.9
3 700 2809 -5.1 -5.1 100 3.76 170 44 296.8 308.2 297.5
4 500 5350 -24.9 -25.7 93 0.95 180 36 302.6 305.8 302.8
5 400 6940 -36.0 -36.9 91 0.41 195 36 308.1 309.6 308.2
6 300 8870 -51.5 -52.8 86 0.10 200 46 312.6 313.0 312.7
7 250 10050 -60.3 -61.7 83 0.04 205 66 316.3 316.4 316.3
8 238 10345 -61.5 -62.9 83 0.03 200 80 319.0 319.1 319.0
9 200 11390 -65.7 -67.3 80 0.02 215 60 328.6 328.7 328.6
10 150 13170 -60.3 -61.9 81 0.06 225 56 366.0 366.3 366.0
11 100 15680 -62.1 -63.8 80 0.07 235 48 407.5 407.9 407.5
Upvotes: 1