user2096665
user2096665

Reputation: 123

Extract embedded tables

I have a series of csv files that I am trying to analyse. The files have this structure.

"","x"
"1","<HTML>"
"2","<TITLE>Radiosonde Data</TITLE>"
"3","<LINK REL=""StyleSheet"" HREF=""/resources/select.css"" TYPE=""text/css"">"
"4","<BODY BGCOLOR=""white"">"
"5","<H2>15420 LRBS Bucuresti Observations at 00Z 01 Jan 1980</H2>"
"6","<PRE>"
"7","-----------------------------------------------------------------------------"
"8","   PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV"
"9","    hPa     m      C      C      %    g/kg    deg   knot     K      K      K "
"10","-----------------------------------------------------------------------------"
"11"," 1000.0    -55                                                               "
"12","  981.0     91    1.4    1.4    100   4.34     45     26  276.1  288.0  276.8"
"13","  850.0   1257    3.6    3.6    100   5.86    155     28  289.9  306.9  290.9"
"14","  700.0   2809   -5.1   -5.1    100   3.76    170     44  296.8  308.2  297.5"
"15","  500.0   5350  -24.9  -25.7     93   0.95    180     36  302.6  305.8  302.8"
"16","  400.0   6940  -36.0  -36.9     91   0.41    195     36  308.1  309.6  308.2"
"17","  300.0   8870  -51.5  -52.8     86   0.10    200     46  312.6  313.0  312.7"
"18","  250.0  10050  -60.3  -61.7     83   0.04    205     66  316.3  316.4  316.3"
"19","  238.0  10345  -61.5  -62.9     83   0.03    200     80  319.0  319.1  319.0"
"20","  200.0  11390  -65.7  -67.3     80   0.02    215     60  328.6  328.7  328.6"
"21","  150.0  13170  -60.3  -61.9     81   0.06    225     56  366.0  366.3  366.0"
"22","  100.0  15680  -62.1  -63.8     80   0.07    235     48  407.5  407.9  407.5""
"23","</PRE><H3>Station info</H3><PRE>"
"24","                         Station ID: LRBS"

The aim is to extract the embedded tables (the data between lines 11-22 in the above example) as separate data frames with 11 columns. So far have I have written a script that reads all the text lines (using readLines) and then identifies the start and the end of each table. But in this way the data manipulation is not very easy. I am wondering what will be a more efficient way of extracting the tables.

Thank you very much for all your help.

Upvotes: 0

Views: 68

Answers (1)

Spacedman
Spacedman

Reputation: 94277

Given a file called ick.csv in that format:

Read into two columns - ignore all quotes:

txt = read.table("ick.csv",head=FALSE,quote='',sep=',',skip=1, as.is=TRUE)

Now where does the data start? Look for a load of minus signs in the second column:

headlines = grep("------",txt$V2)

There should be two of those. The header labels are on the line after the first one, and the data starts after the second one, and we want all rows up to two before the end. You may need to tweak this if your files have other stuff at the end:

txt = txt[c(headlines[1]+1, (headlines[2]+1):(nrow(txt)-2)),]

Convert the names into a vector:

names = scan(textConnection(gsub('"',"",txt[1,2])),what="")

Now paste the text into one big text string, and get rid of the quotes:

txt = paste0(txt$V2,sep="",collapse="\n")
txt = gsub('"','',txt)

Now treat that as a fixed-width data, and give it names:

d = read.fwf(textConnection(txt),widths=rep(7,11),skip=2)
names(d) = names

Tada!

> d
   PRES  HGHT  TEMP  DWPT RELH MIXR DRCT SKNT  THTA  THTE  THTV
1   981    91   1.4   1.4  100 4.34   45   26 276.1 288.0 276.8
2   850  1257   3.6   3.6  100 5.86  155   28 289.9 306.9 290.9
3   700  2809  -5.1  -5.1  100 3.76  170   44 296.8 308.2 297.5
4   500  5350 -24.9 -25.7   93 0.95  180   36 302.6 305.8 302.8
5   400  6940 -36.0 -36.9   91 0.41  195   36 308.1 309.6 308.2
6   300  8870 -51.5 -52.8   86 0.10  200   46 312.6 313.0 312.7
7   250 10050 -60.3 -61.7   83 0.04  205   66 316.3 316.4 316.3
8   238 10345 -61.5 -62.9   83 0.03  200   80 319.0 319.1 319.0
9   200 11390 -65.7 -67.3   80 0.02  215   60 328.6 328.7 328.6
10  150 13170 -60.3 -61.9   81 0.06  225   56 366.0 366.3 366.0
11  100 15680 -62.1 -63.8   80 0.07  235   48 407.5 407.9 407.5

Upvotes: 1

Related Questions