Extract embedded tables

Question

I have a series of csv files that I am trying to analyse. The files have this structure.

"","x" "1","" "2","Radiosonde Data" "3","" "4","" "5","15420 LRBS Bucuresti Observations at 00Z 01 Jan 1980" "6","" "7","-----------------------------------------------------------------------------" "8"," PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV" "9"," hPa m C C % g/kg deg knot K K K " "10","-----------------------------------------------------------------------------" "11"," 1000.0 -55 " "12"," 981.0 91 1.4 1.4 100 4.34 45 26 276.1 288.0 276.8" "13"," 850.0 1257 3.6 3.6 100 5.86 155 28 289.9 306.9 290.9" "14"," 700.0 2809 -5.1 -5.1 100 3.76 170 44 296.8 308.2 297.5" "15"," 500.0 5350 -24.9 -25.7 93 0.95 180 36 302.6 305.8 302.8" "16"," 400.0 6940 -36.0 -36.9 91 0.41 195 36 308.1 309.6 308.2" "17"," 300.0 8870 -51.5 -52.8 86 0.10 200 46 312.6 313.0 312.7" "18"," 250.0 10050 -60.3 -61.7 83 0.04 205 66 316.3 316.4 316.3" "19"," 238.0 10345 -61.5 -62.9 83 0.03 200 80 319.0 319.1 319.0" "20"," 200.0 11390 -65.7 -67.3 80 0.02 215 60 328.6 328.7 328.6" "21"," 150.0 13170 -60.3 -61.9 81 0.06 225 56 366.0 366.3 366.0" "22"," 100.0 15680 -62.1 -63.8 80 0.07 235 48 407.5 407.9 407.5"" "23"," Station info

"
"24","                         Station ID: LRBS"

The aim is to extract the embedded tables (the data between lines 11-22 in the above example) as separate data frames with 11 columns. So far have I have written a script that reads all the text lines (using readLines) and then identifies the start and the end of each table. But in this way the data manipulation is not very easy. I am wondering what will be a more efficient way of extracting the tables.

Thank you very much for all your help.

Spacedman · Accepted Answer

Given a file called ick.csv in that format:

Read into two columns - ignore all quotes:

txt = read.table("ick.csv",head=FALSE,quote='',sep=',',skip=1, as.is=TRUE)

Now where does the data start? Look for a load of minus signs in the second column:

headlines = grep("------",txt$V2)

There should be two of those. The header labels are on the line after the first one, and the data starts after the second one, and we want all rows up to two before the end. You may need to tweak this if your files have other stuff at the end:

txt = txt[c(headlines[1]+1, (headlines[2]+1):(nrow(txt)-2)),]

Convert the names into a vector:

names = scan(textConnection(gsub('"',"",txt[1,2])),what="")

Now paste the text into one big text string, and get rid of the quotes:

txt = paste0(txt$V2,sep="",collapse="
")
txt = gsub('"','',txt)

Now treat that as a fixed-width data, and give it names:

d = read.fwf(textConnection(txt),widths=rep(7,11),skip=2)
names(d) = names

Tada!

> d
   PRES  HGHT  TEMP  DWPT RELH MIXR DRCT SKNT  THTA  THTE  THTV
1   981    91   1.4   1.4  100 4.34   45   26 276.1 288.0 276.8
2   850  1257   3.6   3.6  100 5.86  155   28 289.9 306.9 290.9
3   700  2809  -5.1  -5.1  100 3.76  170   44 296.8 308.2 297.5
4   500  5350 -24.9 -25.7   93 0.95  180   36 302.6 305.8 302.8
5   400  6940 -36.0 -36.9   91 0.41  195   36 308.1 309.6 308.2
6   300  8870 -51.5 -52.8   86 0.10  200   46 312.6 313.0 312.7
7   250 10050 -60.3 -61.7   83 0.04  205   66 316.3 316.4 316.3
8   238 10345 -61.5 -62.9   83 0.03  200   80 319.0 319.1 319.0
9   200 11390 -65.7 -67.3   80 0.02  215   60 328.6 328.7 328.6
10  150 13170 -60.3 -61.9   81 0.06  225   56 366.0 366.3 366.0
11  100 15680 -62.1 -63.8   80 0.07  235   48 407.5 407.9 407.5

Extract embedded tables

Answers (1)

Related Questions