maxbre
maxbre

Reputation: 193

R read.table: how to avoid error from column mismatch

I'm sharing here a text file a need to read into R

https://www.filehosting.org/file/details/846022/PLT_PERIOD_ALL_175_P98.PLT

this is what I've done so far

fname<-'PLT_PERIOD_ALL_175_P98.PLT'

rt0<-read.table(fname, skip=8, header = FALSE,
           stringsAsFactors = FALSE)
##Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
##               line 84 did not have 10 elements

rt1<-read.table(fname, skip=8, header = FALSE, 
               stringsAsFactors = FALSE, fill=TRUE)
# this is messing up the last two columns by row 84

rt2<-read.table(fname, skip=8, header = FALSE, 
               flush = TRUE, stringsAsFactors = FALSE, fill=TRUE)
# this is skipping one column

rt3<-read.table(fname, skip=8, header = FALSE, sep='\t',
               flush = TRUE, stringsAsFactors = FALSE, fill=TRUE)
# this is reading all in one column

Upvotes: 0

Views: 799

Answers (2)

maxbre
maxbre

Reputation: 193

here some of my many failing attempts to read first two columns (just for proceeding with a simpler subset)

> read.fwf(fname, widths = c(12,14), skip = 8, sep=' ', dec='.')[1:5,]
  V1 V2       V3 V4      V5
1 NA NA 739950.4  0 5083641
2 NA NA 739984.3  0 5083591
3 NA NA 740038.2  0 5083581
4 NA NA 740065.7  0 5083576
5 NA NA 740081.8  0 5083575

> read.fwf(fname, widths = c(15,16), skip = 8, dec='.')[1:5,]
        V1      V2
1 739950.4 5083641
2 739984.3 5083591
3 740038.2 5083581
4 740065.7 5083576
5 740081.8 5083575

for some reasons I do not fully understand it seems I can't properly parse the second column (not to mention the others)

Upvotes: 0

IRTFM
IRTFM

Reputation: 263352

There are no separators. So the way to deal with this is to use fwf file logic. There appeared to be 9 lines in the preamble text but one of them was broken with a non-standard linefeed. The 8th line has underscores that define the column locations. The 84th line of data (and subsequent lines as well) has entries in the "NET ID" column that were not present in lines 7-9 above.

*        X             Y      AVERAGE CONC    ZELEV    ZHILL    ZFLAG    AVE     GRP       RANK     NET ID   DATE(CONC)
* ____________  ____________  ____________   ______   ______   ______  ______  ________  ________  ________  ________
  739950.40000 5083641.40000      12.01732   124.31  1564.68     0.00    1-HR  ALL       175TH               19120324
  ---snipped lines------

This is what you see after your read.table( ..., fill=TRUE) operation with the added data in the 10th column with value "CAR1":

83    739277.6 5083387  0.78225 123.46 1564.68  0 1-HR ALL 175TH 19010108
84    739061.1 5082586  0.34595 122.50 1564.70  0 1-HR ALL 175TH     CAR1
85  19052321.0      NA       NA     NA      NA NA                        
86    739161.1 5082586  0.39795 122.00 1564.70  0 1-HR ALL 175TH     CAR1

Here's a trick to create a "ruler" for determining column positions. You can figure out the widths of those underscore columns (including the 2 positions at the beginning and the 3 position gaps using this output:

rt1<-readLines("~/Downloads/PLT_PERIOD_ALL_175_P98.PLT",n=9 )  # first 8 lines
rt2 <- paste0( rep(c(1:9,0),12), collapse="")     # one's place ruler
rt2[2] <- paste0( tail( rep( c(0:9, 0:1) ,each=10),-1 ),collapse="") # ten's place ruler

> rt2
[1] "123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890"
[2] "00000000011111111112222222222333333333344444444445555555555666666666677777777778888888888999999999900000000001111111111" 
> rt1[8]
[1] "* ____________  ____________  ____________   ______   ______   ______  ______  ________  ________  ________  ________"

And then read:

> ?read.fwf

You read in all the data using the widths.

Upvotes: 2

Related Questions