Reputation: 389
I have a text file from NOAA that has a list of weather stations there should he 8 columns, with no header. They are all separated by spaces but in the cases below some have a name like "ABBEVILLE" others "ANDALUSIA 3 W". When I read this into R with read.table I get 10 columns.
USC00010008 31.5702 -85.2482 139.0 AL ABBEVILLE 15 -6
USC00010252 31.3071 -86.5226 76.2 AL ANDALUSIA 3 W 15 -6
precip_stations <- read.table("hpd-stations.txt", sep = "", header = FALSE, fill = T)
> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
>USC00010008 31.5702 -85.2482 139 AL ABBEVILLE 15 -6
>USC00010252 31.3071 -86.5226 76.2 AL ANDALUSIA 3 W 15 -6
Is there a way I can get it to import things like "ANDALUSIA 3 W" in one column? I thought that fill = FALSE might work but if I do that I get an error.
Upvotes: 0
Views: 2950
Reputation: 263332
Reading fixed width format files requires that you first recognize there are no valid separators, then using `read.fwf. That in turn requires that you determine the spacing to calculate the widths:
inp <- "
USC00010008 31.5702 -85.2482 139.0 AL ABBEVILLE 15 -6
USC00010063 34.2553 -87.1814 249.3 AL ADDISON 15 -6"
> cat( paste0( rep( 0:9, 13), collapse=""))
0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
Note that proper text editors may sometimes help with a "ruler". I sometimes make a ruler with:
cat( paste0( rep( c(1:9,0), 7), collapse="")); cat("\n");
cat(paste0( sprintf( "% 10s", 1:7), collapse=""))
#---------------------------------------
#1234567890123456789012345678901234567890123456789012345678901234567890
# 1 2 3 4 5 6 7
In fact, I've just realized I should be making a function that creates an adjustable ruler:
ruler <- function(len=7){ cat( paste0( rep( c(1:9,0), len), collapse=""))
cat("\n")
cat(paste0( sprintf( "% 10s", 1:len), collapse="")) }
> ruler(13) # could not get a proper display of the output in SO
(Now saved to my .Rprofile suite of utility functions.)
Then with a guess at the widths and a couple of adjustments after this error:
> read.fwf(textConnection(inp), widths=c(10, 10,10, 7, 90, 5, 5) )
V1 V2 V3 V4 V5 V6 V7
1 <NA> <NA> NA NA <NA> NA NA
2 USC0001000 8 31.5702 -85.2482 139.0 AL ABBEVILLE 1 5
3 USC0001006 3 34.2553 -87.1814 249.3 AL ADDISON 1 5
I finally got:
myfwf <- readLines("ftp://ftp.ncdc.noaa.gov/pub/data/hpd/auto/v1/beta/hpd-stations.txt" )
> mytbl <- read.fwf(textConnection(myfwf[1:12]), widths=c(11, 10,10, 7, 89, 5, 5) )
> mytbl
V1 V2 V3 V4 V5 V6 V7
1 USC00010008 31.5702 -85.2482 139.0 AL ABBEVILLE 1 5
2 USC00010063 34.2553 -87.1814 249.3 AL ADDISON 1 5
3 USC00010140 32.2322 -87.4104 53.3 AL ALBERTA 1 5
4 USC00010252 31.3071 -86.5226 76.2 AL ANDALUSIA 3 W 1 5
5 USC00010369 33.2941 -85.7788 311.5 AL ASHLAND 3 ENE 1 5
6 USC00010390 34.7752 -86.9508 210.0 AL ATHENS 1 5
7 USC00010402 31.1820 -87.4390 91.4 AL ATMORE 1 5
8 USC00010425 32.5992 -85.4653 166.1 AL AUBURN NO.2 1 5
9 USC00010748 33.6972 -87.6491 157.9 AL BERRY 3 NW 1 5
10 USC00010957 34.2008 -86.1633 326.1 AL BOAZ 1 5
11 USC00011099 34.9786 -85.8008 204.2 AL BRIDGEPORT 5 NW 1 5
12 USC00012124 32.8622 -85.7358 223.4 AL DADEVILLE 2 1 5
Notice that I restricted that to the first 12 lines because line 13 had an octothorpe ("#") which results in premature line termination since that is the R comment character.
USC00012172 30.2505 -88.0775 2.4 AL DAUPHIN ISLAND #2 15 -6
So need comment.char=""
:
> mytbl <- read.fwf(textConnection(myfwf), widths=c(11, 10,10, 7, 89, 5, 5) , comment.char="")
> str(mytbl)
'data.frame': 1920 obs. of 7 variables:
$ V1: Factor w/ 1920 levels "CQC00914080",..: 23 24 25 26 27 28 29 30 31 32 ...
$ V2: num 31.6 34.3 32.2 31.3 33.3 ...
$ V3: num -85.2 -87.2 -87.4 -86.5 -85.8 ...
$ V4: num 139 249.3 53.3 76.2 311.5 ...
$ V5: Factor w/ 1920 levels "AK CENTRAL NO 2 ",..: 8 9 10 11 12 13 14 15 16 17 ...
$ V6: Factor w/ 9 levels " 1","0 1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ V7: Factor w/ 2 levels "5 ","5 -": 1 1 1 1 1 1 1 1 1 1 ...
Upvotes: 2
Reputation: 1257
Ugly, but works:
x <- readLines("hpd-stations.txt")
l <- list()
for (i in 1:length(x)) {
a <- strsplit(x[i], " ")
b <- t(sapply(a, "[", 1:5))
c <- length(a[[1]]) - 5 - 2
d <- t(sapply(a, "[", 6:(5+c)))
e <- t(sapply(a, "[", (5+c+1):length(a[[1]])))
res <- as.data.frame(cbind(b, paste0(d, collapse = " "), e))
l[[i]] <- res
}
data <- dplyr::bind_rows(l)
Gives you
> data
V1 V2 V3 V4 V5 V6 V7 V8
1 USC00010008 31.5702 -85.2482 139 AL ABBEVILLE 15 -6
2 USC00010009 31.5602 -85.2482 179 AL ABBEVILLFOO 3 W 15 -6
Tested with dummy data.
EDIT: Just saw that a link to the file appeared. Haven't tested with that, but you should consider using read.fwf()
anyway now.
Upvotes: 0