Reputation: 4686
I am writing a function to read a text file with fixed width format. The challenge is that the number of columns is not known in advance (it will vary from file to file), so I can't specify the widths
vector for use with read.fwf()
.
The file uses space as separators, and the general format is: 20-character, 4-char, 3-char, 4-char, 3-char, ... repeating the pair of 4-char(space)3-char(space) combination for whatever number is needed.
A sample of the file would be something like
Robert De Niro 382 +19 2504 14 346 +16 2445 18 2413 +20 2445 17
Marlon Brando 2427 +13 2495 19 2483 +14 2429 16 2438 +18 2378 20
Martin Scorsese 2501 7 317 +3 2491 1 393 +2 2462 4 394 +9
The example above has 6 pairs of the columns in the entire file. Other files may have as many as 33 pairs of the columns.
At the moment my work around is to manually inspect each file beforehand to specify the widths
value. Any suggestions on possible approach to automate this?
Upvotes: 0
Views: 589
Reputation: 263471
You could read for the max and then post-process to remove all columns with all NA:
> read.fwf(textConnection("Robert De Niro 382 +19 2504 14 346 +16 2445 18 2413 +20 2445 17
+ Marlon Brando 2427 +13 2495 19 2483 +14 2429 16 2438 +18 2378 20
+ Martin Scorsese 2501 7 317 +3 2491 1 393 +2 2462 4 394 +9"), widths=c(20, rep(c(5,4), 33) ) )
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
1 Robert De Niro 382 19 2504 14 346 16 2445 18 2413 20 2445 17 NA
2 Marlon Brando 2427 13 2495 19 2483 14 2429 16 2438 18 2378 20 NA
3 Martin Scorsese 2501 7 317 3 2491 1 393 2 2462 4 394 9 NA
V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33
1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
V34 V35 V36 V37 V38 V39 V40 V41 V42 V43 V44 V45 V46 V47 V48 V49 V50 V51 V52
1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
V53 V54 V55 V56 V57 V58 V59 V60 V61 V62 V63 V64 V65 V66 V67
1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Use colClasses if you need to keep the '+'-signs. In particular following up on Ananda's contribution this might work as a one-liner using the more flexible read.table() approach:
fil <- <insert-file-name-here>
read.table(text=gsub(" +([[:alpha:]]+)", "_\\1", textConnection( file(fil) ),
col.Classes=c("character", "numeric") )
The colClasses will get repeated as many times as needed, so for instance an 11-column file wold get read as: c( "character", "numeric", "character", "numeric", "character", "numeric", "character", "numeric", "character", "numeric", "character")
Upvotes: 1
Reputation: 193677
This is a trick I learned somewhere here on Stack Overflow (my snippet says I learned it from @BenBolker, but I can't find the link right now), but will only work if your data are in the format you describe: text followed by numbers.
Let's say we have the following text:
TEXT <- c(
"Robert De Niro 382 +19 2504 14 346 +16 2445 18 2413 +20 2445 17",
"Marlon Brando 2427 +13 2495 19 2483 +14 2429 16 2438 +18 2378 20",
"Martin Scorsese 2501 7 317 +3 2491 1 393 +2 2462 4 394 +9")
We can use gsub
to replace the spaces in the word with another character--say an underscore or a dash:
gsub(" +([[:alpha:]]+)", "_\\1", TEXT)
# [1] "Robert_De_Niro 382 +19 2504 14 346 +16 2445 18 2413 +20 2445 17"
# [2] "Marlon_Brando 2427 +13 2495 19 2483 +14 2429 16 2438 +18 2378 20"
# [3] "Martin_Scorsese 2501 7 317 +3 2491 1 393 +2 2462 4 394 +9"
This will allow us to use read.table
directly:
read.table(text = gsub(" +([[:alpha:]]+)", "_\\1", text), header = FALSE)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
# 1 Robert_De_Niro 382 19 2504 14 346 16 2445 18 2413 20 2445 17
# 2 Marlon_Brando 2427 13 2495 19 2483 14 2429 16 2438 18 2378 20
# 3 Martin_Scorsese 2501 7 317 3 2491 1 393 2 2462 4 394 9
As @BondedDust has mentioned, you can specify colClasses = "character"
if you want to keep the "+" before the numbers, but then your numbers will be characters :-)
Upvotes: 2