Ricky
Ricky

Reputation: 4686

Reading fixed width file with number of columns not known in advance

I am writing a function to read a text file with fixed width format. The challenge is that the number of columns is not known in advance (it will vary from file to file), so I can't specify the widths vector for use with read.fwf().

The file uses space as separators, and the general format is: 20-character, 4-char, 3-char, 4-char, 3-char, ... repeating the pair of 4-char(space)3-char(space) combination for whatever number is needed.

A sample of the file would be something like

Robert De Niro        382 +19 2504  14  346 +16 2445  18 2413 +20 2445  17
Marlon Brando        2427 +13 2495  19 2483 +14 2429  16 2438 +18 2378  20
Martin Scorsese      2501   7  317  +3 2491   1  393  +2 2462   4  394  +9

The example above has 6 pairs of the columns in the entire file. Other files may have as many as 33 pairs of the columns.

At the moment my work around is to manually inspect each file beforehand to specify the widths value. Any suggestions on possible approach to automate this?

Upvotes: 0

Views: 589

Answers (2)

IRTFM
IRTFM

Reputation: 263471

You could read for the max and then post-process to remove all columns with all NA:

> read.fwf(textConnection("Robert De Niro        382 +19 2504  14  346 +16 2445  18 2413 +20 2445  17
+ Marlon Brando        2427 +13 2495  19 2483 +14 2429  16 2438 +18 2378  20
+ Martin Scorsese      2501   7  317  +3 2491   1  393  +2 2462   4  394  +9"), widths=c(20, rep(c(5,4), 33) ) )
                    V1   V2 V3   V4 V5   V6 V7   V8 V9  V10 V11  V12 V13 V14
1 Robert De Niro        382 19 2504 14  346 16 2445 18 2413  20 2445  17  NA
2 Marlon Brando        2427 13 2495 19 2483 14 2429 16 2438  18 2378  20  NA
3 Martin Scorsese      2501  7  317  3 2491  1  393  2 2462   4  394   9  NA
  V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33
1  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
2  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
3  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
  V34 V35 V36 V37 V38 V39 V40 V41 V42 V43 V44 V45 V46 V47 V48 V49 V50 V51 V52
1  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
2  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
3  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
  V53 V54 V55 V56 V57 V58 V59 V60 V61 V62 V63 V64 V65 V66 V67
1  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
2  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
3  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA

Use colClasses if you need to keep the '+'-signs. In particular following up on Ananda's contribution this might work as a one-liner using the more flexible read.table() approach:

 fil <- <insert-file-name-here>
  read.table(text=gsub(" +([[:alpha:]]+)", "_\\1", textConnection( file(fil) ), 
             col.Classes=c("character", "numeric") )

The colClasses will get repeated as many times as needed, so for instance an 11-column file wold get read as: c( "character", "numeric", "character", "numeric", "character", "numeric", "character", "numeric", "character", "numeric", "character")

Upvotes: 1

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193677

This is a trick I learned somewhere here on Stack Overflow (my snippet says I learned it from @BenBolker, but I can't find the link right now), but will only work if your data are in the format you describe: text followed by numbers.

Let's say we have the following text:

TEXT <- c(
  "Robert De Niro        382 +19 2504  14  346 +16 2445  18 2413 +20 2445  17",
  "Marlon Brando        2427 +13 2495  19 2483 +14 2429  16 2438 +18 2378  20",
  "Martin Scorsese      2501   7  317  +3 2491   1  393  +2 2462   4  394  +9")

We can use gsub to replace the spaces in the word with another character--say an underscore or a dash:

gsub(" +([[:alpha:]]+)", "_\\1", TEXT)
# [1] "Robert_De_Niro        382 +19 2504  14  346 +16 2445  18 2413 +20 2445  17"
# [2] "Marlon_Brando        2427 +13 2495  19 2483 +14 2429  16 2438 +18 2378  20"
# [3] "Martin_Scorsese      2501   7  317  +3 2491   1  393  +2 2462   4  394  +9"

This will allow us to use read.table directly:

read.table(text = gsub(" +([[:alpha:]]+)", "_\\1", text), header = FALSE)
#                V1   V2 V3   V4 V5   V6 V7   V8 V9  V10 V11  V12 V13
# 1  Robert_De_Niro  382 19 2504 14  346 16 2445 18 2413  20 2445  17
# 2   Marlon_Brando 2427 13 2495 19 2483 14 2429 16 2438  18 2378  20
# 3 Martin_Scorsese 2501  7  317  3 2491  1  393  2 2462   4  394   9

As @BondedDust has mentioned, you can specify colClasses = "character" if you want to keep the "+" before the numbers, but then your numbers will be characters :-)

Upvotes: 2

Related Questions