Reading fixed width file with number of columns not known in advance

Question

I am writing a function to read a text file with fixed width format. The challenge is that the number of columns is not known in advance (it will vary from file to file), so I can't specify the widths vector for use with read.fwf().

The file uses space as separators, and the general format is: 20-character, 4-char, 3-char, 4-char, 3-char, ... repeating the pair of 4-char(space)3-char(space) combination for whatever number is needed.

A sample of the file would be something like

Robert De Niro        382 +19 2504  14  346 +16 2445  18 2413 +20 2445  17
Marlon Brando        2427 +13 2495  19 2483 +14 2429  16 2438 +18 2378  20
Martin Scorsese      2501   7  317  +3 2491   1  393  +2 2462   4  394  +9

The example above has 6 pairs of columns per line. Other files may have as many as 33 pairs.

At the moment my workaround is to manually inspect each file beforehand to work out the widths value. Any suggestions on a possible approach to automate this?

A5C1D2H2I1M1N2O1R2T1 · Accepted Answer

This is a trick I picked up here on Stack Overflow (my snippet credits @BenBolker, but I can't find the link right now). It only works if your data are in the format you describe: text followed by numbers.

Let's say we have the following text:

TEXT <- c(
  "Robert De Niro        382 +19 2504  14  346 +16 2445  18 2413 +20 2445  17",
  "Marlon Brando        2427 +13 2495  19 2483 +14 2429  16 2438 +18 2378  20",
  "Martin Scorsese      2501   7  317  +3 2491   1  393  +2 2462   4  394  +9")

We can use gsub to replace the spaces inside the name with another character, say an underscore or a dash:

gsub(" +([[:alpha:]]+)", "_\\1", TEXT)
# [1] "Robert_De_Niro        382 +19 2504  14  346 +16 2445  18 2413 +20 2445  17"
# [2] "Marlon_Brando        2427 +13 2495  19 2483 +14 2429  16 2438 +18 2378  20"
# [3] "Martin_Scorsese      2501   7  317  +3 2491   1  393  +2 2462   4  394  +9"

This will allow us to use read.table directly:

read.table(text = gsub(" +([[:alpha:]]+)", "_\\1", TEXT), header = FALSE)
#                V1   V2 V3   V4 V5   V6 V7   V8 V9  V10 V11  V12 V13
# 1  Robert_De_Niro  382 19 2504 14  346 16 2445 18 2413  20 2445  17
# 2   Marlon_Brando 2427 13 2495 19 2483 14 2429 16 2438  18 2378  20
# 3 Martin_Scorsese 2501  7  317  3 2491  1  393  2 2462   4  394   9

As @BondedDust has mentioned, you can specify colClasses = "character" if you want to keep the "+" before the numbers, but then your numbers will be characters :-)
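
To make the colClasses remark concrete, here is a minimal sketch on the same sample. The names cleaned and df_chr are only illustrative; converting a column back with as.integer is one option, not the only one.

```r
TEXT <- c(
  "Robert De Niro        382 +19 2504  14  346 +16 2445  18 2413 +20 2445  17",
  "Marlon Brando        2427 +13 2495  19 2483 +14 2429  16 2438 +18 2378  20",
  "Martin Scorsese      2501   7  317  +3 2491   1  393  +2 2462   4  394  +9")

# Join the words of each name so the name becomes a single whitespace-delimited field
cleaned <- gsub(" +([[:alpha:]]+)", "_\\1", TEXT)

# Read every column as character so the leading "+" is preserved
df_chr <- read.table(text = cleaned, header = FALSE, colClasses = "character")

df_chr$V3
# [1] "+19" "+13" "7"

# Individual columns can still be converted back to numbers when needed
v2_num <- as.integer(df_chr$V2)
```

If you want the original names back afterwards, something like gsub("_", " ", df_chr$V1) should undo the underscore replacement.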

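If you would rather stay with read.fwf, the widths vector could probably be derived from the line length instead of being typed in by hand. This is only a sketch and it assumes the layout is exactly as described in the question: a 20-character name followed by repeated groups of space, 4-char field, space, 3-char field (9 characters per pair). The names name_width, pair_width, n_pairs and df_fwf are mine, not from the question.

```r
# Sample lines from the question; for a real file these would come from readLines()
TEXT <- c(
  "Robert De Niro        382 +19 2504  14  346 +16 2445  18 2413 +20 2445  17",
  "Marlon Brando        2427 +13 2495  19 2483 +14 2429  16 2438 +18 2378  20",
  "Martin Scorsese      2501   7  317  +3 2491   1  393  +2 2462   4  394  +9")

name_width <- 20L                 # assumed width of the leading text field
pair_width <- 1L + 4L + 1L + 3L   # space, 4-char field, space, 3-char field

# Use the longest line in case trailing blanks were trimmed from some lines
n_pairs <- (max(nchar(TEXT)) - name_width) %/% pair_width
n_pairs  # 6 for the sample

# Negative widths tell read.fwf to skip that many characters (the separators)
widths <- c(name_width, rep(c(-1L, 4L, -1L, 3L), n_pairs))

con <- textConnection(TEXT)
df_fwf <- read.fwf(con, widths = widths, strip.white = TRUE)
close(con)
```

Unlike the gsub approach, this does not depend on the name containing only letters, but it will break if a file deviates from the assumed 20 + 9 * n layout.
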