JRR
JRR

Reputation: 588

R: read fwf file with missing final column

I am attempting to parse a fixed-width .txt files with readr's read_fwf. There are ~1.5 million observations and approx. 550 of them are missing the final 25 of 60 variables. This omission leads to the inperfect parsing of the final variable that these observations do have, 'description' in the example below, and leaves the dataframe without these partially filled columns.

For example,

df_baseline <- read_fwf(file = file, fwf_widths(fwf_widths, fwf_names), 
                         col_types = col_types, trim_ws = T) %>% 
   mutate_all(na_if, "")
Warning: 1148 parsing failures.
row         col   expected     actual file
300495 description 240 chars  102        '/path/to/my/file/filename.txt'
300495 NA          59 columns 31 columns '/path/to/my/file/filename.txt' 
500245 description 240 chars  56         '/path/to/my/file/filename.txt' 
500245 NA          59 columns 31 columns '/path/to/my/file/filename.txt' 
500333 description 240 chars  33         '/path/to/my/file/filename.txt' 
See problems(...) for more details.

col_types is a string of 60 'c' symbols in a row so that all columns are read-in as character. fwf_widths and fwf_names are appropriately specifications for the proposed column widths and column titles.

I understand that by having missing values in the final column of the df, I am violating the "fixed-width" nature of the document.

Is there a way that I can 1) Get read_fwf to retain these partially filled rows? 2) If not, how can I read in this txt file given that 99% of it can be parsed according to a normal FWF?

Upvotes: 0

Views: 945

Answers (2)

C. Rea
C. Rea

Reputation: 131

The question is ambiguous and therefore difficult to answer directly or precisely, but the fwf file ABCD.txt illustrates three cases the OP might be asking about:

# ABCD.txt
# 1ABCD
# 2AB
# 3AB D
# 4ABD
# 5ABCD
#

Rows 1 and 5 have no missing values and can be parsed without issue.

Rows 2 and 3 (the first truncated after three values; the second with an empty placeholder in the fourth column) can also be parsed without issue by read_fwf, although there will be a warning (like that cited by the OP) about the truncation in row 2 (and row 4, which we treat below):

widths <- c(1,1,1,1,1)
file <- "ABCD.txt"

abc <- read_fwf(
  file = path,
  fwf_widths(widths),
  col_types = "ccccc"
  )

abc

Output:

Warning: 3 parsing failures.
row col  expected    actual       file
  2  X4 1 chars   0         'ABCD.txt'
  2  -- 5 columns 4 columns 'ABCD.txt'
  4  X5 1 chars   0         'ABCD.txt'

# A tibble: 5 x 5
  X1    X2    X3    X4    X5   
  <chr> <chr> <chr> <chr> <chr>
1 1     A     B     C     D    
2 2     A     B     NA    NA   
3 3     A     B     NA    D    
4 4     A     B     D     NA   
5 5     A     B     C     D    

Note that read_fwf fills the truncated rows with NA for the missing values.

read.fwf with fill = TRUE also works, although it is slower and will not throw any warnings:

abc <- read.fwf(
  path,
  widths =  widths,
  colClasses = "character",
  na.strings = c(" ","NA"),
  fill = TRUE
  )

abc

Output:

  V1 V2 V3   V4   V5
1  1  A  B    C    D
2  2  A  B <NA> <NA>
3  3  A  B <NA>    D
4  4  A  B    D <NA>
5  5  A  B    C    D

Row 4, however, is not parsed "correctly" with either approach if we know that the value D is supposed to be in the fifth column. (There is no way for read_fwf or read.fwf to know this, though, so strictly speaking, there is no parsing error.)

There are a number of ways to handle this issue, but if the problem is consistent throughout the fwf file (e.g. the final 25 of 60 variables are missing in all such cases, as specified in the question), then one solution is to use dplyr to transpose the improperly placed D value from the fourth to the fifth column (or from the 35th to the 60th column, in the OP's case):

abc <- abc %>%
  mutate(
    V5 = case_when(
      is.na(V5) & !is.na(V4) ~ V4,
      !is.na(V5) ~ V5
      ),
    V4 = case_when(
      V4!=V5 ~ V4
      )
    )

abc

Output:

  V1 V2 V3   V4   V5
1  1  A  B    C    D
2  2  A  B <NA> <NA>
3  3  A  B <NA>    D
4  4  A  B <NA>    D
5  5  A  B    C    D

Upvotes: 0

Rohit
Rohit

Reputation: 2017

You can use data.table::fread(). It automatically detects fixed width formats and using the option fill=TRUE should give you what you want:

#abc.txt
#a   b   c   d
#1   2   3   4
#1   2   3   4
#2   3
#1   4   3   2
library(data.table)
fread('abc.txt',fill = T)
#    a b  c  d
# 1: 1 2  3  4
# 2: 1 2  3  4
# 3: 2 3 NA NA
# 4: 1 4  3  2

Upvotes: 1

Related Questions