Reputation: 588
I am attempting to parse a fixed-width .txt files with readr
's read_fwf
. There are ~1.5 million observations and approx. 550 of them are missing the final 25 of 60 variables. This omission leads to the inperfect parsing of the final variable that these observations do have, 'description' in the example below, and leaves the dataframe without these partially filled columns.
For example,
df_baseline <- read_fwf(file = file, fwf_widths(fwf_widths, fwf_names),
col_types = col_types, trim_ws = T) %>%
mutate_all(na_if, "")
Warning: 1148 parsing failures.
row col expected actual file
300495 description 240 chars 102 '/path/to/my/file/filename.txt'
300495 NA 59 columns 31 columns '/path/to/my/file/filename.txt'
500245 description 240 chars 56 '/path/to/my/file/filename.txt'
500245 NA 59 columns 31 columns '/path/to/my/file/filename.txt'
500333 description 240 chars 33 '/path/to/my/file/filename.txt'
See problems(...) for more details.
col_types
is a string of 60 'c'
symbols in a row so that all columns are read-in as character. fwf_widths
and fwf_names
are appropriately specifications for the proposed column widths and column titles.
I understand that by having missing values in the final column of the df, I am violating the "fixed-width" nature of the document.
Is there a way that I can 1) Get read_fwf
to retain these partially filled rows?
2) If not, how can I read in this txt file given that 99% of it can be parsed according to a normal FWF?
Upvotes: 0
Views: 945
Reputation: 131
The question is ambiguous and therefore difficult to answer directly or precisely, but the fwf file ABCD.txt illustrates three cases the OP might be asking about:
# ABCD.txt
# 1ABCD
# 2AB
# 3AB D
# 4ABD
# 5ABCD
#
Rows 1 and 5 have no missing values and can be parsed without issue.
Rows 2 and 3 (the first truncated after three values; the second with an empty placeholder in the fourth column) can also be parsed without issue by read_fwf
, although there will be a warning (like that cited by the OP) about the truncation in row 2 (and row 4, which we treat below):
widths <- c(1,1,1,1,1)
file <- "ABCD.txt"
abc <- read_fwf(
file = path,
fwf_widths(widths),
col_types = "ccccc"
)
abc
Output:
Warning: 3 parsing failures.
row col expected actual file
2 X4 1 chars 0 'ABCD.txt'
2 -- 5 columns 4 columns 'ABCD.txt'
4 X5 1 chars 0 'ABCD.txt'
# A tibble: 5 x 5
X1 X2 X3 X4 X5
<chr> <chr> <chr> <chr> <chr>
1 1 A B C D
2 2 A B NA NA
3 3 A B NA D
4 4 A B D NA
5 5 A B C D
Note that read_fwf
fills the truncated rows with NA
for the missing values.
read.fwf
with fill = TRUE
also works, although it is slower and will not throw any warnings:
abc <- read.fwf(
path,
widths = widths,
colClasses = "character",
na.strings = c(" ","NA"),
fill = TRUE
)
abc
Output:
V1 V2 V3 V4 V5
1 1 A B C D
2 2 A B <NA> <NA>
3 3 A B <NA> D
4 4 A B D <NA>
5 5 A B C D
Row 4, however, is not parsed "correctly" with either approach if we know that the value D
is supposed to be in the fifth column. (There is no way for read_fwf
or read.fwf
to know this, though, so strictly speaking, there is no parsing error.)
There are a number of ways to handle this issue, but if the problem is consistent throughout the fwf file (e.g. the final 25 of 60 variables are missing in all such cases, as specified in the question), then one solution is to use dplyr
to transpose the improperly placed D
value from the fourth to the fifth column (or from the 35th to the 60th column, in the OP's case):
abc <- abc %>%
mutate(
V5 = case_when(
is.na(V5) & !is.na(V4) ~ V4,
!is.na(V5) ~ V5
),
V4 = case_when(
V4!=V5 ~ V4
)
)
abc
Output:
V1 V2 V3 V4 V5
1 1 A B C D
2 2 A B <NA> <NA>
3 3 A B <NA> D
4 4 A B <NA> D
5 5 A B C D
Upvotes: 0
Reputation: 2017
You can use data.table::fread()
. It automatically detects fixed width formats and using the option fill=TRUE
should give you what you want:
#abc.txt
#a b c d
#1 2 3 4
#1 2 3 4
#2 3
#1 4 3 2
library(data.table)
fread('abc.txt',fill = T)
# a b c d
# 1: 1 2 3 4
# 2: 1 2 3 4
# 3: 2 3 NA NA
# 4: 1 4 3 2
Upvotes: 1