weejun

Reputation: 25

Ignore files with parsing errors in import (read_csv)

I have incredibly raw data in the format of a .zip with a .txt file inside. For the most part, it cleanly reads in using read_csv, but there are some lines where the data is logging something else and completely skews the column structure. This data has no chance of being fixed.

With read_csv, these lines show up as parsing failures. I want to set up my code so that if this problem appears anywhere in a file, the whole file is ignored. It'd be great to also have a log of which files were ignored/thrown out. I looked into possibly(), but since the failure is only a warning about individual lines and not an error for the whole file, it doesn't skip the file.

This is my code at the moment.

library(dplyr)
library(readr)
library(purrr)

read_log <- function(path) {
  read_csv(path, col_types = cols(.default = col_character())) %>%
    mutate(filename = basename(path))
}

test_files <- file.path("example.txt") #would normally be list.files, simplified for this reprex

raw_data <- map_dfr(test_files, read_log)
#> Warning: 6 parsing failures.
#> row col   expected     actual          file
#>   3  -- 17 columns 4 columns  'example.txt'
#>   4  -- 17 columns 23 columns 'example.txt'
#>   5  -- 17 columns 23 columns 'example.txt'
#>   6  -- 17 columns 23 columns 'example.txt'
#>   7  -- 17 columns 23 columns 'example.txt'
#> ... ... .......... .......... .............
#> See problems(...) for more details.

Upvotes: 0

Views: 777

Answers (1)

Ronak Shah

Reputation: 388962

You can return NULL whenever read_csv raises a warning. Try using this function.

library(readr)
library(purrr)
library(dplyr)

read_log <- function(path) {
  # Any warning from read_csv (including parsing failures) makes this return NULL
  data <- tryCatch(
    read_csv(path, col_types = cols(.default = col_character())),
    warning = function(w) NULL
  )
  # Only tag the file name when the file was read without warnings
  if (!is.null(data)) {
    data <- data %>% mutate(filename = basename(path))
  }
  data
}

Read the data with map instead of map_dfr, so the NULL results stay in the list:

all_data <- map(test_files, read_log)

Files that were not read:

not_read_files <- test_files[sapply(all_data, is.null)]

Combine the data (bind_rows drops the NULL entries):

total_data <- bind_rows(all_data)
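
One caveat with the tryCatch approach is that it drops a file on any warning, not only on parsing failures. A possible variant, sketched below, checks readr's problems() table instead and prints a message for each skipped file. The names read_log_checked and read_all_logs are hypothetical helpers introduced here for illustration; this is a minimal sketch, assuming that every parsing failure you care about is recorded in problems().

```r
library(readr)
library(purrr)
library(dplyr)

# Hypothetical helper: returns the parsed data, or NULL when readr
# recorded at least one parsing problem for the file.
read_log_checked <- function(path) {
  # Silence the parsing warning; problems() is inspected explicitly below
  data <- suppressWarnings(
    read_csv(path, col_types = cols(.default = col_character()))
  )
  n_problems <- nrow(problems(data))
  if (n_problems > 0) {
    message("Skipping ", basename(path), ": ", n_problems, " parsing problem(s)")
    return(NULL)
  }
  mutate(data, filename = basename(path))
}

# Hypothetical wrapper: reads every path, returning the combined data
# and the paths of the files that were skipped.
read_all_logs <- function(paths) {
  results <- map(paths, read_log_checked)
  skipped <- paths[map_lgl(results, is.null)]
  # bind_rows ignores the NULL entries left by skipped files
  list(data = bind_rows(results), skipped = skipped)
}
```

Calling read_all_logs(test_files) would then give the combined data in $data and the ignored files in $skipped, which could be written out with writeLines() if a persistent log is needed.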

Upvotes: 1
