Reading large data with messy strings and multiple string indicators in R

Question

I have a large (8GB+) CSV file (comma-separated) that I want to read into R. The file contains three columns:

date #in 2017-12-27 format
text #a string
type #a label per string (either NA, typeA, or typeB)

The problem I encounter is that the text column mixes several string delimiters: ' (single quotation marks), " (double quotation marks), and no quotation marks at all. Some entries also contain several separately quoted strings.

E.g.

date        text                        type
2016-01-01  great job!                  NA
2016-01-02  please, type "submit"       typeA
2016-01-02  "can't see the "error" now" typeA
2016-01-03  "add \"/filename.txt\""   NA

To read this large file, I tried:

Base read.csv and readr's read_csv: both work fine for a portion of the file but fail (probably due to memory) or take ages on the whole file.
Chunking the data via the Mac terminal into batches of 1 million lines: fails because the lines seem to break arbitrarily.
Using fread (preferred, as I hope it will also solve the two issues above): fails with Error: Expecting 3 cols, but line 1103 contains text after processing all cols.

My idea is to work around these issues by using specifics of the data that I know, i.e. that each line starts with a date and ends with either NA, typeA, or typeB.
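
To make the idea concrete, here is a minimal sketch of what I have in mind. It assumes that every record sits on exactly one physical line, which may not hold for the real file given that the terminal chunks seemed to break mid-record. The helper names parse_lines and read_in_chunks are made up for illustration and are not from any package; the demo lines are taken from the sample data in this question.

```r
# Hypothetical helper: parse raw lines using only the known structure,
# a leading date and a trailing label (NA, typeA or typeB).
parse_lines <- function(lines) {
  pattern <- '^"?([0-9]{4}-[0-9]{2}-[0-9]{2})"?,(.*),"?(NA|typeA|typeB)"?$'
  m <- regmatches(lines, regexec(pattern, lines))
  ok <- lengths(m) == 4          # the header and malformed lines do not match
  m <- do.call(rbind, m[ok])     # columns: full match, date, text, type
  if (is.null(m)) return(NULL)
  text <- sub('^"(.*)"$', "\\1", m[, 3])        # drop the outer quotes, if any
  text <- gsub('""', '"', text, fixed = TRUE)   # undo doubled inner quotes
  type <- ifelse(m[, 4] == "NA", NA_character_, m[, 4])
  data.frame(date = as.Date(m[, 2]), text = text, type = type,
             stringsAsFactors = FALSE)
}

# Hypothetical helper: read the file piece by piece through one open
# connection, so that only chunk_size raw lines are held at a time.
read_in_chunks <- function(path, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  pieces <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    pieces[[length(pieces) + 1L]] <- parse_lines(lines)
  }
  do.call(rbind, pieces)
}

# Small demo on a temporary file
sample_lines <- c(
  '"date","text","type"',
  '"2016-04-01","Does it work? Maybe try the other approach.","typeB"',
  '"2014-05-02","Tried to remove ""../"" but no success @myid",typeA',
  '"2018-05-02","i try this, but it doesnt work",NA'
)
tmp <- tempfile(fileext = ".csv")
writeLines(sample_lines, tmp)
parsed <- read_in_chunks(tmp, chunk_size = 2L)
```

Lines that do not match the pattern are silently dropped in this sketch; for the real file I would rather collect them separately so that broken records can be inspected.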

How could I implement this (either with plain readLines or with fread)?

Edit: Sample data (anonymized) as opened with Mac TextWrangler:

"date","text","type"
"2016-03-30","Maybe use `tapply` from `base`, and check how that works.",NA
"2016-04-01","Fiex this now. Please check.","typeA"
"2016-04-01","Does it work? Maybe try the other approach.","typeB"
"2016-04-01","This won't work. You should remove ABC ... each line starts with a date and ends with ... and this line is veeeeeeeeeeeeeeeeeery long.",NA
"2014-05-02","Tried to remove ""../"" but no success @myid",typeA

Sample data 2:

"date","text","type"
"2018-05-02","i try this, but it doesnt work",NA
"2018-05-02","Thank you very much. Cheers !!",NA
"2018-05-02","@myid. I'll change this.",NA

Sample data for reproducible fread error "Expecting 3 cols, but line 3 contains text after processing all cols.":

"date","text","type"
"2015-03-02","Some text, some text, some question? Please, some question?",NA
"2015-03-02","Here you have the error ""Can’t access {file ""Macintosh HD:abc:def:filename"", ""/abc.txt""} from directory."" something -1100 from {file ""Macintosh HD:abc:def:filename"", ""/abc.txt""} to file",NA
"2015-03-02","good idea",NA
"2015-03-02","Worked perfectly :)",NA

SessionInfo:

R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.5

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.10.4-3 readr_1.1.1        

loaded via a namespace (and not attached):
[1] compiler_3.5.0   assertthat_0.2.0 R6_2.2.2         cli_1.0.0       
[5] hms_0.4.2        tools_3.5.0      pillar_1.2.2     rstudioapi_0.7  
[9] tibble_1.4.2     yaml_2.1.19      crayon_1.3.4     Rcpp_0.12.16    
[13] utf8_1.1.3       pkgconfig_2.0.1  rlang_0.2.0
