Reputation: 373
Sample input tab-delimited text file, note there is bad data from this source file, the enclosing " at end of line 3 is two lines down. So there is 1 complete blank line, followed by a line with just the double-quote character, then continued good data on the next line.
id ca cb cc cd
1 hi bye hey nope
2 ab cd ef "quoted text here"
3 gh ij kl "quoted text but end quote is 2 lines down
"
4 mn op qr lalalala
when I read this into R, tried using read.csv and fread, with/without 'blank.lines.skip = T' for fread, I get the following data table:
id ca cb cc cd
1 1 hi bye hey nope
2 2 ab cd ef quoted text here
3 3 gh ij kl quoted text but end quote is 2 lines down
4 4 mn op qr lalalala
The data table does not show the 'bad' lines. OK, good! However, when I go to write out this data table, tried both write.table and fwrite, those 2 bad lines of /nothing/, and the double-quote, are written out just like they show in the input file! I've tried doing:
dt[complete.cases(dt),],
dt[!apply(dt == "", 1, all),]
to clear out empty data before writing out, but it does nothing. The data table still only shows those 4 entries. Where is R keeping this 'missing' data? How can I clear out that bad data?
I hope this is a 'one-off' bad output from the source (good ol' US Govt!), but I think they saved this from an xls file, which had bad formatting in a column, causing the text file to contain this mistake, but they obviously did not check the output.
Upvotes: 2
Views: 93
Reputation: 373
After sitting back and thinking through the reading functions, because that column (cd) data is quoted, there's actually two newline characters at the end of the string, which is not shown in the data table element! So writing out that element will result in writing those two line breaks. All I needed to do was:
dt$cd <- gsub("[\r\n","",dt$cd)
and that fixed it, the output written to file now has correct rows of data. I wish I could remove my question...but maybe someday someone will come across the same "issue". I should have stepped back and thought about it before posting the question.
Upvotes: 1