Reputation: 1390
I have a badly formatted CSV file (I did not make it) that includes both separators and unescaped quotes inside some fields. I would like to read it into R.
Three lines of the table look something like this:
| ids  | info            | text                                    |
|------|-----------------|-----------------------------------------|
| id 1 | extra_info;1998 | text text text                          |
| id 2 | extra_info2     | text with broken dialogues quotes "hi!  |
#the same table in R string could be
string <- "ids;info;text\n\"id 1\";\"extra_info;1998\";\"text text text\"\n\"id 2\";extra_info2;\"text with broken dialogues quotes \"hi!\" \n"
Fields containing more than one word are wrapped in double quotes, as is common in CSVs, and a semicolon (;) is used as the separator. Unfortunately, because of the way the file was built, the last column (and it is always the last one) can contain an arbitrary number of semicolons or quotes inside the text, and these quotes are not always escaped.
I'm looking for a way to read this file. So far I have come up with a rather complicated workflow: since the text column is always last, I use a regex (from here) to replace the first N separators on each line with another, less common separator. However, this still fails whenever a line contains an uneven number of quotes.
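For reference, a minimal sketch of that "only the first N separators are real" idea, written so it also survives quoted fields with embedded semicolons and an unbalanced quote in the last column. It uses the sample string above and assumes exactly three columns; the regex and helper names are illustrative, not the original workflow:

```r
# Sample data from the question
string <- "ids;info;text\n\"id 1\";\"extra_info;1998\";\"text text text\"\n\"id 2\";extra_info2;\"text with broken dialogues quotes \"hi!\" \n"
lines <- strsplit(string, "\n")[[1]]

# Peel off the first two well-formed fields (either fully quoted or
# semicolon-free); everything after the second real separator belongs
# to the text column, broken quotes and all.
pattern <- '^("[^"]*"|[^;]*);("[^"]*"|[^;]*);(.*)$'
m      <- regmatches(lines, regexec(pattern, lines))
fields <- do.call(rbind, lapply(m, `[`, 2:4))

# First parsed line is the header
df <- as.data.frame(fields[-1, , drop = FALSE], stringsAsFactors = FALSE)
names(df) <- fields[1, ]

# Drop the surrounding quotes that remain on quoted fields
df[] <- lapply(df, function(x) gsub('^"|"$', "", x))
```

Because the text column is captured as-is, semicolons and stray quotes inside it no longer matter; only the first two fields need to be well formed.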
I'm thinking there must be an easier way to do this, as badly formed CSVs should be a recurring problem. Thanks.
Upvotes: 0
Views: 603
Reputation: 52268
data.table::fread works wonders:
library(data.table)
test <- fread("test.csv")
# Remove extraneous columns
test$V1 <- NULL
test$V5 <- NULL
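As a self-contained illustration of the same approach, fread can read the question's sample string directly via its `text` argument (available since data.table 1.11); its parser tends to relax its quote rule when it hits improper quoting, so the malformed row usually comes through:

```r
library(data.table)

# Sample data from the question, read from a string instead of a file
string <- "ids;info;text\n\"id 1\";\"extra_info;1998\";\"text text text\"\n\"id 2\";extra_info2;\"text with broken dialogues quotes \"hi!\" \n"
test <- fread(text = string)
```

If the defaults still misparse your real file, fread's documented `sep` and `quote` arguments give manual control (e.g. `quote = ""` disables quote processing entirely, at the cost of keeping literal quote characters in the data).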
Upvotes: 1