Reputation: 3775
I'm trying to read a file where some rows contain extra semicolons inside text strings (I have no idea what caused this)
As an example this is a super simplified data with the same problem:
bad_data <- "100; Mc Donalds; Seattle; normal day
115; Starbucks; Boston; normal day
400; PF Chang; Chicago; busy day
400;; Texas; busy day
10; D;unkin Donuts; Washin;gton; lazy day"
So it has no header and I try to read it with:
library(data.table)
fread(bad_data, sep = ";", header = F, na.strings = c("", NA), strip.white = T)
But no cigar... this is kinda impossible to read and I'd like to just skip those rows if there is no clean solution.
Upvotes: 2
Views: 210
Reputation: 587
You could try removing all of the semicolons that are inside text strings (this assumes that all of the unwanted semicolons are fully inside a string:
gsub("(\\S);(\\S)", "\\1\\2", bad_data, perl=TRUE)
[1] "100; Mc Donalds; Seattle; normal day\n 115; Starbucks; Boston; normal day\n 400; PF Chang; Chicago; busy day\n 400; Texas; busy day\n 10; Dunkin Donuts; Washington; lazy day"
Upvotes: 1
Reputation: 9592
If you just want to drop the lines where you don't have the expected number of delimiters:
library(stringi)
library(magrittr)
bad_data <-
"100; Mc Donalds; Seattle; normal day
115; Starbucks; Boston; normal day
400; PF Chang; Chicago; busy day
400;; Texas; busy day
10; D;unkin Donuts; Washin;gton; lazy day"
# split to lines. you could also use readLines if it's coming from a file
text_lines <- unlist(strsplit(bad_data, '\n'))
# which lines contain the expected number of semicolons?
good_lines <- sapply(text_lines, function(x) stri_count_fixed(x, ';') == 3)
# for those lines, split to vectors and (optional bonus) trim whitespace
good_vectors <- lapply(
text_lines[good_lines],
function(x) x %>% strsplit(';') %>% unlist %>% trimws)
# flatten to matrix (from which you can make a data.frame or whatever you want)
my_mat <- do.call(rbind, good_vectors)
result:
> my_mat
[,1] [,2] [,3] [,4]
[1,] "100" "Mc Donalds" "Seattle" "normal day"
[2,] "115" "Starbucks" "Boston" "normal day"
[3,] "400" "PF Chang" "Chicago" "busy day"
[4,] "400" "" "Texas" "busy day"
Upvotes: 1