pachadotdev

Reputation: 3775

Read delimited file where semicolon appears as separator and in strings

I'm trying to read a file in which some rows contain extra semicolons inside the text strings (I have no idea what caused this).

As an example, here is some heavily simplified data with the same problem:

bad_data <- "100; Mc Donalds; Seattle; normal day
             115; Starbucks; Boston; normal day
             400; PF Chang; Chicago; busy day
             400;; Texas; busy day
             10; D;unkin Donuts; Washin;gton; lazy day"

The file has no header, so I try to read it with:

library(data.table)
fread(bad_data, sep = ";", header = F, na.strings = c("", NA), strip.white = T)

But no cigar... these rows seem impossible to parse cleanly, so I'd be happy to just skip them if there is no clean solution.

Upvotes: 2

Views: 210

Answers (2)

Walker in the City

Reputation: 587

You could try removing all of the semicolons that appear inside text strings (this assumes that every unwanted semicolon sits directly between two non-space characters):

gsub("(\\S);(\\S)", "\\1\\2", bad_data, perl=TRUE)
[1] "100; Mc Donalds; Seattle; normal day\n             115; Starbucks; Boston; normal day\n             400; PF Chang; Chicago; busy day\n             400; Texas; busy day\n             10; Dunkin Donuts; Washington; lazy day"
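One caveat worth noting, plus a sketch of actually reading the cleaned string back in (this read-back step is my addition, not part of the answer above): the substitution also eats one of the doubled semicolons in the `400;;` row, so that row ends up with three fields instead of four and its empty "name" field is lost. Assuming padding short rows with `NA` via `fill = TRUE` is acceptable, the result can be fed straight to `fread`:

```r
library(data.table)

bad_data <- "100; Mc Donalds; Seattle; normal day
             115; Starbucks; Boston; normal day
             400; PF Chang; Chicago; busy day
             400;; Texas; busy day
             10; D;unkin Donuts; Washin;gton; lazy day"

# strip semicolons sandwiched between two non-space characters
clean_data <- gsub("(\\S);(\\S)", "\\1\\2", bad_data, perl = TRUE)

# caveat: the "400;;" row is now "400; Texas; busy day" (3 fields),
# so fill = TRUE pads it with NA -- but "Texas" lands in column V2
dt <- fread(clean_data, sep = ";", header = FALSE,
            na.strings = c("", "NA"), strip.white = TRUE, fill = TRUE)
```

Note the shifted row: after the `gsub`, `fread` has no way to know the missing field was the second one, so `Texas` ends up in the name column.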

Upvotes: 1

arvi1000

Reputation: 9592

If you just want to drop the lines where you don't have the expected number of delimiters:

library(stringi)
library(magrittr)

bad_data <- 
"100; Mc Donalds; Seattle; normal day
115; Starbucks; Boston; normal day
400; PF Chang; Chicago; busy day
400;; Texas; busy day
10; D;unkin Donuts; Washin;gton; lazy day"

# split to lines. you could also use readLines if it's coming from a file
text_lines <- unlist(strsplit(bad_data, '\n'))

# which lines contain the expected number of semicolons?
good_lines <- sapply(text_lines, function(x) stri_count_fixed(x, ';') == 3)

# for those lines, split to vectors and (optional bonus) trim whitespace
good_vectors <- lapply(
  text_lines[good_lines], 
  function(x) x %>% strsplit(';') %>% unlist %>% trimws)

# flatten to matrix (from which you can make a data.frame or whatever you want)
my_mat <- do.call(rbind, good_vectors)

result:

> my_mat
     [,1]  [,2]         [,3]      [,4]        
[1,] "100" "Mc Donalds" "Seattle" "normal day"
[2,] "115" "Starbucks"  "Boston"  "normal day"
[3,] "400" "PF Chang"   "Chicago" "busy day"  
[4,] "400" ""           "Texas"   "busy day"  
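To finish the data-frame step the answer mentions, here is a self-contained sketch: it rebuilds the same matrix using base R's `gregexpr` in place of `stri_count_fixed` (so no extra packages are needed) and then converts it. The column names are my own invention, since the original data has no header:

```r
bad_data <- "100; Mc Donalds; Seattle; normal day
115; Starbucks; Boston; normal day
400; PF Chang; Chicago; busy day
400;; Texas; busy day
10; D;unkin Donuts; Washin;gton; lazy day"

text_lines <- unlist(strsplit(bad_data, "\n"))

# base-R equivalent of stri_count_fixed(x, ';') == 3
good_lines <- lengths(gregexpr(";", text_lines, fixed = TRUE)) == 3

my_mat <- do.call(rbind, lapply(text_lines[good_lines],
                  function(x) trimws(unlist(strsplit(x, ";")))))

# hypothetical column names -- the data has no header row
df <- as.data.frame(my_mat, stringsAsFactors = FALSE)
names(df) <- c("id", "place", "city", "day_type")
df$id <- as.integer(df$id)
```

Note that the `400;;` row survives this filter (it still has exactly three semicolons), with an empty string in the `place` column.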

Upvotes: 1
