vladi98

Reputation: 139

fread() with commas outside square brackets as separators

I'm trying to use fread() to get some data from a website. The data is conveniently set up with comma separators, but I get the error:

1: In fread("https://website.com/") :
Stopped early on line 56. Expected 5 fields but found 6. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<0,1,1,x[[0], [1]],0>>

This happens because the rows before line 56 have a blank in column 4, so something like <<1,1,1,,0>>, whereas line 56 contains a comma inside column 4, so that field gets split into two columns. I want the whole x[[y], [z]] expression to stay in one cell; in other words, I want my data split on commas, but not on commas that appear inside square brackets.

Edit: The real website is private, so it makes no sense to link it here; it simply contains data in CSV format, something like:

field1,field2,field3,field4,field5
1,0,0,,1
0,0,0,,1
1,1,0,,1
1,1,0,,1
............
0,1,1,x[[0], [1]],0
0,1,0,x[[0], [1]],1
1,0,1,,1
0,0,1,x[[1], [0]],0
............

The problem is that x[[0], [1]] is supposed to be in a single cell, but the comma delimiter splits it across two cells.

Is there any way to do this with fread()? Or with any other function that serves a similar purpose?

Thank you in advance and sorry if the question is somewhat basic, I'm just getting started with R.

Upvotes: 1

Views: 383

Answers (2)

Ramiro Magno

Reputation: 3175

Instead of reading your CSV file directly from that private website of yours with fread, you can download the CSV first and then:

  1. Read the lines of the CSV without any special parsing; this corresponds to csv_lines <- read_lines(my_weird_csv_text) below;
  2. Split those lines on the regex "(?!\\])(\\,)(?!\\s\\[)" instead of on a plain comma ","; this ensures that commas inside the expressions with "[[" and "]]" are not used as split characters;
  3. Finally, take the first row of the resulting matrix (split_lines) as the column names of a new dataframe/tibble coerced from the remaining rows of split_lines.

I hope it's clear.

Basically, we had to circumvent straightforward reading functions such as fread (or equivalents) by reading the file line by line and then splitting on a regex that handles your special cases.

library(readr)     # read_lines()
library(stringr)   # str_split()
library(tibble)    # as_tibble()
library(magrittr)  # provides the %>% pipe used below

my_weird_csv_text <- 
"field1,field2,field3,field4,field5
1,0,0,,1
0,0,0,,1
1,1,0,,1
1,1,0,,1
0,1,1,x[[0], [1]],0
0,1,0,x[[0], [1]],1
1,0,1,,1
0,0,1,x[[1], [0]],0"

# Read the raw lines without any CSV parsing
csv_lines <- read_lines(my_weird_csv_text)

# Split on commas, except those inside the "[[ ... ]]" expressions
# (a comma followed by " [" is left alone)
split_lines <- stringr::str_split(csv_lines, "(?!\\])(\\,)(?!\\s\\[)", simplify = TRUE)

# Use the first row as column names; the remaining rows become the data
as_tibble(split_lines[-1, ]) %>%
  `colnames<-`(split_lines[1, ]) -> tbl

tbl
#> # A tibble: 8 x 5
#>   field1 field2 field3 field4      field5
#>   <chr>  <chr>  <chr>  <chr>       <chr> 
#> 1 1      0      0      ""          1     
#> 2 0      0      0      ""          1     
#> 3 1      1      0      ""          1     
#> 4 1      1      0      ""          1     
#> 5 0      1      1      x[[0], [1]] 0     
#> 6 0      1      0      x[[0], [1]] 1     
#> 7 1      0      1      ""          1     
#> 8 0      0      1      x[[1], [0]] 0

Upvotes: 1

Edward Carney

Reputation: 1392

A suggestion:

From the documentation:

'fread' is for regular delimited files; i.e., where every row has the same number of
columns.

If the number of columns varies or is irregular because of errors in file generation, an alternative like readLines would let you process the file line by line, perhaps cleaning each line with regular-expression functions such as gsub.
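That readLines + gsub route can be sketched as follows. This is a minimal, hedged example: the pattern assumes column 4 always looks like x[[i], [j]] (as in the question's sample data), and the input lines here stand in for readLines("your_file.csv"). The idea is to wrap the bracketed expression in quotes so that fread's standard quote handling keeps it in a single cell.

```r
library(data.table)

# Stand-in for readLines("your_file.csv") -- hypothetical input
raw_lines <- c("field1,field2,field3,field4,field5",
               "1,0,0,,1",
               "0,1,1,x[[0], [1]],0")

# Quote the bracketed expression so fread treats it as one field;
# the pattern assumes field4 always has the shape x[[i], [j]]
fixed <- gsub("(x\\[\\[[0-9]+\\], \\[[0-9]+\\]\\])", "\"\\1\"", raw_lines)

# fread() honours quoted fields, so the embedded comma no longer splits
dt <- fread(text = fixed)
dt$field4
#> [1] ""            "x[[0], [1]]"
```

This keeps the fast fread parser for everything except the one problematic pattern, at the cost of hard-coding what that pattern looks like.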

Upvotes: 1
