Reputation: 139
I'm trying to use fread() to get some data from a website. The data is conveniently set up with comma separators, but I get the error:
1: In fread("https://website.com/") :
Stopped early on line 56. Expected 5 fields but found 6. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<0,1,1,x[[0], [1]],0>>
This is because the entries before line 56 have a blank in column 4, something like <<1,1,1,,0>>, whereas line 56 has an entry containing a comma in column 4, so it gets split into two columns. I want the whole x[[y], [z]] expression to stay in one cell; that is, I want my data to be split on commas, but not on commas that are inside square brackets.
Edit: The real website is private, so it makes no sense to link it here, but it simply contains data in CSV format, something like:
field1,field2,field3,field4,field5
1,0,0,,1
0,0,0,,1
1,1,0,,1
1,1,0,,1
............
0,1,1,x[[0], [1]],0
0,1,0,x[[0], [1]],1
1,0,1,,1
0,0,1,x[[1], [0]],0
............
The problem is that x[[0], [1]] is supposed to be in a single cell, but because of the comma delimiter it gets split across two cells.
Is there any way to handle this with fread(), or with any other function that serves a similar purpose?
Thank you in advance, and sorry if the question is somewhat basic; I'm just getting started with R.
Upvotes: 1
Views: 383
Reputation: 3175
Instead of reading your CSV file directly from that private website of yours with fread, you can download the CSV first and then:
1. read the raw text line by line with read_lines(), giving one character string per line (csv_lines);
2. split each line with the regex "(?!\\])(\\,)(?!\\s\\[)" as opposed to using the single comma "," (this ensures that commas within those expressions with "[[" and "]]" are not used as split characters);
3. use the first line of the resulting matrix (split_lines) to define the column names of a new dataframe/tibble that has been coerced from split_lines.
I hope it's clear. Basically, we had to circumvent straightforward reading functions such as fread or other equivalents by reading line by line and then splitting on a regex that handles your special cases.
library(readr)      # read_lines()
library(data.table) # fread(); not actually needed below
library(stringr)    # str_split()
library(tibble)     # as_tibble()
library(magrittr)   # the %>% pipe

my_weird_csv_text <-
"field1,field2,field3,field4,field5
1,0,0,,1
0,0,0,,1
1,1,0,,1
1,1,0,,1
0,1,1,x[[0], [1]],0
0,1,0,x[[0], [1]],1
1,0,1,,1
0,0,1,x[[1], [0]],0"

# 1. read the raw text line by line
csv_lines <- read_lines(my_weird_csv_text)

# 2. split each line on commas that are NOT followed by " [",
#    so the commas inside x[[...], [...]] are left untouched
split_lines <- stringr::str_split(csv_lines, "(?!\\])(\\,)(?!\\s\\[)", simplify = TRUE)

# 3. drop the header row, coerce to a tibble, and use the header for column names
as_tibble(split_lines[-1, ]) %>%
  `colnames<-`(split_lines[1, ]) -> tbl

tbl
#> # A tibble: 8 x 5
#> field1 field2 field3 field4 field5
#> <chr> <chr> <chr> <chr> <chr>
#> 1 1 0 0 "" 1
#> 2 0 0 0 "" 1
#> 3 1 1 0 "" 1
#> 4 1 1 0 "" 1
#> 5 0 1 1 x[[0], [1]] 0
#> 6 0 1 0 x[[0], [1]] 1
#> 7 1 0 1 "" 1
#> 8 0 0 1 x[[1], [0]] 0
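If you also want numeric columns afterwards, a small follow-up is possible; this is my addition rather than part of the code above, and it simply lets readr guess the column types:
# Optional, not part of the answer above: guess column types;
# field1-field3 and field5 become numeric, field4 stays character
# because of the bracketed values.
tbl <- readr::type_convert(tbl)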
Upvotes: 1
Reputation: 1392
A suggestion. From the documentation:
'fread' is for regular delimited files; i.e., where every row has the same number of columns.
If the number of columns varies or is irregular because of errors in file generation, an alternative like readLines would enable you to process the file line by line, perhaps using regular expressions like gsub, etc.; a sketch of that approach follows.
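To make that concrete, here is a minimal sketch of the readLines()/gsub() idea, assuming the 5-column layout shown in the question; the placeholder token "@COMMA@" and the local file name "data.csv" are illustrative assumptions, not something from the original post:
library(data.table)

# Sketch only: read the raw lines first (a URL works here as well)
raw_lines <- readLines("data.csv")

# Protect the comma inside the bracketed expressions; in this data it is
# always followed by " [", which never happens for a real field delimiter
protected <- gsub(", \\[", "@COMMA@ [", raw_lines)

# Every row now has exactly 5 fields, so fread() parses it normally
dt <- fread(text = paste(protected, collapse = "\n"))

# Restore the original comma inside column 4
dt[, field4 := gsub("@COMMA@", ",", field4)]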
Upvotes: 1