Reputation: 193657
Can fread from "data.table" be forced to successfully use "." as a sep value?
I'm trying to use fread to speed up my concat.split functions in "splitstackshape". See this Gist for the general approach I'm taking, and this question for why I want to make the switch.
The problem I'm running into is treating a dot (".") as a value for sep. Whenever I do so, I get an "unexpected character" error.
The following simplified example demonstrates the problem.
library(data.table)
y <- paste("192.168.1.", 1:10, sep = "")
x1 <- tempfile()
writeLines(y, x1)
fread(x1, sep = ".", header = FALSE)
# Error in fread(x1, sep = ".", header = FALSE) : Unexpected character (
# 192) ending field 2 of line 1
The workaround I have in my current function is to substitute "." with another character that is hopefully not present in the original data, say "|", but that seems risky to me since I can't predict what is in someone else's dataset. Here's the workaround in action.
x2 <- tempfile()
z <- gsub(".", "|", y, fixed=TRUE)
writeLines(z, x2)
fread(x2, sep = "|", header = FALSE)
# V1 V2 V3 V4
# 1: 192 168 1 1
# 2: 192 168 1 2
# 3: 192 168 1 3
# 4: 192 168 1 4
# 5: 192 168 1 5
# 6: 192 168 1 6
# 7: 192 168 1 7
# 8: 192 168 1 8
# 9: 192 168 1 9
# 10: 192 168 1 10
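One way to make the substitution less of a gamble is to check the data before picking the replacement character. The following is only a sketch in base R: pick_free_sep and its list of candidate characters are hypothetical names introduced here for illustration, not part of "data.table" or "splitstackshape".

```r
# Hypothetical helper: return the first candidate separator that does not
# occur anywhere in the input lines, or stop if every candidate is taken.
pick_free_sep <- function(lines, candidates = c("|", "\t", ";", "~", "^")) {
  for (cand in candidates) {
    # fixed = TRUE treats the candidate as a literal string, not a regex
    if (!any(grepl(cand, lines, fixed = TRUE))) return(cand)
  }
  stop("no unused candidate separator found")
}

y <- paste("192.168.1.", 1:10, sep = "")
safe_sep <- pick_free_sep(y)

# Replace the literal dot with the separator that was confirmed to be unused
z <- gsub(".", safe_sep, y, fixed = TRUE)
```

This costs one extra scan of the data per candidate tried, so it trades some speed for not silently corrupting fields that already contain the replacement character.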
For the purposes of this question, assume that the data are balanced (each line will have the same number of "sep" characters). I'm aware that using a "." as a separator is not the best idea, but I'm just trying to account for what other users might have in their datasets, based on other questions I've answered here on SO.
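For what it's worth, the balanced assumption can be checked up front by counting separators per line. This is a minimal base R sketch; count_sep is a hypothetical helper name used only for illustration.

```r
# Hypothetical helper: count how many times the literal `sep` occurs per line.
count_sep <- function(lines, sep = ".") {
  # gregexpr() finds every match position; regmatches() extracts the matches;
  # lengths() then gives the number of matches for each line (0 if none).
  lengths(regmatches(lines, gregexpr(sep, lines, fixed = TRUE)))
}

y <- paste("192.168.1.", 1:10, sep = "")
n_sep <- count_sep(y)

# The data are balanced when every line has the same separator count
is_balanced <- length(unique(n_sep)) == 1L
```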
Upvotes: 8
Views: 4714
Reputation: 55410
The issue seems to be related to the numeric value of the text itself:
library(data.table)
y <- paste("Hz.BB.GHG.", 1:10, sep = "")
xChar <- tempfile()
writeLines(y, xChar)
fread(xChar, sep = ".", header = FALSE)
# V1 V2 V3 V4
# 1: Hz BB GHG 1
# 2: Hz BB GHG 2
# 3: Hz BB GHG 3
# 4: Hz BB GHG 4
# 5: Hz BB GHG 5
# 6: Hz BB GHG 6
# 7: Hz BB GHG 7
# 8: Hz BB GHG 8
# 9: Hz BB GHG 9
# 10: Hz BB GHG 10
However, trying again with the original values gives the same error, regardless of colClasses:
fread(x1, sep = ".", header = FALSE, colClasses="numeric", verbose=TRUE)
fread(x1, sep = ".", header = FALSE, colClasses="character", verbose=TRUE)
Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
Looking for supplied sep '.' on line 10 (the last non blank line in the first 'autostart') ... found ok
Found 4 columns
First row with 4 fields occurs on line 1 (either column names or first row of data)
Error in fread(x1, sep = ".", header = FALSE, colClasses = "character", :
Unexpected character (192.) ending field 2 of line 1
This, however, does work:
read.table(x1, sep=".")
# V1 V2 V3 V4
# 1 192 168 1 1
# 2 192 168 1 2
# 3 192 168 1 3
# 4 192 168 1 4
# ... <cropped>
Upvotes: 0
Reputation: 59612
Now implemented in v1.9.5 on GitHub.
> input = paste( paste("192.168.1.", 1:5, sep=""), collapse="\n")
> cat(input,"\n")
192.168.1.1
192.168.1.2
192.168.1.3
192.168.1.4
192.168.1.5
Setting sep='.' results in ambiguity with the new argument dec (by default '.'):
> fread(input,sep=".")
Error in fread(input, sep = ".") :
The two arguments to fread 'dec' and 'sep' are equal ('.')
Therefore choose something else for dec:
> fread(input,sep=".",dec=",")
V1 V2 V3 V4
1: 192 168 1 1
2: 192 168 1 2
3: 192 168 1 3
4: 192 168 1 4
5: 192 168 1 5
You may get a warning:
> fread(input,sep=".",dec=",")
V1 V2 V3 V4
1: 192 168 1 1
2: 192 168 1 2
3: 192 168 1 3
4: 192 168 1 4
5: 192 168 1 5
Warning message:
In fread(input, sep = ".", dec = ",") :
Run again with verbose=TRUE to inspect... Unable to change to a locale
which provides the desired dec. You will need to add a valid locale name
to getOption("datatable.fread.dec.locale"). See the paragraph in ?fread.
Either ignore or suppress the warning, or read the paragraph and set the option:
options(datatable.fread.dec.locale = "fr_FR.utf8")
This ensures there can be no ambiguity.
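If the separator comes from user input, the choice of dec can be automated so the two never collide. The sketch below is an assumption-laden illustration: pick_dec is a hypothetical helper, and the fread call is guarded so it only runs when "data.table" is installed.

```r
# Hypothetical helper: return a decimal mark that differs from `sep`.
pick_dec <- function(sep, dec = ".") {
  if (identical(sep, dec)) {
    # Fall back to the other common decimal mark when the two collide
    if (dec == ".") "," else "."
  } else {
    dec
  }
}

input <- paste(paste("192.168.1.", 1:5, sep = ""), collapse = "\n")

# Only attempt the read when data.table is available
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::fread(input, sep = ".", dec = pick_dec("."), header = FALSE)
  print(dt)
}
```

Note that this changes how genuine decimal numbers in the file would be parsed, so it only makes sense when the fields are known not to contain fractional values written with ".".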
Upvotes: 3