Reputation: 3768
I have a text file containing:
Tue Feb 11 12:19:39 +0000 2014
Tue Feb 11 12:19:56 +0000 2014
Tue Feb 11 12:20:04 +0000 2014
and I read it into R:
dataset <- read.csv("Time.txt")
and in order for R to recognise the timestamps in the file, I write:
time <- strptime(dataset[,1], format = "%a %b %d %H:%M:%S %z %Y")
and whenever I try to plot a histogram with:
hist(time, breaks = 100)
it produces the following error, together with a generated histogram:
In breaks[-1L] + breaks[-nB] : NAs produced by integer overflow
What could be the issue that is prompting this error?
Upvotes: 2
Views: 2904
Reputation: 1688
Since you asked what could be causing the error, here it is:
The message is produced when the hist.default function calculates the midpoints of the histogram. The vector mids <- 0.5 * (breaks[-1L] + breaks[-nB]) holds the halfway point between each pair of adjacent breaks. The issue arises because the breaks are generated as integers:
If the argument breaks is numeric and has length 1, then the hist.default function (which is called by hist.POSIXt) creates a vector of breaks based on the range of x and the number of breaks. This is done using the pretty function. For reasons I have not looked into too closely, if breaks is small enough that pretty(range(x), n = breaks, min.n = 1) returns only one of each value, e.g.:
pretty(range(x), n = 35, min.n = 1)
#[1] 1392121179 1392121180 1392121181 1392121182 1392121183 1392121184
#[7] 1392121185 1392121186 1392121187 1392121188 1392121189 1392121190
#[13] 1392121191 1392121192 1392121193 1392121194 1392121195 1392121196
#[19] 1392121197 1392121198 1392121199 1392121200 1392121201 1392121202
#[25] 1392121203 1392121204
then the output is of integer type. If, however, the number of breaks is larger and some of the outputs appear duplicated (the step is 0.5, and the fractional parts are rounded away when printed):
pretty(range(x), n = 36, min.n = 1)
# [1] 1392121179 1392121180 1392121180 1392121181 1392121181 1392121182
# [7] 1392121182 1392121183 1392121183 1392121184 1392121184 1392121185
#[13] 1392121185 1392121186 1392121186 1392121187 1392121187 1392121188
#[19] 1392121188 1392121189 1392121189 1392121190 1392121190 1392121191
#[25] 1392121191 1392121192 1392121192 1392121193 1392121193 1392121194
#[31] 1392121194 1392121195 1392121195 1392121196 1392121196 1392121197
#[37] 1392121197 1392121198 1392121198 1392121199 1392121199 1392121200
#[43] 1392121200 1392121201 1392121201 1392121202 1392121202 1392121203
#[49] 1392121203 1392121204 1392121204
then the output is numeric.
Because R uses 32-bit integers and POSIXt timestamps stored as integers are large numbers, adding two of them overflows the integer range, and R returns NA. When pretty returns numeric, this is not a problem.
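The overflow can be seen in isolation with a minimal sketch. The value below is simply the first break from the output above, stored as an integer; the variable names are made up for illustration.

```r
# Illustrative value: an epoch timestamp held as a 32-bit integer
t_int <- 1392121179L

# Largest value an R integer can hold (2147483647)
int_max <- .Machine$integer.max

# Integer + integer stays integer; the true sum exceeds int_max, so R
# returns NA and emits "NAs produced by integer overflow" (suppressed here)
sum_int <- suppressWarnings(t_int + t_int)

# The same sum in double precision is represented without trouble
sum_dbl <- as.numeric(t_int) + as.numeric(t_int)

print(sum_int)
print(sum_dbl)
```

A single timestamp fits comfortably in an integer; it is only the sum of two that goes past the limit.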
See also: What is integer overflow in R and how can it happen?
In practice, all this means is that if you print the structure returned by hist, all of your mids values will be NA, but I don't think it actually affects the plotting of the histogram. Thus it is only a warning.
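To illustrate, here is a sketch that applies the midpoint formula quoted earlier to the same break values, once stored as integers and once as doubles. It does not call hist itself, since whether hist reproduces the warning may depend on the R version; breaks_int, breaks_num, mids_int and mids_num are names chosen for this example.

```r
# The 26 break values from the first pretty() output, stored as integers
breaks_int <- 1392121179L:1392121204L
nB <- length(breaks_int)

# Midpoint formula on integer breaks: every pairwise sum overflows,
# so every midpoint is NA (the overflow warning is suppressed here)
mids_int <- suppressWarnings(0.5 * (breaks_int[-1L] + breaks_int[-nB]))

# Same formula on double-precision breaks: no overflow
breaks_num <- as.numeric(breaks_int)
mids_num <- 0.5 * (breaks_num[-1L] + breaks_num[-nB])

print(all(is.na(mids_int)))
print(head(mids_num, 3))
```

The bar heights come from the counts per bin, not from mids, which is consistent with the histogram still being drawn.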
EDIT: pretty internally uses seq.int.
Upvotes: 4
Reputation: 347
In my environment, it does not generate any errors.
dataset <- read.csv("Time.txt", header = FALSE)
time <- strptime(dataset[,1], format = "%a %b %d %H:%M:%S %z %Y")
hist(as.numeric(time), breaks = 100)
Perhaps if you just convert time to numeric as above, the error will disappear. It is then straightforward to relabel the x-axis of the histogram.
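One possible way to do that relabelling is sketched below. It inlines the three timestamps from the question instead of reading Time.txt, assumes an English locale for parsing the day and month names, and pins the time zone to UTC; stamps, secs and ticks are names introduced for this example only.

```r
# Sample timestamps from the question (inlined instead of reading Time.txt)
stamps <- c("Tue Feb 11 12:19:39 +0000 2014",
            "Tue Feb 11 12:19:56 +0000 2014",
            "Tue Feb 11 12:20:04 +0000 2014")

# Parse, then convert to seconds since the epoch as doubles
time <- strptime(stamps, format = "%a %b %d %H:%M:%S %z %Y", tz = "UTC")
secs <- as.numeric(as.POSIXct(time))

# Draw the histogram on the numeric scale without the default x-axis
h <- hist(secs, breaks = 100, xaxt = "n",
          xlab = "Time", main = "Histogram of timestamps")

# Add an x-axis whose tick labels are formatted as clock times
ticks <- pretty(secs)
axis(1, at = ticks,
     labels = format(as.POSIXct(ticks, origin = "1970-01-01", tz = "UTC"),
                     "%H:%M:%S"))
```

Because the data passed to hist are doubles, the midpoints stored in h$mids are computed in double precision.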
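EDIT: ggplot2 should not hit this issue and is arguably simpler and more modern. Note that stat = "count" treats each distinct string as its own bar and ignores bins, so convert the column to date-times and let geom_histogram do the binning:
library(ggplot2)
ggplot(dataset, aes(x = as.POSIXct(V1, format = "%a %b %d %H:%M:%S %z %Y"))) + geom_histogram(bins = 100)
where V1 is the default name of the single column of dataset created by read.csv() with header = FALSE.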
Upvotes: 0