Reputation: 636
Setup
I sample 1,000,000 observations from the following normal mixture model and bin the observations so that each of the 10,000 bins has an equal number of observations (i.e. 100). This creates a factor level for each bin of the form (a,b], where a and b are numbers.
#Random sample
set.seed(1234)
X = ks::rnorm.mixt(n=1000000,mus=c(0.2,0.8),sigmas=c(0.04,0.01),props=c(0.95,0.05))
#Bins based on the random sample, with ~100 observations in each bin
bins = ggplot2::cut_number(X,10000)
dat = data.frame(X,bins)
Question
I would like to extract the numbers a and b from the factor (a,b]. Here is what the bins look like:
> head(table(bins))
bins
[0.00501617,0.0518875] (0.0518875,0.0594831] (0.0594831,0.0640679]
100 100 100
(0.0640679,0.0670062] (0.0670062,0.0694194] (0.0694194,0.0717924]
100 100 100
> tail(table(bins),20)
bins
(0.817766,0.818032] (0.818032,0.8183] (0.8183,0.818544] (0.818544,0.818879]
100 100 100 100
(0.818879,0.819112] (0.819112,0.819394] (0.819394,0.819664] (0.819664,0.819979]
100 100 100 100
(0.819979,0.820328] (0.820328,0.820727] (0.820727,0.821118] (0.821118,0.82158]
100 100 100 100
(0.82158,0.822109] (0.822109,0.822646] (0.822646,0.823253] (0.823253,0.82408]
100 100 100 100
(0.82408,0.825026] (0.825026,0.826417] (0.826417,0.828651] (0.828651,0.84424]
100 100 100 100
As you can see, the numbers in the factor levels don't always have the same number of digits, and they may have leading zeros after the decimal point (e.g. (0.0518875,0.0594831]).
I initially tried to extract just the numeric portion using
endpts=na.omit(as.numeric(unlist(strsplit(as.character(unlist(bins)),"[^0-9]+"))))
For the above bin ((0.0518875,0.0594831]), this procedure would output 518875 594831, but because the decimal point and the leading zeros are gone, the result could correspond to several values (e.g. 0.518875 0.594831). Furthermore, there are bins in which the two numbers have a different number of digits (e.g. (0.818032,0.8183]). This lack of uniformity in the output is giving me problems when trying to get the endpoints. Ultimately, I'd like to get the left and right endpoints. Any suggestions?
EDIT: I also looked into the code for ggplot2::cut_number, which uses the cut function. The default for the number of digits in cut is dig.lab=3, but this doesn't seem to be reflected in the above output.
Upvotes: 1
Views: 353
Reputation: 263481
Something along the lines of this lightly tested approach:
unique( as.numeric( unlist(
strsplit( gsub( "[][(]" , "", levels(bins)[1:5] ) , ","))))
I have learned to "read nested R code from the inside out". This first (1) removes the flanking "(", "[" and "]" using a character class pattern, then (2) splits on commas, (3) "vectorizes" the list structure with unlist, (4) converts to numeric, and finally (5) removes duplicates. Here it is again with line breaks for formatting:
unique( # (5)
as.numeric( # (4)
unlist( # (3)
strsplit( # (2)
gsub( "[][(]" , "", levels(bins)[1:5] ) , ",") # (1)
)))
This was tested on your example. For brevity, here is the output using only the first 5 levels:
unique( as.numeric( unlist( strsplit( gsub( "[][(]" , "", levels(bins)[1:5] ) , ","))))
[1] 0.00501617 0.05188750 0.05948310 0.06406790 0.06700620 0.06941940
I put the word "vectorizes" in quotes because that is not really the meaning of the word in R terminology, where it refers to operations that return a vector of the same length as their input.
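To make that distinction concrete, here is a minimal sketch on two made-up label strings (the input values are illustrative only, not taken from your data). It shows that unlist changes the length of its input, while as.numeric returns one element per input element:

```r
# Two illustrative strings, already stripped of their brackets
pieces <- strsplit(c("0.1,0.2", "0.3,0.4"), ",")  # a list of 2 character vectors

flat <- unlist(pieces)    # flattens the list: 4 elements from a length-2 list
nums <- as.numeric(flat)  # vectorized in the R sense: one output per input

length(pieces)
length(flat)
length(nums)
```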
Here are the results of my suggestion to keep the decimal point (period) among the characters not used as splitting criteria, compared with what my code would have delivered. You were not clear about whether you wanted just the unique values or the values for each item:
endpts= na.omit( as.numeric( unlist( strsplit( as.character( unlist(bins)),"[^0-9.]+"))))
head(endpts)
#[1] 0.216698 0.216709 0.243665 0.243682 0.201100 0.201114
end2 <- unique( as.numeric( unlist( strsplit( gsub( "[][(]" , "", levels(bins) ) , ","))))
head(end2)
#[1] 0.00501617 0.05188750 0.05948310 0.06406790 0.06700620 0.06941940
length(endpts)
#[1] 2000000
length(end2)
#[1] 10001
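If what you want is a left and a right endpoint for every bin (and for every observation), the same gsub/strsplit idea can be arranged as a lookup table. This is only a sketch under some assumptions: it uses a small simulated sample and base R's cut with quantile breaks as a stand-in for ggplot2::cut_number, and the names level_ends and obs_ends are made up for illustration. Note that the endpoints recovered this way are the rounded values printed in the labels, not the exact breaks.

```r
# Small stand-in for the real data; base R cut() with quantile breaks is used
# here in place of ggplot2::cut_number() so the sketch needs no packages.
set.seed(1)
x <- rnorm(1000)
brk <- quantile(x, probs = seq(0, 1, length.out = 11))
bins <- cut(x, breaks = brk, include.lowest = TRUE)

# Strip the brackets from each level label, then split on the comma.
parts <- strsplit(gsub("[][()]", "", levels(bins)), ",")

# One row per level: left and right endpoints as numbers.
level_ends <- data.frame(
  left  = as.numeric(vapply(parts, `[`, character(1), 1)),
  right = as.numeric(vapply(parts, `[`, character(1), 2))
)

# One row per observation: index the per-level table by the factor's
# integer codes.
obs_ends <- level_ends[as.integer(bins), ]
head(obs_ends)
```

Parsing only the levels and then indexing by as.integer(bins) avoids running the string operations on every observation, which matters with 1,000,000 rows.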
Upvotes: 3
Reputation: 2434
I think you can take advantage of the structure (a, b]. I didn't try it on the real data, but here is my attempt:
s <- c("(0.0518875,0.0594831]", "(0.818032,0.8183]")
lapply(strsplit(s, ","), function(x) gsub("\\(|]", "", x))
[[1]]
[1] "0.0518875" "0.0594831"
[[2]]
[1] "0.818032" "0.8183"
You can convert the result with as.numeric if you want numbers rather than strings.
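As a rough sketch of that last step, the split, the bracket removal and the conversion can be combined so that the result is a numeric matrix with one row per bin. The column names left and right are my own choice, and the pattern here also removes "[" because the lowest bin in the question is closed on the left:

```r
# Three labels copied from the question, including the lowest bin,
# which starts with "[" instead of "("
s <- c("[0.00501617,0.0518875]", "(0.0518875,0.0594831]", "(0.818032,0.8183]")

# Split on the comma, drop any bracket characters, convert to numeric.
# vapply() returns a 2 x n matrix, so transpose to get one row per bin.
ends <- t(vapply(
  strsplit(s, ","),
  function(x) as.numeric(gsub("[][(]", "", x)),
  numeric(2)
))
colnames(ends) <- c("left", "right")
ends
```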
Upvotes: 3