stats134711

Reputation: 636

Obtain endpoints from interval that is factor variable

Setup I sample 1,000,000 observations from the following normal mixture model and bin the observations so that each of the 10,000 bins has an equal number of observations (i.e. 100). This creates a factor whose levels have the form (a,b], where a and b are numbers.

#Random sample
set.seed(1234)
X = ks::rnorm.mixt(n=1000000,mus=c(0.2,0.8),sigmas=c(0.04,0.01),props=c(0.95,0.05))

#Bins based on random sample with ~100 observations in each bin
bins = ggplot2::cut_number(X,10000)

dat = data.frame(X,bins)

Question I would like to extract the numbers a and b from the factor (a,b]. Here is what the bins look like:

> head(table(bins))
bins
[0.00501617,0.0518875]  (0.0518875,0.0594831]  (0.0594831,0.0640679] 
                   100                    100                    100 
 (0.0640679,0.0670062]  (0.0670062,0.0694194]  (0.0694194,0.0717924] 
                   100                    100                    100 
> tail(table(bins),20)
bins
(0.817766,0.818032]   (0.818032,0.8183]   (0.8183,0.818544] (0.818544,0.818879] 
                100                 100                 100                 100 
(0.818879,0.819112] (0.819112,0.819394] (0.819394,0.819664] (0.819664,0.819979] 
                100                 100                 100                 100 
(0.819979,0.820328] (0.820328,0.820727] (0.820727,0.821118]  (0.821118,0.82158] 
                100                 100                 100                 100 
 (0.82158,0.822109] (0.822109,0.822646] (0.822646,0.823253]  (0.823253,0.82408] 
                100                 100                 100                 100 
 (0.82408,0.825026] (0.825026,0.826417] (0.826417,0.828651]  (0.828651,0.84424] 
                100                 100                 100                 100 

As you can see, the numbers in the factors don't always have the same number of digits and they may be preceded by 0's (e.g. (0.0518875,0.0594831]).

I initially tried to extract just the numeric portion using

endpts=na.omit(as.numeric(unlist(strsplit(as.character(unlist(bins)),"[^0-9]+"))))

For the above bin ((0.0518875,0.0594831]), this procedure would output 518875 594831, but because the leading zeros are gone, the result is ambiguous: it could equally correspond to other values (e.g. 0.518875 0.594831). Furthermore, there are bins in which the two numbers have different numbers of digits (e.g. (0.818032,0.8183]). This lack of uniformity in the output makes it hard to recover the endpoints. Ultimately, I'd like to get the left and right endpoints. Any suggestions?

EDIT I also looked into the code for ggplot2::cut_number, which uses the cut function. The default input in cut for the number of digits is dig.lab=3, but this doesn't seem to be reflected in the above output.

Upvotes: 1

Views: 353

Answers (2)

IRTFM

Reputation: 263481

Something along the lines of this lightly tested approach:

unique( as.numeric(  unlist( 
                 strsplit( gsub( "[][(]" , "", levels(bins)[1:5] ) , ","))))

I have learned to "read nested R code from the inside-out". This first (1) removes the flanking "(", "[" and "]" using a character class pattern, then (2) splits on commas, (3) "vectorizes" the list structure with unlist, (4) converts to numeric, and finally (5) removes duplicates. Here it is with line breaks added for formatting:

unique(                    #     (5)
  as.numeric(                  #     (4)
      unlist(                        #     (3)
            strsplit(                     #     (2)
                gsub( "[][(]" , "", levels(bins)[1:5] ) , ",") # (1)
       )))

Tested on your example, this produces the following for the first 5 levels:

unique( as.numeric(  unlist( strsplit( gsub( "[][(]" , "", levels(bins)[1:5] ) , ","))))
[1] 0.00501617 0.05188750 0.05948310 0.06406790 0.06700620 0.06941940

I put the word "vectorizes" in quotes because that is not really what the word means in R terminology, where it refers to operations that return a vector of the same length as their input.

Here are the results of my suggestion to keep the decimal point (period) out of the splitting pattern, compared with what my code would have delivered. You were not clear about whether you wanted just the unique values or the values for each item:

endpts= na.omit( as.numeric( unlist( strsplit( as.character( unlist(bins)),"[^0-9.]+"))))

 head(endpts)
#[1] 0.216698 0.216709 0.243665 0.243682 0.201100 0.201114
 end2 <- unique( as.numeric(  unlist( strsplit( gsub( "[][(]" , "", levels(bins) ) , ","))))
head(end2)
#[1] 0.00501617 0.05188750 0.05948310 0.06406790 0.06700620 0.06941940
 length(endpts)
#[1] 2000000
 length(end2)
#[1] 10001
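If what you want is the pair of endpoints for each bin rather than the pooled unique values, one possible sketch is to parse the levels once into a two-column table and then index that table with the integer codes of the factor. The names `parse_levels`, `lower`, and `upper` are made up for illustration, and the toy `x` stands in for your sample. Keep in mind that the parsed values are only as precise as the digits printed in the labels.

```r
# Hypothetical helper: parse "(a,b]" / "[a,b]" labels into numeric endpoints
parse_levels <- function(f) {
  lab   <- gsub("[][()]", "", levels(f))      # drop brackets and parentheses
  parts <- strsplit(lab, ",", fixed = TRUE)   # split "a,b" into c("a", "b")
  data.frame(
    lower = as.numeric(vapply(parts, `[`, character(1), 1)),
    upper = as.numeric(vapply(parts, `[`, character(1), 2))
  )
}

# Toy data in place of the 1,000,000-point sample (an assumption for brevity)
x    <- c(0.05, 0.15, 0.25, 0.35, 0.45, 0.55)
bins <- cut(x, breaks = c(0, 0.2, 0.4, 0.6), include.lowest = TRUE)

ends    <- parse_levels(bins)         # one row per level
per_obs <- ends[as.integer(bins), ]   # one row per observation
```

Parsing `levels(bins)` instead of `as.character(bins)` means the string work is done once per level (10,000 times in your case) rather than once per observation.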

Upvotes: 3

JasonWang

Reputation: 2434

I think you can take advantage of the structure (a, b]. I didn't try on the real data but here is my attempt:

s <- c("(0.0518875,0.0594831]", "(0.818032,0.8183]")
lapply(strsplit(s, ","), function(x) gsub("\\(|]", "", x))

[[1]]
[1] "0.0518875" "0.0594831" 

[[2]]
[1] "0.818032" "0.8183" 

You can wrap the result in as.numeric if you want numbers rather than character strings.
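As a rough sketch of that last step, the conversion can be folded into the same call. One assumption here differs from the snippet above: the pattern is widened to "[][()]" so that a closed first bin, whose label starts with "[", is handled too. The vector `s` and the column names `left` and `right` are illustrative only.

```r
s <- c("[0.00501617,0.0518875]", "(0.0518875,0.0594831]", "(0.818032,0.8183]")

# Strip brackets/parentheses, split on the comma, convert each piece to numeric
ends <- lapply(strsplit(s, ","), function(x) as.numeric(gsub("[][()]", "", x)))

# Bind the list into a matrix with one row per interval
ends <- do.call(rbind, ends)
colnames(ends) <- c("left", "right")
```

On the real data you would pass levels(bins) in place of `s`.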

Upvotes: 3
