Reputation: 10322
I hope this is not a dupe - I have searched long and hard and have found many similar questions but nothing that addresses my issue.
I have a dataframe, 1 column containing data and the other 2 columns are quantile measures I have calculated earlier.
set.seed(123)
d <- data.frame(data = 100:199, quantile1 = runif(100), quantile2 = runif(100))
head(d)
data quantile1 quantile2
1 100 0.2875775 0.5999890
2 101 0.7883051 0.3328235
3 102 0.4089769 0.4886130
4 103 0.8830174 0.9544738
5 104 0.9404673 0.4829024
6 105 0.0455565 0.8903502
I would like a smart way split the data according to quantile1 and quantile2 columns. e.g. I want the rows where quantile1 is < 0.25 and quantile2 is > 0.5. I am currently doing this with:
d[d[,2] < 0.25 & d[,3] > 0.5, ]
This works but is a bit of a hassle if I have many subsets I want to split my data into. I am looking at splitting at the following boundaries:
quantile1 0.25, 0.75 (three subsets)
and then for each subset of quantile1, further split the boundary:
quantile2 0.5 (2 subsets)
so yielding 6 subsets in total.
Thanks.
Upvotes: 2
Views: 2203
Reputation: 121568
I prefer to use plyr for splitting, because it's very eco-friendly and prepare data for future plot with ggplot2.
1) bin quatile1,quantile2 groups of similar size using cut
dat$qt1 <- cut(dat$quantile1,c(0,0.25,0.75,1),include.lowest=TRUE)
dat$qt2 <- cut(dat$quantile2,c(0,0.5,1),include.lowest=TRUE)
2) I use melt to reshape data using my discrete identifiers(qt1,qt2)
library(reshape2)
mm <- melt(dat,measure.vars='data')
Now I can play with my melted data , e.g:
acast(mm,qt1~qt2)
Aggregation function missing: defaulting to length
[0,0.5] (0.5,1]
[0,0.25] 9 18
(0.25,0.75] 26 22
(0.75,1] 18 7
or I can plot the data using ggplot:
library(ggplot2)
ggplot(data=mm)+geom_bar(aes(x=qt1,fill=qt2,group=qt2),position='dodge')
Upvotes: 1
Reputation: 17412
Someone may come up with a more elegant solution, but this is what I've done in similar situations:
> split(d, list(cut(d[,2], c(0,.25,.75,1)), cut(d[,3], c(0,.5,1)))) -> NewD
> NewD # Shows the six tables
> lapply(NewD, nrow) # Shows the name/size of each resulting data frame
$`(0,0.25].(0,0.5]`
[1] 9
$`(0.25,0.75].(0,0.5]`
[1] 27
$`(0.75,1].(0,0.5]`
[1] 14
$`(0,0.25].(0.5,1]`
[1] 17
$`(0.25,0.75].(0.5,1]`
[1] 20
$`(0.75,1].(0.5,1]`
[1] 13
The split
function creates new data frames within a list
based on the criteria in the second argument (which in this case is a list). The cut
function divides a vector into user specified intervals (or number of equally spaced intervals if you want).
You can rename these data frames with something like names(NewD) <- c("A", "B", "C", "D", "E", "F")
.
Upvotes: 2
Reputation: 193507
Try using split
and findInterval
together, perhaps something like:
dsplit <- split(d, list(findInterval(d[, "quantile1"], vec=c(0, .25, .75)),
findInterval(d[, "quantile2"], vec=c(0, .5))))
This creates a list
of 6 data.frame
s. The first three data.frame
s are those where "quantile2" is less than .5, and the second three are those where it is greater than 5.
str(dsplit)
# List of 6
# $ 1.1:'data.frame': 9 obs. of 3 variables:
# ..$ data : int [1:9] 139 140 145 146 153 155 161 190 195
# ..$ quantile1: num [1:9] 0.232 0.143 0.139 0.233 0.122 ...
# ..$ quantile2: num [1:9] 0.439 0.312 0.231 0.239 0.246 ...
# $ 2.1:'data.frame': 27 obs. of 3 variables:
# ..$ data : int [1:27] 102 108 109 111 112 121 122 124 126 127 ...
# ..$ quantile1: num [1:27] 0.409 0.551 0.457 0.453 0.678 ...
# ..$ quantile2: num [1:27] 0.4886 0.4107 0.1471 0.3012 0.0607 ...
# $ 3.1:'data.frame': 14 obs. of 3 variables:
# ..$ data : int [1:14] 101 104 115 119 123 152 157 158 164 167 ...
# ..$ quantile1: num [1:14] 0.788 0.94 0.9 0.955 0.994 ...
# ..$ quantile2: num [1:14] 0.333 0.483 0.142 0.405 0.22 ...
# $ 1.2:'data.frame': 17 obs. of 3 variables:
# ..$ data : int [1:17] 105 114 116 117 129 134 137 144 150 156 ...
# ..$ quantile1: num [1:17] 0.0456 0.1029 0.2461 0.0421 0.1471 ...
# ..$ quantile2: num [1:17] 0.89 0.721 0.549 0.954 0.69 ...
# $ 2.2:'data.frame': 20 obs. of 3 variables:
# ..$ data : int [1:20] 100 106 113 118 125 132 135 138 160 162 ...
# ..$ quantile1: num [1:20] 0.288 0.528 0.573 0.328 0.709 ...
# ..$ quantile2: num [1:20] 0.6 0.914 0.948 0.585 0.984 ...
# $ 3.2:'data.frame': 13 obs. of 3 variables:
# ..$ data : int [1:13] 103 107 110 120 130 131 133 136 149 166 ...
# ..$ quantile1: num [1:13] 0.883 0.892 0.957 0.89 0.963 ...
# ..$ quantile2: num [1:13] 0.954 0.609 0.935 0.648 0.619 ...
You can verify the desired output according to your example in your question.
dsplit[[4]]
# data quantile1 quantile2
# 6 105 0.0455564994 0.8903502
# 15 114 0.1029246827 0.7205963
# 17 116 0.2460877344 0.5492847
# 18 117 0.0420595335 0.9540912
# 30 129 0.1471136473 0.6900071
# 35 134 0.0246136845 0.5211357
# 38 137 0.2164079358 0.7862816
# 45 144 0.1524447477 0.8427293
# 51 150 0.0458311667 0.8474532
# 57 156 0.1275316502 0.5719353
# 74 173 0.0006247733 0.7465680
# 76 175 0.2201188852 0.6180179
# 80 179 0.1111354243 0.5817501
# 81 180 0.2436194727 0.8397678
# 85 184 0.1028646443 0.5943432
# 90 189 0.1750526503 0.9018744
# 98 197 0.0935949867 0.6592303
yourexample <- d[d[, 2] < 0.25 & d[,3] > 0.5, ]
identical(dsplit[[4]], yourexample)
# [1] TRUE
Upvotes: 4