Reputation: 43
I am trying to build a frequency table for one dataset ("sim") using the intervals and classes from another dataset ("obs"); both are of the same type. I tried the table() function in R, but it does not give me the frequencies of "sim" within the "obs" intervals. Some "sim" values may fall outside the range defined by "obs"; those should be omitted. Is there a simple way to get the frequency table for this case?
Here is a sample of my data (vector):
X obs sim
1 1 11.2 8.44
2 2 22.5 15.51
3 3 26.0 20.08
4 4 28.1 23.57
5 5 29.0 26.46
6 6 29.5 28.95
...etc...
Here is my code:
# Set working directory
setwd("C:/Users/...")
# vector has 2 sets of data, "obs" and "sim"
vector <- read.csv("vector.csv", fileEncoding = 'UTF-8-BOM')
# Divide the range of "obs" into intervals, using Sturges for number of classes:
factor_obs <- cut(vector$obs, breaks=nclass.Sturges(vector$obs), include.lowest = T)
# Get a frequency table using the table() function for "obs"
obs_out <- as.data.frame(table(factor_obs))
obs_out <- transform(obs_out, cumFreq = cumsum(Freq), relative = prop.table(Freq))
# Get a frequency table using the table() function for "sim", using cut from "obs"
sim_out <- as.data.frame(table(factor_obs, vector$sim > 0))
This is what I get from "obs" frequency table:
> obs_out
factor_obs Freq cumFreq relative
1 [11.1,25.6] 2 2 0.04166667
2 (25.6,40.1] 10 12 0.20833333
3 (40.1,54.5] 17 29 0.35416667
4 (54.5,69] 4 33 0.08333333
5 (69,83.4] 8 41 0.16666667
6 (83.4,97.9] 5 46 0.10416667
7 (97.9,112] 2 48 0.04166667
This is what I get from "sim" frequency table:
> sim_out
factor_obs Var2 Freq
1 [11.1,25.6] TRUE 2
2 (25.6,40.1] TRUE 10
3 (40.1,54.5] TRUE 17
4 (54.5,69] TRUE 4
5 (69,83.4] TRUE 8
6 (83.4,97.9] TRUE 5
7 (97.9,112] TRUE 2
These are the same frequencies as in the table for "obs". What I want is to count the elements of "sim" that fall in each interval defined by the classes of "obs", and to omit extreme values that lie outside the range of "obs".
It would be helpful if someone can guide me. Thanks a lot!!
Upvotes: 1
Views: 583
Reputation: 11046
You will need to define your own breakpoints, because if you let cut choose them, the values are not saved for you to reuse with the sim variable. First, use dput(vector) to put the data in a simple form for R:
vector <- structure(list(X = 1:48, obs = c(11.2, 22.5, 26, 28.1, 29, 29.5,
30.8, 32, 33.5, 35, 35.5, 38.9, 41, 41, 41, 43, 43.51, 44, 46,
48.5, 50, 50, 50, 50, 50.8, 51.5, 51.5, 53, 54.4, 55, 57.5, 59.5,
66.9, 70.6, 74.2, 75, 77, 80.2, 81.5, 82, 83, 83.6, 85, 85.1,
93.8, 94, 106.7, 112.3), sim = c(8.44, 15.51, 20.08, 23.57, 26.46,
28.95, 31.16, 33.17, 35.02, 36.75, 38.37, 39.92, 41.39, 42.81,
44.19, 45.52, 46.82, 48.09, 49.34, 50.56, 51.78, 52.98, 54.18,
55.37, 56.55, 57.75, 58.94, 60.14, 61.36, 62.59, 63.83, 65.1,
66.4, 67.74, 69.11, 70.53, 72.01, 73.55, 75.18, 76.9, 78.75,
80.76, 82.98, 85.46, 88.35, 91.84, 96.41, 103.48)), class = "data.frame",
row.names = c(NA, -48L))
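As a small illustration of what dput() does, the sketch below writes a data frame out as R code and rebuilds it from that text. It uses only the six rows shown in the question; the names sample_df, txt and rebuilt are made up for this example.

```r
# First six rows from the question's sample (not the full data set)
sample_df <- data.frame(
  X   = 1:6,
  obs = c(11.2, 22.5, 26.0, 28.1, 29.0, 29.5),
  sim = c(8.44, 15.51, 20.08, 23.57, 26.46, 28.95)
)

# dput() prints R code that recreates the object; capture it as text
txt <- capture.output(dput(sample_df))

# Evaluating that text gives back an equivalent data frame
rebuilt <- eval(parse(text = txt))
all.equal(sample_df, rebuilt)
```

Pasting the dput() output into a question lets others recreate the data without needing the original CSV file.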
Now we need the number of categories and the breakpoints:
nbreaks <- nclass.Sturges(vector$obs)
minval <- min(vector$obs)
maxval <- max(vector$obs)
int <- round((maxval - minval) / nbreaks, 3) # round to 1 digit more than obs or sim
brks <- c(minval, minval + seq(nbreaks-1) * int, maxval)
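Equal-width breakpoints like these can also be sketched with seq() and its length.out argument, which skips the rounding step. This is only a minimal illustration on the six obs values shown in the question; obs_sample and brks_alt are names made up for this example.

```r
# First six "obs" values from the question's sample (not the full data set)
obs_sample <- c(11.2, 22.5, 26.0, 28.1, 29.0, 29.5)

# Number of classes by Sturges' rule
nbreaks <- nclass.Sturges(obs_sample)

# nbreaks equal-width classes need nbreaks + 1 breakpoints from min to max
brks_alt <- seq(min(obs_sample), max(obs_sample), length.out = nbreaks + 1)

# The end points are the min and max, so every obs value lands in a class
table(cut(obs_sample, breaks = brks_alt, include.lowest = TRUE))
```

Because these breakpoints are not rounded, they can differ in the later decimals from ones built with a rounded interval width.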
The table for the obs data:
factor_obs <- cut(vector$obs, breaks=brks, include.lowest=TRUE)
obs_out <- transform(table(factor_obs), cumFreq = cumsum(Freq), relative = prop.table(Freq))
print(obs_out, digits=3)
# factor_obs Freq cumFreq relative
# 1 [11.2,25.6] 2 2 0.0417
# 2 (25.6,40.1] 10 12 0.2083
# 3 (40.1,54.5] 17 29 0.3542
# 4 (54.5,69] 4 33 0.0833
# 5 (69,83.4] 8 41 0.1667
# 6 (83.4,97.9] 5 46 0.1042
# 7 (97.9,112] 2 48 0.0417
Now the sim data:
factor_sim <- cut(vector$sim, breaks=brks, include.lowest=TRUE)
sim_out <- transform(table(factor_sim), cumFreq = cumsum(Freq), relative = prop.table(Freq))
print(sim_out, digits=3)
# factor_sim Freq cumFreq relative
# 1 [11.2,25.6] 3 3 0.0638
# 2 (25.6,40.1] 8 11 0.1702
# 3 (40.1,54.5] 11 22 0.2340
# 4 (54.5,69] 11 33 0.2340
# 5 (69,83.4] 9 42 0.1915
# 6 (83.4,97.9] 4 46 0.0851
# 7 (97.9,112] 1 47 0.0213
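Notice that only 47 cases are shown instead of 48, since one sim value (8.44) is less than the minimum of obs. cut returns NA for it and table drops NA by default. A cross-tabulation with useNA="ifany" shows it in the <NA> column: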
addmargins(table(factor_obs, factor_sim, useNA="ifany"))
# factor_sim
# factor_obs [11.2,25.6] (25.6,40.1] (40.1,54.5] (54.5,69] (69,83.4] (83.4,97.9] (97.9,112] <NA> Sum
# [11.2,25.6] 1 0 0 0 0 0 0 1 2
# (25.6,40.1] 2 8 0 0 0 0 0 0 10
# (40.1,54.5] 0 0 11 6 0 0 0 0 17
# (54.5,69] 0 0 0 4 0 0 0 0 4
# (69,83.4] 0 0 0 1 7 0 0 0 8
# (83.4,97.9] 0 0 0 0 2 3 0 0 5
# (97.9,112] 0 0 0 0 0 1 1 0 2
# Sum 3 8 11 11 9 4 1 1 48
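Since the same steps are applied to both variables, they could be wrapped in a small helper. The sketch below is one possible way to do that, not part of the code above; freq_table and the *_sample names are hypothetical, and it uses only the six rows shown in the question.

```r
# Hypothetical helper: frequency table of x for a fixed set of breakpoints
freq_table <- function(x, brks) {
  # Values outside [min(brks), max(brks)] become NA
  f <- cut(x, breaks = brks, include.lowest = TRUE)
  # table() drops NA by default, so out-of-range values are omitted
  out <- as.data.frame(table(f))
  transform(out, cumFreq = cumsum(Freq), relative = prop.table(Freq))
}

# First six rows from the question's sample (not the full data set)
obs_sample <- c(11.2, 22.5, 26.0, 28.1, 29.0, 29.5)
sim_sample <- c(8.44, 15.51, 20.08, 23.57, 26.46, 28.95)

# Breakpoints come from obs only, then get reused for sim
brks_sample <- seq(min(obs_sample), max(obs_sample),
                   length.out = nclass.Sturges(obs_sample) + 1)

obs_tab <- freq_table(obs_sample, brks_sample)
sim_tab <- freq_table(sim_sample, brks_sample)

# Number of sim values omitted because they fall outside the obs range
n_omitted <- sum(is.na(cut(sim_sample, breaks = brks_sample, include.lowest = TRUE)))
```

Note that relative is computed over the values that fall inside the obs range, so for sim it is a proportion of the retained cases rather than of all cases.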
Upvotes: 1