Reputation: 355
I would like to enter a frequency table into an R data.table.
The data are in a format like this:
        Height
Gender    3   35
m       173  125
f       323  198
... where the entries in the table (173, 125, etc.) are counts.
I have a 2 by 2 table, and I want to turn it into a two-column data.table.
The data is from a study of birds who nest at a height. The question is whether different genders of the bird prefer certain heights.
I thought the frequency table should be turned into something like this:
Gender height N
m 3 173
m 35 125
f 3 323
f 35 198
but now I'm not so sure. Some of the models I want to run need every case itemized.
Can I do this conversion in R? Ideally, I'd like a way to switch back and forth between the two formats.
Upvotes: 0
Views: 1367
Reputation: 10841
This is in the form of a contingency table. It isn't easy to enter directly into R but it can be done as follows (based on http://cyclismo.org/tutorial/R/tables.html):
> f <- matrix(c(173,125,323,198),nrow=2,byrow=TRUE)
> colnames(f) <- c(3,35)
> rownames(f) <- c("m","f")
> f <- as.table(f)
> f
3 35
m 173 125
f 323 198
You can then create a count or frequency table with:
> as.data.frame(f)
Var1 Var2 Freq
1 m 3 173
2 f 3 323
3 m 35 125
4 f 35 198
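If the Var1/Var2 column names get in the way, one option is to name the dimensions when the matrix is built. This is only a sketch: the dimension names Gender and Height are my own choice, not something from the tutorial linked above.

```r
# Build the 2x2 table with named dimensions (the names are an assumption)
f <- matrix(c(173, 125, 323, 198), nrow = 2, byrow = TRUE,
            dimnames = list(Gender = c("m", "f"), Height = c("3", "35")))
f <- as.table(f)

# as.data.frame() on a table reuses the dimension names as column names
counts <- as.data.frame(f)
print(counts)
```

The counts still land in a column called Freq, which is what the conversion function further down expects by default.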
The Cookbook for R gives a short function to convert to a table of cases (i.e. a long list of the individual items), as follows:
> countsToCases(as.data.frame(f))
... where:
# Convert from data frame of counts to data frame of cases.
# `countcol` is the name of the column containing the counts
countsToCases <- function(x, countcol = "Freq") {
    # Get the row indices to pull from x
    idx <- rep.int(seq_len(nrow(x)), x[[countcol]])
    # Drop count column
    x[[countcol]] <- NULL
    # Get the rows from x
    x[idx, ]
}
... thus you can convert the data to the format needed by any analysis method from any starting format.
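To go the other way (cases back to a table, and from there back to counts), base R's table() and as.data.frame() should be enough. A minimal sketch, assuming the column names Gender, Height and Freq, which are my own labels:

```r
# Counts in long form, one row per cell of the 2x2 table
counts <- data.frame(
  Gender = c("m", "f", "m", "f"),
  Height = c("3", "3", "35", "35"),
  Freq   = c(173, 323, 125, 198)
)

# Counts -> cases: repeat each row Freq times and drop the count column
cases <- counts[rep(seq_len(nrow(counts)), counts$Freq), c("Gender", "Height")]

# Cases -> contingency table
tab <- table(cases)

# Contingency table -> counts again
counts_again <- as.data.frame(tab)
print(counts_again)
```

Note that table() sorts the levels alphabetically, so the rows of counts_again may come back in a different order than they went in.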
(EDIT)
Another way to read in the contingency table is to start with text like this:
> ss <- " 3 35
+ m 173 125
+ f 323 198"
> read.table(text=ss,row.names=1)
X3 X35
m 173 125
f 323 198
Instead of using text =, you can also use a file name to read the table from (for example) a CSV file.
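Here is a rough sketch of the file-based version. The temporary file stands in for a real data file, and check.names = FALSE is my addition to keep the column names as 3 and 35 rather than X3 and X35. For a comma-separated file you would also need sep = ",".

```r
# Write the table to a temporary file (a stand-in for a real data file)
path <- tempfile(fileext = ".txt")
writeLines(c("3 35", "m 173 125", "f 323 198"), path)

# Read it back; the first column becomes the row names
d <- read.table(path, header = TRUE, row.names = 1, check.names = FALSE)

# Turn the data frame into a contingency table
tab <- as.table(as.matrix(d))
print(tab)
```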
Upvotes: 0
Reputation: 355
Thanks, everybody (@simon and @Elin) for the help. I thought I was conducting a poll that would get answers like "start with the 4-row version" or "start with the 819-row version" and you all have given me an entire toolbox of ways to move from one to the other. It's really great, informative, and way more than the inquiry deserves.
I unquestionably need to work harder and get more explicit in forming a question. I see by the -3 rating this boondoggle has earned that I'm not adding anything to the knowledge base, so I will delete the question in order to keep future searchers from finding this. I've had a bad run recently with my questions, and as a former teacher of the year, writer of five books, and PhD statistician, it's extremely embarrassing to have been on Stack Exchange for as long as I have, and stand here with one reputation point. One. That means that my upvotes of your answers don't count for a thing.
That reputation point should be scarlet colored.
Here's what I was getting at: In a book, a common way to express data is in a 2×2 table:
        Height
Gender    3   35
M       173  125
F       323  198
My tic-tac-sized mind sees two ways of entering that into a data table:
library(data.table)
GENDER <- c("m", "m", "f", "f")
HEIGHT <- c(3, 35, 3, 35)
N <- c(173, 125, 323, 198)
# One row per cell of the 2x2 table, with the count in N
SANDFLIERS <- data.table(GENDER, HEIGHT, N)
That gives the four-line flat-file/tidy representation of the data:
GENDER HEIGHT N
1: m 3 173
2: m 35 125
3: f 3 323
4: f 35 198
The other option is to make an 819-row data table with 173 male@3ft, 125 male@35ft, etc. It's not too bad if you use the rep() command and build your table columns carefully. I hate doing arithmetic, so I leave some of these numbers bare and untotaled.
# I need 173+125 males, and 323+198 females.
# One c(rep()) for "m", one c(rep()) for "f", and one c() to merge them
gender <- c(c(rep("m", 173+125)), c(rep("f", 323+198)))
# Same here, except the c() functions are one level 'deeper'. I need two
# sets for males (at heights 3 and 35, 173 and 125 of each, respectively)
# and two sets for females (at heights 3 and 35, 323 and 198 respectively)
heights <- c(c(c(rep(3, 173)), c(rep(35, 125))), c(c(rep(3, 323)), c(rep(35, 198))))
# Merge the two columns (sandfliers is the name used in the plot call further down)
sandfliers <- data.table(gender, heights)
which gives a data.table with 819 rows, one for each observed bird.
1: m 3
2: m 3
3: m 3
4: m 3
5: m 3
---
815: f 35
816: f 35
817: f 35
818: f 35
819: f 35
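For what it's worth, rep() also accepts a vector of counts in its times argument, which avoids both the nested c() calls and the arithmetic. A sketch in base R, reusing the GENDER, HEIGHT and N vectors from the 4-row version (the name birds is made up for this example, and a data.table built from the same two columns should behave the same way):

```r
# The four cells of the table, in the same order as the 4-row version
GENDER <- c("m", "m", "f", "f")
HEIGHT <- c(3, 35, 3, 35)
N      <- c(173, 125, 323, 198)

# With a vector of times, rep() repeats each element that many times
gender  <- rep(GENDER, times = N)
heights <- rep(HEIGHT, times = N)

# One row per observed bird
birds <- data.frame(gender, heights)

# Cross-tabulating the long version recovers the original counts
print(table(birds))
```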
Now that I have the data in two formats, I start looking for ways to do plots and analyses.
I can get a mosaic plot using the 819-row version, but you can't see it because of my 1-point reputation
mosaicplot(table(sandfliers), color=TRUE)
and you can get a balloon plot using the 4-row version
So my question was, for those of you with lots and lots of experience with this sort of thing, do you find the 4-row or the 819-row tables more common? I can change from one to the other, but that's more code to add to the book (again I hear my editor, "You're teaching statistics, not R").
So, as I said at the top, this was just an informal poll on whether one is used more often than the other, or whether beginners are better off with one.
Upvotes: 1
Reputation: 6770
This is based on a review of ?table.
Make a data frame (x) with columns for Gender, Height, and Freq, where Freq is your N value.
Convert that to a table by using
tabledata <- xtabs(Freq ~ ., x)
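Spelled out for the bird data, that might look like the sketch below. The values come from the question, while the choice of margin.table() and prop.table() as follow-up calls is just my illustration of base functions that take a table.

```r
# The 4-row version: one row per cell, counts in Freq
x <- data.frame(
  Gender = c("m", "m", "f", "f"),
  Height = c(3, 35, 3, 35),
  Freq   = c(173, 125, 323, 198)
)

# Freq ~ . cross-classifies by every other column (here Gender and Height)
tabledata <- xtabs(Freq ~ ., x)
print(tabledata)

# Base functions accept the table directly
print(margin.table(tabledata, 1))  # totals per gender
print(prop.table(tabledata, 1))    # share of each height within a gender
```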
There are a number of base functions that can work with this kind of data, which is obviously much more compact than individual rows.
The ?loglin help page also has this example, which takes a table directly:
loglin(HairEyeColor, list(c(1, 2), c(1, 3), c(2, 3)))
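For the 2 by 2 bird table, the analogous call would fit the independence model (gender and height margins only). This is a sketch under my own assumptions about what you want to test; chisq.test() without the continuity correction is included only as a cross-check on the Pearson statistic.

```r
# The bird counts as a table with named dimensions
tab <- as.table(matrix(c(173, 125, 323, 198), nrow = 2, byrow = TRUE,
                       dimnames = list(Gender = c("m", "f"),
                                       Height = c("3", "35"))))

# Independence model: fit the Gender margin and the Height margin only
fit <- loglin(tab, margin = list(1, 2), print = FALSE)

# The uncorrected chi-squared test uses the same expected counts
chi <- chisq.test(tab, correct = FALSE)

print(c(loglin_pearson = fit$pearson, chisq = unname(chi$statistic), df = fit$df))
```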
Upvotes: 1