Reputation: 6755
Not sure how to domesticate ddply here by summarising my gender counts for countries. I have this data frame:
df <- data.frame(country = c("Italy", "Germany", "Italy", "USA","Poland"),
gender = c("male", "female", "male", "female", "female"))
And I want a data frame where each row gives the number of males and females in each country. Yet
ddply(df,~country,table)
country female male
1 Germany 1 0
2 Germany 0 0
3 Germany 0 0
4 Germany 0 0
5 Italy 0 0
6 Italy 0 2
7 Italy 0 0
8 Italy 0 0
9 Poland 0 0
10 Poland 0 0
11 Poland 1 0
12 Poland 0 0
13 USA 0 0
14 USA 0 0
15 USA 0 0
16 USA 1 0
Although it produces the desired counts, it also adds three extra lines for each group. Why?
Upvotes: 0
Views: 112
Reputation: 99361
Since you're already using plyr, why not use the count function?
> library(plyr)
> count(df)
# country gender freq
# 1 Germany female 1
# 2 Italy male 2
# 3 Poland female 1
# 4 USA female 1
Or in base R, a table
> ( tb <- table(df) )
# gender
# country female male
# Germany 1 0
# Italy 0 2
# Poland 1 0
# USA 1 0
ADDED: Per the OP's comment below, to turn the above table into a data frame, you can work with its attributes directly.
> as.data.frame(cbind(country = rownames(tb), unclass(tb)),
row.names = "NULL")
# country female male
# 1 Germany 1 0
# 2 Italy 0 2
# 3 Poland 1 0
# 4 USA 1 0
Upvotes: 0
Reputation: 1568
It looks like you wanted simply
as.data.frame.matrix(table(df))
Thanks to: How to convert a table to a data frame
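Note that as.data.frame.matrix keeps the countries as row names rather than as a column. A minimal sketch of one way to move them into a country column while keeping the counts as integers (the names wide and out are only for illustration):

```r
df <- data.frame(country = c("Italy", "Germany", "Italy", "USA", "Poland"),
                 gender = c("male", "female", "male", "female", "female"))

# One row per country, one column per gender; countries end up as row names
wide <- as.data.frame.matrix(table(df))

# Copy the row names into a real column and reset the row names to 1..n
out <- data.frame(country = rownames(wide), wide, row.names = NULL)
out
```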
But to answer your question about why you got the output you did...
table is based on factor levels, not on the values in your vector. So if you run
df[df$country=="Germany",]$country
[1] Germany
Levels: Germany Italy Poland USA
You can see that after subsetting, the country vector still has all four levels, but only one value. Then when you run table, it summarizes for each of those levels, even if they are not in the vector.
table(df[df$country=="Germany",])
gender
country female male
Germany 1 0
Italy 0 0
Poland 0 0
USA 0 0
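A small sketch of the same effect, assuming country is stored as a factor (the default for data.frame before R 4.0; stringsAsFactors = TRUE is set explicitly here so it behaves the same in newer versions). Base R's droplevels discards the unused levels, which removes the all-zero rows:

```r
df <- data.frame(country = c("Italy", "Germany", "Italy", "USA", "Poland"),
                 gender = c("male", "female", "male", "female", "female"),
                 stringsAsFactors = TRUE)

germany <- df[df$country == "Germany", ]

# The subset has one row but still carries all four country levels
nlevels(germany$country)

# droplevels() removes levels that no longer occur, in every factor column
germany_clean <- droplevels(germany)
nlevels(germany_clean$country)

# The table now has a single row and a single column
table(germany_clean)
```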
When debugging ddply, always try out your function on one of the subsets it will create from your data.
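One way to get at those subsets, sketched with base R's split. This only approximates the pieces ddply builds internally, and it assumes the columns are factors, as in the output above (pieces is just an illustrative name):

```r
df <- data.frame(country = c("Italy", "Germany", "Italy", "USA", "Poland"),
                 gender = c("male", "female", "male", "female", "female"),
                 stringsAsFactors = TRUE)

# One data frame per country, roughly what the function receives from ddply
pieces <- split(df, df$country)
names(pieces)

# Run the function on a single piece to see what it returns for one group
table(pieces[["Germany"]])
```

The table for the Germany piece has four rows, one per factor level, which is where the extra rows in the ddply result come from.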
Upvotes: 0
Reputation: 6755
I found this solution. Not sure it is the most elegant.
df <- data.frame(country = c("Italy", "Germany", "Italy", "USA","Poland"),
gender = c("male", "female", "male", "female", NA))
ddply(df, .(country), summarise,
female=sum(gender=="female",na.rm = TRUE),
male=sum(gender=="male", na.rm = TRUE),
na=sum(is.na(gender)))
Upvotes: 0