CptNemo

Reputation: 6755

Using base::table as argument of plyr::ddply

I'm not sure how to get ddply to summarise gender counts by country. I have this data frame:

df <- data.frame(country = c("Italy", "Germany", "Italy", "USA","Poland"),
                 gender = c("male", "female", "male", "female", "female"))

I want a data frame where each row gives the number of males and females for one country. However,

ddply(df,~country,table)

   country female male
1  Germany      1    0
2  Germany      0    0
3  Germany      0    0
4  Germany      0    0
5    Italy      0    0
6    Italy      0    2
7    Italy      0    0
8    Italy      0    0
9   Poland      0    0
10  Poland      0    0
11  Poland      1    0
12  Poland      0    0
13     USA      0    0
14     USA      0    0
15     USA      0    0
16     USA      1    0

Although this produces the desired counts, it also adds three extra rows for each group. Why?

Upvotes: 0

Views: 112

Answers (3)

Rich Scriven

Reputation: 99361

Since you're already in plyr, why not use the count function?

> library(plyr)
> count(df)
#   country gender freq
# 1 Germany female    1
# 2   Italy   male    2
# 3  Poland female    1
# 4     USA female    1

Or, in base R, use table:

> ( tb <- table(df) )
#          gender
# country   female male
#   Germany      1    0
#   Italy        0    2
#   Poland       1    0
#   USA          1    0

ADDED: Per OP's comment below, to turn the above table into a data frame, you can strip its class and bind the row names on as a column.

> as.data.frame(cbind(country = rownames(tb), unclass(tb)),
                row.names = "NULL")
#   country female male
# 1 Germany      1    0
# 2   Italy      0    2
# 3  Poland      1    0
# 4     USA      1    0

Upvotes: 0

waternova

Reputation: 1568

It looks like you wanted simply

as.data.frame.matrix(table(df))

Thanks to: How to convert a table to a data frame
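
A short sketch of what that call returns, using the data from the question. The variable names `tb`, `wide`, and `long` are only illustrative. Calling the matrix method directly matters here, because plain as.data.frame on a table gives one row per country/gender combination instead.

```r
# Data from the question
df <- data.frame(country = c("Italy", "Germany", "Italy", "USA", "Poland"),
                 gender  = c("male", "female", "male", "female", "female"))

tb <- table(df)                    # 4 x 2 contingency table

# Wide layout: one row per country, one column per gender.
# The countries end up as row names.
wide <- as.data.frame.matrix(tb)

# Move the row names into an ordinary column
wide$country <- rownames(wide)
rownames(wide) <- NULL
wide <- wide[, c("country", "female", "male")]

# For contrast, the default method for tables gives a long layout
# with columns country, gender, Freq
long <- as.data.frame(tb)
```

The counts stay integer in `wide`, so they can be summed or compared without conversion.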

But to answer your question about why you got the output you did...

table counts every factor level, not just the values actually present in your vector. So if you run

df[df$country=="Germany",]$country

[1] Germany
Levels: Germany Italy Poland USA

You can see that after subsetting, the country vector still has all four levels, but only one value. Then when you run table, it summarizes for each of those levels, even if they are not in the vector.

table(df[df$country=="Germany",])

         gender
country   female male
  Germany      1    0
  Italy        0    0
  Poland       0    0
  USA          0    0

When debugging ddply, always try out your function on one of the subsets it will create from your data.
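
To see this in isolation, here is a minimal base R sketch. It assumes the columns are factors, which is requested explicitly with stringsAsFactors = TRUE because R 4.0 and later no longer convert strings to factors by default. The names `piece`, `piece_clean`, and `gender_counts` are illustrative.

```r
# Data from the question, with strings stored as factors
df <- data.frame(country = c("Italy", "Germany", "Italy", "USA", "Poland"),
                 gender  = c("male", "female", "male", "female", "female"),
                 stringsAsFactors = TRUE)

# One of the pieces ddply would hand to the function
piece <- df[df$country == "Germany", ]

nlevels(piece$country)     # the subset still carries all four country levels
nrow(table(piece))         # so table() returns one row per level

# Dropping unused levels leaves only the country that is present.
# Note that droplevels() also drops the unused gender level.
piece_clean <- droplevels(piece)
nrow(table(piece_clean))

# Tabulating only the gender column keeps both gender levels
# and avoids the extra country rows
gender_counts <- table(piece$gender)
```

Tabulating `piece$gender` alone is usually closer to what the question wants, since the grouping variable is already handled by the split.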

Upvotes: 0

CptNemo

Reputation: 6755

I found this solution, though I'm not sure it is the most elegant.

df <- data.frame(country = c("Italy", "Germany", "Italy", "USA","Poland"),
                     gender = c("male", "female", "male", "female", NA))

ddply(df, .(country), summarise, 
      female=sum(gender=="female",na.rm = TRUE),
      male=sum(gender=="male", na.rm = TRUE),
      na=sum(is.na(gender)))
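
For comparison, base R's table can count the missing values as well, through its useNA argument. This is an alternative sketch rather than the approach above; `counts` and `na_col` are illustrative names.

```r
# Same data as above, including the missing gender for Poland
df <- data.frame(country = c("Italy", "Germany", "Italy", "USA", "Poland"),
                 gender  = c("male", "female", "male", "female", NA))

# useNA = "ifany" adds an NA column only when missing values occur
counts <- table(country = df$country, gender = df$gender, useNA = "ifany")

# The NA column has a missing name, so locate it by position
na_col <- is.na(colnames(counts))

counts["Italy", "male"]     # males in Italy
counts["Poland", na_col]    # missing genders in Poland
```

Unlike the summarise call, this does not require naming each gender value in advance.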

Upvotes: 0
