Stefan S
Stefan S

Reputation: 647

Calculate the percentages of a column in a data frame - "grouped" by column

I am an R beginner and came to the point, where I need the possibility to calculate percentages of values in a data frame, but "grouped" by an other column value, quite often.

I have a data frame with around 1000 rows, containing mediatype, version, collection (= year) and count (for this year). I can filter them, to get only a specific mediatye:

trSpdf <- trS[trS$Mediatype == 'application/pdf',]

and get the following exemplary output:

> trSpdf 

        Mediatype Version Collection      Count
39 application/pdf      -1     co2008         2.0
40 application/pdf      -1     co2009         5.0
43 application/pdf       1     co2008         1.0
44 application/pdf       1     co2009         1.0
48 application/pdf     1.1     co2008        16.0
52 application/pdf     1.2     co2008        20.0
53 application/pdf     1.2     co2009        90.0
... (continuing) ...

What I want, is to calculate the percentage of each version for each collection (= year) compared to all versions in this collection, so for this example the result should be:

5.12% of all versions in co2008 were version -1 (2.0 / total sum for co2008)
2.56% of all versions in co2008 were version 1 (1.0 / total sum for co2008)
...
93,75% of all versions in co2009 were version 1.2 (90.0 / total sum for co2009)
...

Thanks in advance for any answers on how I could solve this.

Upvotes: 1

Views: 4632

Answers (2)

flodel
flodel

Reputation: 89097

First, use ave to add a column giving the total count per Mediatype and Collection:

trS <- transform(trS, Tot.Count = ave(Count, Mediatype, Collection, FUN = sum))

Then, it is easy pretty obvious how to compute the percentage:

trS <- transform(trS, percentage = 100 * Count/Tot.Count)

Or if you want it nicely formatted (e.g. "5.13%") then use sprintf:

trS <- transform(trS, percentage = paste0(sprintf("%.2f", 100 * Count/Tot.Count),
                                          "%"))

Upvotes: 1

Thomas
Thomas

Reputation: 44555

You can do something like the following:

with(trSpdf, by(Version, Collection, FUN= function(x) round(prop.table(table(x))*100,2)))

You could change FUN to be whatever you wanted your output to look like.

EDIT: Try this:

yearsums <- with(trSpdf, tapply(Count, Collection, sum))
mapply(FUN = function(x,y) x/yearsums[as.character(y)], trSpdf$Count, trSpdf$Collection)

I'm sure there is a better way, though.

Upvotes: 1

Related Questions