Calculating the percent occurrence of multiple table columns in r

Question

I have a table that looks as follows - A number of businesses, a number of practices (P1 - P4) that they might engage in, and a column called 'Peer' which is a grouping column (yes - peer, no - no peer). All practice columns are yes/no:

Business    P1  P2  P3  P4  Peer
a   y   n   y   n   y
b   n   n   y   n   n
c   n   y   y   n   y
d   y   y   y   y   n
e   n   n   n   y   y
f   n   n   n   y   n
g   y   y   n   n   y
h   n   y   n   n   n
i   n   n   y   n   y
j   y   n   y   y   n
k   n   y   y   y   n
l   n   y   y   y   n
m   y   n   n   n   n
n   n   n   n   n   y
o   n   y   n   n   y
p   y   y   n   y   n
q   n   n   y   y   y
r   n   n   y   y   n
s   y   y   y   n   y
t   n   y   y   n   y
u   n   n   n   n   y
v   y   n   n   y   n
w   n   y   n   y   n
x   n   y   n   y   y
y   y   n   y   n   n
z   n   n   y   n   y

What I want to achieve from this is a summary table that shows the following, where percentage is the number of businesses which have taken up the practice (practice == "y"):

Peer    Practice    percentage
y   P1  30
y   P2  40
y   P3  33
y   P4  90
n   P1  20
n   P2  30
n   P3  11
n   P4  75

Note I have not calculated the percentages in the second table, they are made up.

While I think this should be achievable with data.table (this question is particularly relevant) I can only work out what seems to me an over-elaborate way to create this table. Example for P1:

Firstly, calculate the percentage occurrences in each category for this practice:

P1Stats <- setDT(dtUptake)[order(Peer), .(P1 = unique(P1), percentage = 100 * tabulate(P1)/.N), by = Peer]

   Peer P1 percentage
1:    n  n   53.84615
2:    n  y   46.15385
3:    y  y   76.92308
4:    y  n   23.07692

From the results, only select the rows where the uptake of P1 is "y":

P1Stats <- P1Stats[P1=="y"]


   Peer P1 percentage
1:    n  y   46.15385
2:    y  y   76.92308

Change the name of the column P1 to Practice

P1Stats <- P1Stats[, .(Peer, Practice = P1, percentage)]

   Peer Practice percentage
1:    n        y   46.15385
2:    y        y   76.92308

Replace the y value in the Practice column with the practice name P1:

P1Stats[, Practice := "P1"]

   Peer Practice percentage
1:    n       P1   46.15385
2:    y       P1   76.92308

Then, if I carry this out for each of my practices I can create the final table by pulling them all together. I understand that I can amalgamate the three top steps into one, and then do the update process after that:

P1Stats <- setDT(dtUptake)[order(Peer), .(P1 = unique(P1), percentage = 100 * tabulate(P1)/.N), by = Peer][P1=="y"][, .(Peer, Practice = P1, percentage)]

P1Stats[, Practice := "P1"]

However, this is still not very elegant. I would like to know if there is a way to make this process more streamlined?

IBrum · Accepted Answer

Using melt and data.table's by parameter, you can achieve that. For instance, supposing your original data.table is in variable dt here:

dt2 = melt.data.table(dt, id.vars=c('Business', 'Peer'), variable.name='practice')
dt2[,.(percentage=sum(value=='y')/.N*100), by=c('Peer','practice')]

The by operation will group on values of a column, or set of columns. As the original data has those values that are intended for grouping (P1,P2,P3,P4) not in column values, but spread in different column names, the long format produced by melt is the way to go.

Calculating the percent occurrence of multiple table columns in r

Answers (1)

Related Questions