Reputation: 823
I have a table that looks as follows - A number of businesses, a number of practices (P1 - P4) that they might engage in, and a column called 'Peer' which is a grouping column (yes - peer, no - no peer). All practice columns are yes/no:
Business P1 P2 P3 P4 Peer
a y n y n y
b n n y n n
c n y y n y
d y y y y n
e n n n y y
f n n n y n
g y y n n y
h n y n n n
i n n y n y
j y n y y n
k n y y y n
l n y y y n
m y n n n n
n n n n n y
o n y n n y
p y y n y n
q n n y y y
r n n y y n
s y y y n y
t n y y n y
u n n n n y
v y n n y n
w n y n y n
x n y n y y
y y n y n n
z n n y n y
What I want to achieve from this is a summary table that shows the following, where percentage
is the number of businesses which have taken up the practice (practice == "y"):
Peer Practice percentage
y P1 30
y P2 40
y P3 33
y P4 90
n P1 20
n P2 30
n P3 11
n P4 75
Note I have not calculated the percentages in the second table, they are made up.
While I think this should be achievable with data.table (this question is particularly relevant) I can only work out what seems to me an over-elaborate way to create this table. Example for P1:
Firstly, calculate the percentage occurrences in each category for this practice:
P1Stats <- setDT(dtUptake)[order(Peer), .(P1 = unique(P1), percentage = 100 * tabulate(P1)/.N), by = Peer]
Peer P1 percentage
1: n n 53.84615
2: n y 46.15385
3: y y 76.92308
4: y n 23.07692
From the results, only select the rows where the uptake of P1
is "y":
P1Stats <- P1Stats[P1=="y"]
Peer P1 percentage
1: n y 46.15385
2: y y 76.92308
Change the name of the column P1
to Practice
P1Stats <- P1Stats[, .(Peer, Practice = P1, percentage)]
Peer Practice percentage
1: n y 46.15385
2: y y 76.92308
Replace the y
value in the Practice
column with the practice name P1
:
P1Stats[, Practice := "P1"]
Peer Practice percentage
1: n P1 46.15385
2: y P1 76.92308
Then, if I carry this out for each of my practices I can create the final table by pulling them all together. I understand that I can amalgamate the three top steps into one, and then do the update process after that:
P1Stats <- setDT(dtUptake)[order(Peer), .(P1 = unique(P1), percentage = 100 * tabulate(P1)/.N), by = Peer][P1=="y"][, .(Peer, Practice = P1, percentage)]
P1Stats[, Practice := "P1"]
However, this is still not very elegant. I would like to know if there is a way to make this process more streamlined?
Upvotes: 1
Views: 347
Reputation: 345
Using melt
and data.table
's by
parameter, you can achieve that. For instance, supposing your original data.table
is in variable dt
here:
dt2 = melt.data.table(dt, id.vars=c('Business', 'Peer'), variable.name='practice')
dt2[,.(percentage=sum(value=='y')/.N*100), by=c('Peer','practice')]
The by
operation will group on values of a column, or set of columns. As the original data has those values that are intended for grouping (P1,P2,P3,P4) not in column values, but spread in different column names, the long format produced by melt
is the way to go.
Upvotes: 3