Reputation: 329
recently i've been really struggling with this , i thought maybe someone can help me with it , here is the problem:
I have a dataframe that represent what a client listen (music) , one user_key = one client , one client can have many rows . I have many columns like the date of stream , the genre the client listened , the album name .... and a column named TOTAL_LISTENED that represent the amount of time this client listened the album , on which app etc .
1000×9 DataFrame. Omitted printing of 2 columns
│ Row │ USER_KEY │ STREAM_DATE │ GENRE_MAIN │ ALBUM_NAME │ ALBUM_ARTIST_NAME │ APP_SHORT_NAME │ USER_TRIAL_STATUS │
│ │ String │ Dates.Date │ String │ String │ String │ String │ Bool │
├──────┼──────────────────────────────────┼─────────────┼───────────────────┼──────────────────────────────────────────────────────┼────────────────────────────────┼────────────────┼───────────────────┤
│ 1 │ 0000e19d6a5608e3787e5bfea98488ca │ 2021-01-16 │ CLASSICAL │ The Liszt Collection │ VARIOUS ARTISTS │ app_1 │ 0 │
│ 2 │ 0000e19d6a5608e3787e5bfea98488ca │ 2021-01-19 │ CLASSICAL │ Schumann: Music for Clarinet │ PATRICK MESSINA │ app_1 │ 0 │
│ 3 │ 0000e19d6a5608e3787e5bfea98488ca │ 2021-01-16 │ CLASSICAL │ Schumann: Music for Clarinet │ PATRICK MESSINA │ app_1 │ 0 │
│ 4 │ 0000e19d6a5608e3787e5bfea98488ca │ 2020-12-28 │ JAZZ │ Turn Up The Quiet │ DIANA KRALL │ app_2 │ 0 │
│ 5 │ 0000e19d6a5608e3787e5bfea98488ca │ 2021-01-03 │ CLASSICAL │ Johann Sebastian Bach │ VÍKINGUR ÓLAFSSON │ app_1 │ 0 │
│ 6 │ 0000e19d6a5608e3787e5bfea98488ca │ 2021-01-20 │ CLASSICAL │ Barbara Bonney - The Radiant Voice of Barbara Bonney │ BARBARA BONNEY │ app_1 │ 0 │
│ 7 │ 0000e19d6a5608e3787e5bfea98488ca │ 2020-12-22 │ CLASSICAL │ The Liszt Collection │ VARIOUS ARTISTS │ app_1 │ 0 │
⋮
│ 993 │ 000a149099a55a6e74fd253ab3f5709d │ 2020-12-30 │ CLASSICAL │ Handel: Concerti grossi, Op. 6 Nos. 1-6 │ AKADEMIE FÜR ALTE MUSIK BERLIN │ app_2 │ 0 │
│ 994 │ 000a149099a55a6e74fd253ab3f5709d │ 2021-01-01 │ SOUL / FUNK / R&B │ Gold - 20 Super Hits (International) │ BONEY M. │ app_2 │ 0 │
│ 995 │ 000a149099a55a6e74fd253ab3f5709d │ 2020-12-25 │ SOUL / FUNK / R&B │ Gold - 20 Super Hits (International) │ BONEY M. │ app_2 │ 0 │
│ 996 │ 000a149099a55a6e74fd253ab3f5709d │ 2021-01-08 │ CLASSICAL │ Vivaldi: Concertos For Two Violins │ VIKTORIA MULLOVA │ app_2 │ 0 │
│ 997 │ 000a149099a55a6e74fd253ab3f5709d │ 2021-01-09 │ CLASSICAL │ Wagner: Tannhäuser │ GIUSEPPE SINOPOLI │ app_6 │ 0 │
│ 998 │ 000a149099a55a6e74fd253ab3f5709d │ 2021-01-17 │ CLASSICAL │ Cyrillus Kreek - The Suspended Harp of Babel │ VOX CLAMANTIS │ app_2 │ 0 │
│ 999 │ 000a149099a55a6e74fd253ab3f5709d │ 2020-12-25 │ REGGAE │ Exodus 30th Anniversary Edition │ BOB MARLEY & THE WAILERS │ app_2 │ 0 │
│ 1000 │ 000a149099a55a6e74fd253ab3f5709d │ 2021-01-08 │ CLASSICAL │ Double concertos │ SIMON STANDAGE │ app_2 │ 0 │
total_listened didnt print here so i put it just under
df[:TOTAL_LISTENED]
1000-element Array{Float64,1}:
4161.0
3909.0
3465.0
2172.0
2040.0
1742.0
1666.0
1646.0
1513.0
1450.0
⋮
659.0
653.0
652.0
652.0
650.0
649.0
647.0
643.0
642.0
I want for each client to aggregate by each column and calculate the sum of this total_listened.
For example let's say the client '0000e19d6a5608e3787e5bfea98488ca' , i want to know how much he listened to the genre CLASSICAL , how much he listened to the album 'Schumann: Music for Clarinet' etc .
I need to do a double groupby , the user_key and each other columns and then sum the total_listened column but i can't make it works.
When i try to do the groupby inside a groupby it tells me that i can't groupby a groupedDataFrame .
Also i don't really know how to reproduce this dataframe easily , i can share with you the CSV directly if you want .
Results i want:
genre_main -> i groupby user_key , genre_main then do a sum on total_listened which gives me the total_listened per genre and then i want to keep only the genre that are > 33% of the total of listened of the user (all genres that are > 33 % are in one columns as a string separated by a ',').
app_short_name -> exactly the same as genre_main.
album_name -> just get the top 1 album ( i know how to do it i guess , just sum total_listened when grouped by album_name and do a sort |> first )
Thank you very much
Upvotes: 2
Views: 439
Reputation: 69869
First, please update DataFrames.jl to the latest release 0.22 to get the newest features of the package and bug fixes.
To get the total listened by genre do:
combine(groupby(df, [:USER_KEY, :GENRE_MAIN]), :TOTAL_LISTENED => sum)
To get the total listened by album just do:
combine(groupby(df, [:USER_KEY, :ALBUM_NAME]), :TOTAL_LISTENED => sum)
Is this what you are looking for?
EDIT:
genre_main -> i groupby user_key , genre_main then do a sum on total_listened which gives me the total_listened per genre and then i want to keep only the genre that are > 33% of the total of listened of the user (all genres that are > 33 % are in one columns as a string separated by a ',')
Can be done like this:
using Chain # to make reading the pipe easier
@chain df begin
groupby(:USER_KEY)
transform(:TOTAL_LISTENED => (x -> x ./ sum(x) => :FRACTION_LISTENED)
groupby([:USER_LISTENED, :GENRE_MAIN])
combine(:FRACTION_LISTENED => sum => :FRACTION_LISTENED)
filter(:FRACTION_LISTENED => >=(0.33), _)
groupby(:USER_LISTENED)
combine(:GENRE_MAIN => (x -> join(x, ",")) => :FREQUENT_GENRES)
end
I am writing it from my head, but this is the way to do it.
As a result you will get a data frame with two columns :USER_LISTENED
and :FREQUENT_GENRES
. I assume you can do the second part of your question similarly and then e.g. join the results on :USER_LISTENED
.
Upvotes: 4