echo55
echo55

Reputation: 329

Julia DataFrames.jl double group by

recently i've been really struggling with this , i thought maybe someone can help me with it , here is the problem:

I have a dataframe that represent what a client listen (music) , one user_key = one client , one client can have many rows . I have many columns like the date of stream , the genre the client listened , the album name .... and a column named TOTAL_LISTENED that represent the amount of time this client listened the album , on which app etc .

1000×9 DataFrame. Omitted printing of 2 columns
│ Row  │ USER_KEY                         │ STREAM_DATE │ GENRE_MAIN        │ ALBUM_NAME                                           │ ALBUM_ARTIST_NAME              │ APP_SHORT_NAME │ USER_TRIAL_STATUS │
│      │ String                           │ Dates.Date  │ String            │ String                                               │ String                         │ String         │ Bool              │
├──────┼──────────────────────────────────┼─────────────┼───────────────────┼──────────────────────────────────────────────────────┼────────────────────────────────┼────────────────┼───────────────────┤
│ 1    │ 0000e19d6a5608e3787e5bfea98488ca │ 2021-01-16  │ CLASSICAL         │ The Liszt Collection                                 │ VARIOUS ARTISTS                │ app_1          │ 0                 │
│ 2    │ 0000e19d6a5608e3787e5bfea98488ca │ 2021-01-19  │ CLASSICAL         │ Schumann: Music for Clarinet                         │ PATRICK MESSINA                │ app_1          │ 0                 │
│ 3    │ 0000e19d6a5608e3787e5bfea98488ca │ 2021-01-16  │ CLASSICAL         │ Schumann: Music for Clarinet                         │ PATRICK MESSINA                │ app_1          │ 0                 │
│ 4    │ 0000e19d6a5608e3787e5bfea98488ca │ 2020-12-28  │ JAZZ              │ Turn Up The Quiet                                    │ DIANA KRALL                    │ app_2          │ 0                 │
│ 5    │ 0000e19d6a5608e3787e5bfea98488ca │ 2021-01-03  │ CLASSICAL         │ Johann Sebastian Bach                                │ VÍKINGUR ÓLAFSSON              │ app_1          │ 0                 │
│ 6    │ 0000e19d6a5608e3787e5bfea98488ca │ 2021-01-20  │ CLASSICAL         │ Barbara Bonney - The Radiant Voice of Barbara Bonney │ BARBARA BONNEY                 │ app_1          │ 0                 │
│ 7    │ 0000e19d6a5608e3787e5bfea98488ca │ 2020-12-22  │ CLASSICAL         │ The Liszt Collection                                 │ VARIOUS ARTISTS                │ app_1          │ 0                 │
⋮
│ 993  │ 000a149099a55a6e74fd253ab3f5709d │ 2020-12-30  │ CLASSICAL         │ Handel: Concerti grossi, Op. 6 Nos. 1-6              │ AKADEMIE FÜR ALTE MUSIK BERLIN │ app_2          │ 0                 │
│ 994  │ 000a149099a55a6e74fd253ab3f5709d │ 2021-01-01  │ SOUL / FUNK / R&B │ Gold - 20 Super Hits (International)                 │ BONEY M.                       │ app_2          │ 0                 │
│ 995  │ 000a149099a55a6e74fd253ab3f5709d │ 2020-12-25  │ SOUL / FUNK / R&B │ Gold - 20 Super Hits (International)                 │ BONEY M.                       │ app_2          │ 0                 │
│ 996  │ 000a149099a55a6e74fd253ab3f5709d │ 2021-01-08  │ CLASSICAL         │ Vivaldi: Concertos For Two Violins                   │ VIKTORIA MULLOVA               │ app_2          │ 0                 │
│ 997  │ 000a149099a55a6e74fd253ab3f5709d │ 2021-01-09  │ CLASSICAL         │ Wagner: Tannhäuser                                   │ GIUSEPPE SINOPOLI              │ app_6          │ 0                 │
│ 998  │ 000a149099a55a6e74fd253ab3f5709d │ 2021-01-17  │ CLASSICAL         │ Cyrillus Kreek - The Suspended Harp of Babel         │ VOX CLAMANTIS                  │ app_2          │ 0                 │
│ 999  │ 000a149099a55a6e74fd253ab3f5709d │ 2020-12-25  │ REGGAE            │ Exodus 30th Anniversary Edition                      │ BOB MARLEY & THE WAILERS       │ app_2          │ 0                 │
│ 1000 │ 000a149099a55a6e74fd253ab3f5709d │ 2021-01-08  │ CLASSICAL         │ Double concertos                                     │ SIMON STANDAGE                 │ app_2          │ 0                 │

total_listened didnt print here so i put it just under

df[:TOTAL_LISTENED]
1000-element Array{Float64,1}:
 4161.0
 3909.0
 3465.0
 2172.0
 2040.0
 1742.0
 1666.0
 1646.0
 1513.0
 1450.0
    ⋮
  659.0
  653.0
  652.0
  652.0
  650.0
  649.0
  647.0
  643.0
  642.0

I want for each client to aggregate by each column and calculate the sum of this total_listened.

For example let's say the client '0000e19d6a5608e3787e5bfea98488ca' , i want to know how much he listened to the genre CLASSICAL , how much he listened to the album 'Schumann: Music for Clarinet' etc .

I need to do a double groupby , the user_key and each other columns and then sum the total_listened column but i can't make it works.

When i try to do the groupby inside a groupby it tells me that i can't groupby a groupedDataFrame .

Also i don't really know how to reproduce this dataframe easily , i can share with you the CSV directly if you want .

Results i want:

genre_main -> i groupby user_key , genre_main then do a sum on total_listened which gives me the total_listened per genre and then i want to keep only the genre that are > 33% of the total of listened of the user (all genres that are > 33 % are in one columns as a string separated by a ',').

app_short_name -> exactly the same as genre_main.

album_name -> just get the top 1 album ( i know how to do it i guess , just sum total_listened when grouped by album_name and do a sort |> first )

Thank you very much

Upvotes: 2

Views: 439

Answers (1)

Bogumił Kamiński
Bogumił Kamiński

Reputation: 69869

First, please update DataFrames.jl to the latest release 0.22 to get the newest features of the package and bug fixes.

To get the total listened by genre do:

combine(groupby(df, [:USER_KEY, :GENRE_MAIN]), :TOTAL_LISTENED => sum)

To get the total listened by album just do:

combine(groupby(df, [:USER_KEY, :ALBUM_NAME]), :TOTAL_LISTENED => sum)

Is this what you are looking for?

EDIT:

genre_main -> i groupby user_key , genre_main then do a sum on total_listened which gives me the total_listened per genre and then i want to keep only the genre that are > 33% of the total of listened of the user (all genres that are > 33 % are in one columns as a string separated by a ',')

Can be done like this:

using Chain # to make reading the pipe easier
@chain df begin
    groupby(:USER_KEY)
    transform(:TOTAL_LISTENED => (x -> x ./ sum(x) => :FRACTION_LISTENED)
    groupby([:USER_LISTENED, :GENRE_MAIN])
    combine(:FRACTION_LISTENED => sum => :FRACTION_LISTENED)
    filter(:FRACTION_LISTENED => >=(0.33), _)
    groupby(:USER_LISTENED)
    combine(:GENRE_MAIN => (x -> join(x, ",")) => :FREQUENT_GENRES)
end

I am writing it from my head, but this is the way to do it. As a result you will get a data frame with two columns :USER_LISTENED and :FREQUENT_GENRES. I assume you can do the second part of your question similarly and then e.g. join the results on :USER_LISTENED.

Upvotes: 4

Related Questions