Talal Zahid

Reputation: 128

Group one DataFrame by weeks

I have a DataFrame:

Date        Articles
2010-01-04  ((though, reliant, advertis, revenu, internet,...
2010-01-05  ((googl, expect, nexus, one, rival, iphon, hel...
2010-01-06  ((while, googl, introduc, first, piec, hardwar...
2010-01-07  ((googl, form, energi, subsidiari, appli, gove...
2010-01-08  ((david, pogu, review, googl, new, offer, nexu...
2010-01-12  ((the, compani, agre, hand, list, book, scan, ...

Date is the index and Articles is a tuple of tuples.

I have another DataFrame:

Date        Price
2010-01-08  602.020
2010-01-15  580.000
2010-01-22  550.010
2010-01-29  529.944

Date is also the index here, but the rows are weekly.

My problem is that I want to make another column in the second DataFrame that contains all the articles up until the particular week indicated by the index. For the first row of my second DataFrame, I would want all the articles from my first DataFrame before 2010-01-08, grouped together (so that would be the first four entries of my first DataFrame). Likewise for 2010-01-15, I would need all the articles from 2010-01-08 through 2010-01-14, and so on.
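
For reference, a minimal sketch that reproduces the shape of these frames (the Articles tuples here are shortened placeholders, not the real data):

import pandas as pd

# placeholder article tuples; the real tuples are much longer
articles = [(('though', 'reliant'),), (('googl', 'expect'),),
            (('while', 'googl'),), (('googl', 'form'),),
            (('david', 'pogu'),), (('the', 'compani'),)]
df1 = pd.DataFrame({'Articles': articles},
                   index=pd.to_datetime(['2010-01-04', '2010-01-05', '2010-01-06',
                                         '2010-01-07', '2010-01-08', '2010-01-12']))
df2 = pd.DataFrame({'Price': [602.020, 580.000, 550.010, 529.944]},
                   index=pd.to_datetime(['2010-01-08', '2010-01-15',
                                         '2010-01-22', '2010-01-29']))
df1.index.name = df2.index.name = 'Date'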

Any help would be appreciated. Thanks.

Upvotes: 2

Views: 99

Answers (4)

Anton vBR

Reputation: 18916

Maybe this rather simple two-liner could work too. (It uses the calendar week, which does not break at January 8, 2010 but at Monday, January 11.)

m = {ind:dfx['Articles'].tolist() for ind,dfx in df1.groupby(df1.index.week)} 
df2['new'] = pd.Series(df2.index.week).map(m).values

If you would rather break on the actual day, we can modify the code to use integer division of the day of the year:

m = {ind+1:dfx['Articles'].tolist() for ind,dfx in df1.groupby((df1.index.dayofyear-1)//7)}
df2['new'] = pd.Series(df2.index.week).map(m).values

Full example:

import io

import pandas as pd

data1 = '''\
Date        Articles
2010-01-04  1
2010-01-05  2
2010-01-06  3
2010-01-07  4
2010-01-08  5'''

data2 = '''\
Date        Price
2010-01-08  602.020
2010-01-15  580.000
2010-01-22  550.010
2010-01-29  529.944'''

df1 = pd.read_csv(io.StringIO(data1), sep='\s+', index_col='Date', parse_dates=['Date'])
df2 = pd.read_csv(io.StringIO(data2), sep='\s+', index_col='Date', parse_dates=['Date'])

m = {ind:dfx['Articles'].tolist() for ind,dfx in df1.groupby(df1.index.week)}

df2['new'] = pd.Series(df2.index.week).map(m).values

df2:

              Price              new
Date                                
2010-01-08  602.020  [1, 2, 3, 4, 5]
2010-01-15  580.000              NaN
2010-01-22  550.010              NaN
2010-01-29  529.944              NaN

Or, with the day-of-year variant:

              Price           new
Date                             
2010-01-08  602.020  [1, 2, 3, 4]
2010-01-15  580.000           [5]
2010-01-22  550.010           NaN
2010-01-29  529.944           NaN
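
Note: newer pandas versions (1.1+) deprecate DatetimeIndex.week, and pandas 2.0 removes it. Assuming one of those versions, the same mapping could be written with isocalendar():

m = {ind: dfx['Articles'].tolist() for ind, dfx in df1.groupby(df1.index.isocalendar().week)}
df2['new'] = df2.index.isocalendar().week.map(m).values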

Upvotes: 0

jezrael

Reputation: 862691

I believe you need to cut by the values of df2['Date'] with pd.cut, then groupby and join the tuples into lists:

print (df1)
        Date          Articles
0 2010-01-04  ((t, r), (s, q))
1 2010-01-07  ((g, f), (y, l))
2 2010-01-08  ((d, p), (t, o))
3 2010-01-12  ((t, c), (r, p))

b = pd.concat([df2['Date'], 
               pd.Series(pd.to_datetime(['1970-01-01','2100-01-01']))]).sort_values()

df1['Dates'] = pd.cut(df1['Date'], bins=b, labels=b[1:], right=False)
df3 = (df1.groupby('Dates')['Articles']
         .apply(lambda x: [i for s in x for i in s])
         .iloc[:-1]
         .reset_index())
print (df3)
       Dates                          Articles
0 2010-01-08  [(t, r), (s, q), (g, f), (y, l)]
1 2010-01-15  [(d, p), (t, o), (t, c), (r, p)]
2 2010-01-22                                []
3 2010-01-29                                []

Last, if you want to filter out the empty lists:

df3 = df3[df3['Articles'].astype(bool)]
print (df3)
       Dates                          Articles
0 2010-01-08  [(t, r), (s, q), (g, f), (y, l)]
1 2010-01-15  [(d, p), (t, o), (t, c), (r, p)]
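
For reference, a sketch of the inputs this answer assumes (here Date is a regular column, not the index):

import pandas as pd

df1 = pd.DataFrame({
    'Date': pd.to_datetime(['2010-01-04', '2010-01-07', '2010-01-08', '2010-01-12']),
    'Articles': [(('t', 'r'), ('s', 'q')), (('g', 'f'), ('y', 'l')),
                 (('d', 'p'), ('t', 'o')), (('t', 'c'), ('r', 'p'))]})
df2 = pd.DataFrame({
    'Date': pd.to_datetime(['2010-01-08', '2010-01-15', '2010-01-22', '2010-01-29']),
    'Price': [602.020, 580.000, 550.010, 529.944]})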

Upvotes: 0

Tai

Reputation: 7994

We can make use of IntervalIndex.from_breaks and pd.cut:

import pandas as pd

df1 = pd.DataFrame({'Articles':
                   {pd.Timestamp('2010-01-04 00:00:00'): [0, 1],
                    pd.Timestamp('2010-01-05 00:00:00'): [2, 3],
                    pd.Timestamp('2010-01-06 00:00:00'): [4, 5],
                    pd.Timestamp('2010-01-07 00:00:00'): [6, 7],
                    pd.Timestamp('2010-01-08 00:00:00'): [8, 9],
                    pd.Timestamp('2010-01-12 00:00:00'): [10, 11]}})

            Articles
2010-01-04  [0, 1]
2010-01-05  [2, 3]
2010-01-06  [4, 5]
2010-01-07  [6, 7]
2010-01-08  [8, 9]
2010-01-12  [10, 11]

mybins = pd.IntervalIndex.from_breaks(
             pd.date_range("2010-1-1", periods=5, freq="7D"),
             closed="left"
         )

df1["bin"] = pd.cut(df1.index, bins=mybins)
df1.groupby("bin")["Articles"].sum()

bin
[2010-01-01, 2010-01-08)    [0, 1, 2, 3, 4, 5, 6, 7]
[2010-01-08, 2010-01-15)              [8, 9, 10, 11]
[2010-01-15, 2010-01-22)                        None
[2010-01-22, 2010-01-29)                        None
Name: Articles, dtype: object
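
To actually attach this to df2 as the question asks, one possible follow-up (assuming df2 is the weekly price frame from the question, with the dates as its index) is to key each group by its interval's right edge, which lands exactly on a price date:

grouped = df1.groupby("bin")["Articles"].sum()
# each left-closed interval [start, end) ends on a price date,
# so map the interval's right edge back onto df2's index
df2["Articles"] = df2.index.map(dict(zip((iv.right for iv in grouped.index), grouped)))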

Upvotes: 1

Peter Leimbigler

Reputation: 11105

Here's a two-step solution using merge_asof, with allow_exact_matches=False so that each article row is matched with the first price whose date is strictly greater than (not equal to) the article row's date.

The .agg(sum) uses the fact that adding two tuples combines them into a single tuple.
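
For example:

('though', 'reliant') + ('googl', 'expect')
# -> ('though', 'reliant', 'googl', 'expect')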

Assuming your DataFrames are named df and df2:

# Test data adapted from your examples.
# Sorry that this is difficult to copy-paste into pandas

df
            Articles
2010-01-04  (though, reliant, advertis, revenu, internet)        
2010-01-05  ((googl, expect, nexus), (one, rival, iphon))        
2010-01-06  ((while, googl, introduc), (first,), (piec, hardwar))
2010-01-07  ((googl, form), (energi, subsidiari), (appli,))      
2010-01-08  ((david, pogu, review), (googl, new, offer))         
2010-01-12  ((the, compani), (agre, hand, list), (book, scan)) 

df2
            Price               
2010-01-08  602.020
2010-01-15  580.000
2010-01-22  550.010
2010-01-29  529.944
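
# For convenience, the same test data can be built with
# something like this (same values as shown above):
import pandas as pd

dates = pd.to_datetime(['2010-01-04', '2010-01-05', '2010-01-06',
                        '2010-01-07', '2010-01-08', '2010-01-12'])
articles = [('though', 'reliant', 'advertis', 'revenu', 'internet'),
            (('googl', 'expect', 'nexus'), ('one', 'rival', 'iphon')),
            (('while', 'googl', 'introduc'), ('first',), ('piec', 'hardwar')),
            (('googl', 'form'), ('energi', 'subsidiari'), ('appli',)),
            (('david', 'pogu', 'review'), ('googl', 'new', 'offer')),
            (('the', 'compani'), ('agre', 'hand', 'list'), ('book', 'scan'))]
df = pd.DataFrame({'Articles': articles}, index=dates)
df2 = pd.DataFrame({'Price': [602.020, 580.000, 550.010, 529.944]},
                   index=pd.to_datetime(['2010-01-08', '2010-01-15',
                                         '2010-01-22', '2010-01-29']))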


# Solution

price2articles = (pd.merge_asof(df, 
                               df2, 
                               left_index=True, 
                               right_index=True, 
                               allow_exact_matches=False,
                               direction='forward')
                .groupby('Price')
                .agg(sum))

result = pd.merge(df2, price2articles, left_on='Price', right_index=True)
# To see full contents of wide data, set
# pd.options.display.max_colwidth = 150 or higher (-1 for no limit)
result

            Articles                                                                                                                                                                                                          
2010-01-08  (though, reliant, advertis, revenu, internet, (googl, expect, nexus), (one, rival, iphon), (while, googl, introduc), (first,), (piec, hardwar), (googl, form), (energi, subsidiari), (appli,))  
2010-01-15  ((david, pogu, review), (googl, new, offer), (the, compani), (agre, hand, list), (book, scan))

Upvotes: 0
