Reputation: 128
I have a DataFrame:
Date Articles
2010-01-04 ((though, reliant, advertis, revenu, internet,...
2010-01-05 ((googl, expect, nexus, one, rival, iphon, hel...
2010-01-06 ((while, googl, introduc, first, piec, hardwar...
2010-01-07 ((googl, form, energi, subsidiari, appli, gove...
2010-01-08 ((david, pogu, review, googl, new, offer, nexu...
2010-01-12 ((the, compani, agre, hand, list, book, scan, ...
Date is the index and Articles is a tuple of tuples.
I have another DataFrame:
Date Price
2010-01-08 602.020
2010-01-15 580.000
2010-01-22 550.010
2010-01-29 529.944
Date is also the index here, but the rows are weekly.
My problem is that I want to make another column in the second dataframe that would contain all the articles up to that particular week, as indicated by the index. For the first row of my second dataframe, that means all the articles from my first dataframe, grouped together, from before 2010-01-08 (so the first 4 entries of my first dataframe). Likewise for 2010-01-15, I would need all the articles from 2010-01-08 till 2010-01-14, and so on.
Any help would be appreciated. Thanks.
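For reference, here is a minimal reproducible sketch of the two frames (the article tuples below are placeholder tokens, not my real data):
import pandas as pd
df1 = pd.DataFrame(
    {'Articles': [(('a', 'b'),), (('c', 'd'),), (('e', 'f'),),
                  (('g', 'h'),), (('i', 'j'),), (('k', 'l'),)]},
    index=pd.to_datetime(['2010-01-04', '2010-01-05', '2010-01-06',
                          '2010-01-07', '2010-01-08', '2010-01-12']))
df1.index.name = 'Date'
df2 = pd.DataFrame(
    {'Price': [602.020, 580.000, 550.010, 529.944]},
    index=pd.to_datetime(['2010-01-08', '2010-01-15',
                          '2010-01-22', '2010-01-29']))
df2.index.name = 'Date'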
Upvotes: 2
Views: 99
Reputation: 18916
Maybe this rather simple two-liner could work too (this uses the calendar week, which does not break at January 8, 2010 but rather around January 11):
m = {ind:dfx['Articles'].tolist() for ind,dfx in df1.groupby(df1.index.week)}
df2['new'] = pd.Series(df2.index.week).map(m).values
If you'd rather split on the actual day, we can modify this code to use integer division of the day of the year:
m = {ind+1:dfx['Articles'].tolist() for ind,dfx in df1.groupby((df1.index.dayofyear-1)//7)}
df2['new'] = pd.Series(df2.index.week).map(m).values
Full example:
import pandas as pd
from io import StringIO  # pd.compat.StringIO was removed in pandas 1.0
data1 = '''\
Date Articles
2010-01-04 1
2010-01-05 2
2010-01-06 3
2010-01-07 4
2010-01-08 5'''
data2 = '''\
Date Price
2010-01-08 602.020
2010-01-15 580.000
2010-01-22 550.010
2010-01-29 529.944'''
df1 = pd.read_csv(StringIO(data1), sep=r'\s+', index_col='Date', parse_dates=['Date'])
df2 = pd.read_csv(StringIO(data2), sep=r'\s+', index_col='Date', parse_dates=['Date'])
m = {ind:dfx['Articles'].tolist() for ind,dfx in df1.groupby(df1.index.week)}
df2['new'] = pd.Series(df2.index.week).map(m).values
df2 then looks like this (calendar-week version):
Price new
Date
2010-01-08 602.020 [1, 2, 3, 4, 5]
2010-01-15 580.000 NaN
2010-01-22 550.010 NaN
2010-01-29 529.944 NaN
or, with the day-of-year version:
Price new
Date
2010-01-08 602.020 [1, 2, 3, 4]
2010-01-15 580.000 [5]
2010-01-22 550.010 NaN
2010-01-29 529.944 NaN
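Note that the .week accessor used above was deprecated in pandas 1.1 and removed in pandas 2.0. An equivalent sketch for newer versions using isocalendar() (same idea, untested):
m = {ind: dfx['Articles'].tolist()
     for ind, dfx in df1.groupby(df1.index.isocalendar().week)}
df2['new'] = df2.index.isocalendar().week.map(m).values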
Upvotes: 0
Reputation: 862691
I believe you need cut by the values of df2['Date'], then groupby and join the tuples into lists:
print (df1)
Date Articles
0 2010-01-04 ((t, r), (s, q))
1 2010-01-07 ((g, f), (y, l))
2 2010-01-08 ((d, p), (t, o))
3 2010-01-12 ((t, c), (r, p))
b = pd.concat([df2['Date'],
pd.Series(pd.to_datetime(['1970-01-01','2100-01-01']))]).sort_values()
df1['Dates'] = pd.cut(df1['Date'], bins=b, labels=b[1:], right=False)
df3 = (df1.groupby('Dates')['Articles']
.apply(lambda x: [i for s in x for i in s])
.iloc[:-1]
.reset_index())
print (df3)
Dates Articles
0 2010-01-08 [(t, r), (s, q), (g, f), (y, l)]
1 2010-01-15 [(d, p), (t, o), (t, c), (r, p)]
2 2010-01-22 []
3 2010-01-29 []
Last, if you want to filter out the empty lists:
df3 = df3[df3['Articles'].astype(bool)]
print (df3)
Dates Articles
0 2010-01-08 [(t, r), (s, q), (g, f), (y, l)]
1 2010-01-15 [(d, p), (t, o), (t, c), (r, p)]
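For completeness, a self-contained sketch of this approach, with the sample frames reconstructed from the output above (Date is a regular column here, not the index; newer pandas wants observed=False in the groupby, and empty bins may come back as [] or NaN depending on the version):
import pandas as pd
df1 = pd.DataFrame({
    'Date': pd.to_datetime(['2010-01-04', '2010-01-07',
                            '2010-01-08', '2010-01-12']),
    'Articles': [(('t', 'r'), ('s', 'q')), (('g', 'f'), ('y', 'l')),
                 (('d', 'p'), ('t', 'o')), (('t', 'c'), ('r', 'p'))]})
df2 = pd.DataFrame({'Date': pd.to_datetime(['2010-01-08', '2010-01-15',
                                            '2010-01-22', '2010-01-29']),
                    'Price': [602.020, 580.000, 550.010, 529.944]})
# sentinel dates cap the first and last bins
b = pd.concat([df2['Date'],
               pd.Series(pd.to_datetime(['1970-01-01', '2100-01-01']))]).sort_values()
df1['Dates'] = pd.cut(df1['Date'], bins=b, labels=b[1:], right=False)
df3 = (df1.groupby('Dates', observed=False)['Articles']
          .apply(lambda x: [i for s in x for i in s])
          .iloc[:-1]
          .reset_index())
print (df3)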
Upvotes: 0
Reputation: 7994
We can make use of IntervalIndex.from_breaks and pd.cut:
df1 = pd.DataFrame({'Articles':
{pd.Timestamp('2010-01-04 00:00:00'): [0, 1],
pd.Timestamp('2010-01-05 00:00:00'): [2, 3],
pd.Timestamp('2010-01-06 00:00:00'): [4, 5],
pd.Timestamp('2010-01-07 00:00:00'): [6, 7],
pd.Timestamp('2010-01-08 00:00:00'): [8, 9],
pd.Timestamp('2010-01-12 00:00:00'): [10, 11]}})
Articles
2010-01-04 [0, 1]
2010-01-05 [2, 3]
2010-01-06 [4, 5]
2010-01-07 [6, 7]
2010-01-08 [8, 9]
2010-01-12 [10, 11]
mybins = pd.IntervalIndex.from_breaks(
pd.date_range("2010-1-1", periods=5, freq="7D"),
closed="left"
)
df1["bin"] = pd.cut(df1.index, bins=mybins)
df1.groupby("bin")["Articles"].sum()
bin
[2010-01-01, 2010-01-08) [0, 1, 2, 3, 4, 5, 6, 7]
[2010-01-08, 2010-01-15) [8, 9, 10, 11]
[2010-01-15, 2010-01-22) None
[2010-01-22, 2010-01-29) None
Name: Articles, dtype: object
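To attach these per-week article lists back onto a weekly price frame like df2 from the question, one hypothetical follow-up is to key the grouped result on each interval's right edge, i.e. the week-ending date (empty weeks may come back as 0 rather than a list, depending on the pandas version):
grouped = df1.groupby("bin", observed=False)["Articles"].sum()
grouped.index = [iv.right for iv in grouped.index]  # week-ending dates
df2["Articles"] = grouped.reindex(df2.index)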
Upvotes: 1
Reputation: 11105
Here's a two-step solution using merge_asof with allow_exact_matches=False, so that each article row is matched with the first price whose date is strictly greater than (not equal to) the article row's date. The .agg(sum) step then relies on the fact that adding two tuples combines them into a single tuple.
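For instance, in plain Python:
(1, 2) + (3, 4)            # (1, 2, 3, 4)
sum([(1, 2), (3, 4)], ())  # (1, 2, 3, 4) -- the builtin needs an empty-tuple start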
Assuming your DataFrames are named df and df2:
# Test data adapted from your examples.
# Sorry that this is difficult to copy-paste into pandas
df
Articles
2010-01-04 (though, reliant, advertis, revenu, internet)
2010-01-05 ((googl, expect, nexus), (one, rival, iphon))
2010-01-06 ((while, googl, introduc), (first,), (piec, hardwar))
2010-01-07 ((googl, form), (energi, subsidiari), (appli,))
2010-01-08 ((david, pogu, review), (googl, new, offer))
2010-01-12 ((the, compani), (agre, hand, list), (book, scan))
df2
Price
2010-01-08 602.020
2010-01-15 580.000
2010-01-22 550.010
2010-01-29 529.944
# Solution
price2articles = (pd.merge_asof(df,
df2,
left_index=True,
right_index=True,
allow_exact_matches=False,
direction='forward')
.groupby('Price')
.agg(sum))
result = pd.merge(df2, price2articles, left_on='Price', right_index=True)
# To see the full contents of wide data, set
# pd.options.display.max_colwidth = 150 or higher (None for no limit)
result
Articles
2010-01-08 (though, reliant, advertis, revenu, internet, (googl, expect, nexus), (one, rival, iphon), (while, googl, introduc), (first,), (piec, hardwar), (googl, form), (energi, subsidiari), (appli,))
2010-01-15 ((david, pogu, review), (googl, new, offer), (the, compani), (agre, hand, list), (book, scan))
Upvotes: 0