Reputation: 7723
I have a pandas dataframe like as shown below
Company,year
T123 Inc Ltd,1990
T124 PVT ltd,1991
T345 Ltd,1990
T789 Pvt.LTd,2001
ABC Limited,1992
ABCDE Ltd,1994
ABC Ltd,1997
ABFE,1987
Tesla ltd,1995
AMAZON Inc,2001
Apple ltd,2003
tf = pd.read_clipboard(sep=',')
tf['Company_copy'] = tf['Company']
I would like to compare each value from tf['company']
against 5 rows above and 5 rows below from its current position.
For ex: I want T123 Inc Ltd
to be compared with T124,T345,T789,ABC,ACDE
. As T123
is the top row, there is nothing to compare above, we don't generate any comparison pair.
Similarly, if there is noT enough (5 rows) rows to compare, we compare it with whatever we have.
So, I tried the below with the help of this post here
pd.MultiIndex.from_product([tf['Company'].astype(str),tf['Company_copy'].astype(str)]).to_series()
but it produces m*n comparison across all rows of the dataframe
This is because, we have million records and cannot use multiindex_from product to generate all comparisons (which is not useful for us). Just the top and bottom 5 comparison pair for each row is enough. Is there anyway to apply filter multiindex to generate pair based on above and below 5 rows?
I expect my output to be like as below. I show for only one record T123 Inc Ltd
.
Company Company
T123 Inc Ltd T124 PVT ltd (T123 Inc Ltd, T124 PVT ltd)
T345 Ltd (T123 Inc Ltd, T345 Ltd)
T789 Pvt.LTd (T123 Inc Ltd, T789 Pvt.LTd)
ABC Limited (T123 Inc Ltd, ABC Limited)
ABCDE Ltd (T123 Inc Ltd, ABCDE Ltd)
Upvotes: 1
Views: 104
Reputation: 11
Depending on how really do you want to compare, this may or may not work for you.
I would loop over the dataframe and create new columns. Something like...
for i in range(5):
# df['company_value'] represents the column that contains representative company values
df[f'compare below {i}'] = list(df.loc[[i+1:],'company_value']) + ([np.nan] * (i+1))
for i in range(5):
# df['company_value'] represents the column that contains representative company values
df[f'compare above {i}'] = ([np.nan] * (i+1)) + list(df.loc[[:-(i+1)],'company_value'])
these 2 loops will give you 2x5 columns that contain values for 5 companies above and below. You can perform and operations on them and then drop the columns
The syntax can definitely be improved (maybe without having to use list), but this should work just fine.
Upvotes: 1
Reputation: 4407
A possible solution using Series.rolling
with center=True
and window size 11
(=5+1+5
) on Company
and then exclude the tuples with the middle row:
from itertools import chain, product
idx = pd.MultiIndex.from_tuples(chain(*(product([row], win)
for row, win in zip(tf['Company'], tf['Company'].rolling(11, min_periods=1, center=True))))).to_series()
idx = idx[idx.index.get_level_values(0) != idx.index.get_level_values(1)]
Example results:
print(idx['T123 Inc Ltd'])
T124 PVT ltd (T123 Inc Ltd, T124 PVT ltd)
T345 Ltd (T123 Inc Ltd, T345 Ltd)
T789 Pvt.LTd (T123 Inc Ltd, T789 Pvt.LTd)
ABC Limited (T123 Inc Ltd, ABC Limited)
ABCDE Ltd (T123 Inc Ltd, ABCDE Ltd)
dtype: object
print(idx['ABCDE Ltd'])
T123 Inc Ltd (ABCDE Ltd, T123 Inc Ltd)
T124 PVT ltd (ABCDE Ltd, T124 PVT ltd)
T345 Ltd (ABCDE Ltd, T345 Ltd)
T789 Pvt.LTd (ABCDE Ltd, T789 Pvt.LTd)
ABC Limited (ABCDE Ltd, ABC Limited)
ABC Ltd (ABCDE Ltd, ABC Ltd)
ABFE (ABCDE Ltd, ABFE)
Tesla ltd (ABCDE Ltd, Tesla ltd)
AMAZON Inc (ABCDE Ltd, AMAZON Inc)
Apple ltd (ABCDE Ltd, Apple ltd)
dtype: object
print(idx['Apple ltd'])
ABCDE Ltd (Apple ltd, ABCDE Ltd)
ABC Ltd (Apple ltd, ABC Ltd)
ABFE (Apple ltd, ABFE)
Tesla ltd (Apple ltd, Tesla ltd)
AMAZON Inc (Apple ltd, AMAZON Inc)
dtype: object
There are two ways of accomplishing this:
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=6)
idx = pd.MultiIndex.from_tuples(chain(*(product([row], win[1:])
for row, win in zip(tf['Company'],
tf['Company'].rolling(indexer, min_periods=1))))).to_series()
idx = pd.MultiIndex.from_tuples(chain(*(product([row], win[::-1])
for row, win in zip(tf['Company'][::-1],
tf['Company'][::-1]
.rolling(5, min_periods=1, closed='left'))))).to_series()
Example:
print(idx['T123 Inc Ltd'])
T124 PVT ltd (T123 Inc Ltd, T124 PVT ltd)
T345 Ltd (T123 Inc Ltd, T345 Ltd)
T789 Pvt.LTd (T123 Inc Ltd, T789 Pvt.LTd)
ABC Limited (T123 Inc Ltd, ABC Limited)
ABCDE Ltd (T123 Inc Ltd, ABCDE Ltd)
dtype: object
However, I think here the custom indexer is the cleanest because it is clearer what it does.
Upvotes: 1