Reputation: 95
Somewhat similar to Excel's VLOOKUP function, I am wanting to use a value in one dataframe (portfolios
below) to find an associated value in a second dataframe (returns
below) and populate a third dataframe (let's call this dataframe3 for now) with these returned values. I have found several posts based on left merges and map
, but my original two dataframes are of different structures, so these methods don't seem to fit (to me, at least).
I haven't made much progress, but here is the code I have:
Code
import pandas as pd
portfolios = pd.read_csv('portstst5_1.csv')
returns = pd.read_csv('Example_Returns.csv')
total_cols = len(portfolios.columns)
headers = list(portfolios)
concat = returns['PERMNO'].map(str) + returns['FROMDATE'].map(str)
idx = 2
returns.insert(loc=idx, column="concat", value=concat)
for i in range(total_cols):
col_len = portfolios.iloc[:,i].count()
for j in range(col_len):
print(portfolios.iat[j,i].astype('int').astype('str') + headers[i])
Data
This code will make a little more sense if I first describe my data:
portfolios
is a dataframe with 13 columns of varying lengths. The column headers are dates in YYYYMMDD format. Below each date header are identifiers which are five digit numeric codes. A snippet of portfolios
looks like this (some elements in some columns contain NaN):
20131231 20131130 20131031 20130930 20130831 20130731 20130630 \
0 93044.0 93044.0 13264.0 13264.0 89169.0 82486.0 91274.0
1 79702.0 91515.0 90710.0 81148.0 47387.0 88359.0 93353.0
2 85751.0 85724.0 88810.0 11513.0 85576.0 47387.0 85576.0
The data in returns
data originally consists of three columns and 799 rows and looks like this (all elements are populated with values):
PERMNO FROMDATE MORET
0 93044 20131231 -0.022304
1 79702 20131231 0.012283
2 85751 20131231 -0.016453
3 85576 20131231 0.038766
Desired Output
I would like to make a third dataframe that is structured identically to portfolios
. That is, it will have the same column header dates and the same number of rows in each column as does portfolios
, but instead of identifiers, it will contain the MORET
for the appropriate identifier/date combination. This is the reason for the concatenations in my code above - I am trying (perhaps unnecessarily) to create unique lookup values so I can communicate between portfolios
and returns
. For example, to populate dataframe3[0,0]
, I would look for the concatenated values from portfolios[0,0]
and headers[0]
(i.e. 9304420131231) in returns['concat']
and return the associated value in returns['MORET']
(i.e. -0.022304). I am stuck here on how to use the concatenated values to return my desired data.
Any thoughts are greatly appreciated.
Upvotes: 1
Views: 865
Reputation: 9264
IIUC:
Using a combination of melt
so the we can merge
values from returns
by desired columns. Then use pivot
to reshape the data back, as seen below.
portfolios.columns = portfolios.columns.astype(int)
newdf = portfolios.reset_index().melt(id_vars='index',var_name=['FROMDATE'],value_name='PERMNO').merge(returns,on=['FROMDATE','PERMNO'],how='left').pivot(index='index',columns='FROMDATE',values='MORET')
Which returnsthe DataFrame below
FROMDATE 20130630 20130731 20130831 20130930 20131031 20131130 20131231
index
0 NaN NaN NaN NaN NaN NaN -0.022304
1 NaN NaN NaN NaN NaN NaN 0.012283
2 NaN NaN NaN NaN NaN NaN -0.016453
Sort columns
newdf.loc[:,newdf.columns.sort_values(ascending=False)]
Upvotes: 1
Reputation: 5109
What you are trying to do is much simpler than how you tried doing it. You can first melt portfolios
to flip it and collect all the date columns as rows in a single column, then join it with returns
, and finally pivot to get the desired result. This is basically what @djk47463 did in one compound line, and my edited answer serves as a step-by-step breakdown of his.
Let's create your DataFrames to make the answer reproducible.
import pandas as pd
import sys
if sys.version_info[0] < 3:
from StringIO import StringIO
else:
from io import StringIO
# Create df
rawText = StringIO("""
PERMNO FROMDATE MORET
0 93044 20131231 -0.022304
1 79702 20131231 0.012283
2 85751 20131231 -0.016453
3 85576 20131231 0.038766
4 93044 20131010 -0.02
5 79702 20131010 0.01
6 85751 20131010 -0.01
7 85576 20131010 0.03
""")
returns = pd.read_csv(rawText, sep = "\s+")
portfolios = pd.DataFrame({'20131010':[93044, 85751],
'20131231':[85576, 79702]})
Notice, the FROMDATE
column of returns
consists of numbers, but in portfolios
the date columns are strings. We must make them consistent:
df.FROMDATE = df.FROMDATE.astype(str)
Let's start the solution by melt
ing (i.e. unpivot) portfolios
:
portfolios = portfolios.melt(var_name=['FROMDATE'],value_name='PERMNO')
# portfolios:
FROMDATE PERMNO
0 20131010 93044
1 20131010 85751
2 20131231 85576
3 20131231 79702
Now you want to hold this pm
constant, and merge returns
to its lines whenever their PERMNO
s and FROMDATE
s match:
merged = pm.merge(df, how='left', on=['PERMNO', 'FROMDATE'])
# merged:
FROMDATE PERMNO MORET
0 20131010 93044 -0.020000
1 20131010 85751 -0.010000
2 20131231 85576 0.038766
3 20131231 79702 0.012283
Remember we had melt
ed (unpivoted) the portfolios
at the beginning? We should pivot
this result to give it the shape of portfolios
:
final = merged.pivot(index='PERMNO', columns='FROMDATE', values='MORET').reset_index()
# final:
FROMDATE PERMNO 20131010 20131231
0 79702 NaN 0.012283
1 85576 NaN 0.038766
2 85751 -0.01 NaN
3 93044 -0.02 NaN
Upvotes: 1
Reputation: 607
The typical way to do a vlookup in python is to create a series with what would be your left column in the index, and then slice that series by the lookup value. The NaNs complicate it a little. We'll make a series from returns
by using the set_index
method to set PERMNO
as the index for the dataframe, and then slicing by the column name to isolate the MORET
column as a series.
lookupseries = returns.set_index('PERMNO')['MORET']
def lookup(x):
try:
return lookupseries[x]
except:
return np.nan
newdf = portfolios.copy()
for c in newdf.columns:
newdf[c] = newdf[c].apply(lookup)
Upvotes: 0