Reputation: 237
Suppose we want to construct a stock portfolio. To decide which stocks to include in the portfolio and what weight to assign to them, we use different metrics, e.g., price, earnings per share (eps), dividend yield, etc. All these metrics are stored in individual pandas dataframes, where rows represent a certain point in time and columns are associated with a specific stock (e.g., IBM, MSFT, ...):
import pandas as pd
price = pd.DataFrame([[-1.332298, 0.396217, 0.574269, -0.679972, -0.470584, 0.234379],
[-0.222567, 0.281202, -0.505856, -1.392477, 0.941539, 0.974867],
[-1.139867, -0.458111, -0.999498, 1.920840, 0.478174, -0.315904],
[-0.189720, -0.542432, -0.471642, 1.506206, -1.506439, 0.301714]],
columns=['IBM', 'MSFT', 'APPL', 'ORCL','FB','TWTR'],
index=pd.date_range('2000', freq='D', periods=4))
eps = pd.DataFrame([[-1.91, 1.63, 0.51, -.32, -0.84, 0.37],
[-0.56, 0.02, 0.56, 1.77, 0.99, 0.97],
[-1.67, -0.41, -0.98, 1.20, 0.74, -0.04],
[-0.80, -0.43, -0.12, 1.06, 1.59, 0.34]],
columns=['IBM', 'MSFT', 'APPL', 'ORCL','FB','TWTR'],
index=pd.date_range('2000', freq='D', periods=4))
price
IBM MSFT APPL ORCL FB TWTR
2000-01-01 -1.332298 0.396217 0.574269 -0.679972 -0.470584 0.234379
2000-01-02 -0.222567 0.281202 -0.505856 -1.392477 0.941539 0.974867
2000-01-03 -1.139867 -0.458111 -0.999498 1.920840 0.478174 -0.315904
2000-01-04 -0.189720 -0.542432 -0.471642 1.506206 -1.506439 0.301714
eps
IBM MSFT APPL ORCL FB TWTR
2000-01-01 -1.91 1.63 0.51 -0.32 -0.84 0.37
2000-01-02 -0.56 0.02 0.56 1.77 0.99 0.97
2000-01-03 -1.67 -0.41 -0.98 1.20 0.74 -0.04
2000-01-04 -0.80 -0.43 -0.12 1.06 1.59 0.34
The different dataframes are obviously closely connected. However, they are all stored in separate variables. In a large application, it can become difficult to keep track of which variables belong together and form a coherent unit. What is a good design paradigm for arranging these kinds of related datasets?
Using an object-oriented design pattern, I would construct something like a StockList() object that stores individual Stock() objects, which in turn store the information (time series) that correspond to a specific stock.
class Stock:
    def __init__(self, price_series, eps_series, div_yield_series):
        self.price = price_series
        self.eps = eps_series
        self.div_yield = div_yield_series


class StockList:
    def __init__(self, stock_list):
        self.stock_list = stock_list

    def append(self, stock):
        self.stock_list.append(stock)
But is this a viable option when working with dataframes? I think taking the time series apart and merging them back together when queried leads to a considerable loss in performance and a superfluous set of operations.
Alternatively, the StockList() could store the dataframes directly, without constructing individual Stock() objects, serving more or less as a plain data structure; a sketch of what I mean follows. However, is this an appropriate compromise?
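For illustration, a minimal sketch of this alternative (the dict-of-dataframes layout and the stock() accessor are just one possible shape):

class StockList:
    """Stores the metric dataframes directly, without per-stock objects."""

    def __init__(self, **metrics):
        self.metrics = metrics  # e.g. {'price': price, 'eps': eps}

    def stock(self, ticker):
        """Assemble one stock's time series only when queried."""
        return pd.DataFrame({name: df[ticker] for name, df in self.metrics.items()})

stocks = StockList(price=price, eps=eps)
stocks.stock('IBM')  # columns: 'price', 'eps'; index: dates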
I generally wonder whether a separate object should be created at all or if these individual dataframes should just be left as separate variables. This most likely would increase performance, reduce memory usage, support parallel computing and foster a functional programming style.
But how can we then bundle data that belongs together?
Upvotes: 7
Views: 947
Reputation: 81
Pandas is very slow at append-like operations. When you're working with parallel computing, it is better not to use Stock() objects, because they would slow things down. Instead, if you want to keep track of price and eps, a simple solution is to rename the columns after the dataframe they come from, combine the dataframes, and, when you want the original dataframes back, split the combined one apart again. A sketch of these steps follows.
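A minimal sketch of these steps, using the price and eps frames from the question (add_suffix, concat, and filter are standard pandas; the variable names are illustrative):

import pandas as pd

# rename columns after the dataframe they come from
price_renamed = price.add_suffix('_price')  # IBM -> IBM_price, ...
eps_renamed = eps.add_suffix('_eps')        # IBM -> IBM_eps, ...

# combine both dataframes side by side on the shared date index
combined = pd.concat([price_renamed, eps_renamed], axis=1)

# revert: select one metric's columns and strip the suffix again
price_back = combined.filter(like='_price')
price_back.columns = [c.replace('_price', '') for c in price_back.columns]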
Upvotes: 0
Reputation: 4459
If I understand your question correctly, you basically have two (or more) related dataframes that you want to join together; let's try it out with two.
I do realize that this is mainly a design-pattern question, but I'm showing you that you can easily keep them as separate dataframes and then efficiently combine them into one whenever needed.
We'll do these steps:
1. Reshape each dataframe into the long format [date, stock, DF_NAME], where the DF_NAME value would be price/eps/etc.
2. Merge the reshaped dataframes on [date, stock].
3. Optionally, pivot the merged result back into the original wide layout.
import pandas as pd

def reshape_df(df, value_name):
    """Reshape the dataframe by resetting the index and melting."""
    df = df.reset_index()
    df = df.melt(id_vars=['index'])
    df.columns = ['date', 'stock', value_name]
    return df
price = pd.DataFrame([[-1.332298, 0.396217, 0.574269, -0.679972, -0.470584, 0.234379],
[-0.222567, 0.281202, -0.505856, -1.392477, 0.941539, 0.974867],
[-1.139867, -0.458111, -0.999498, 1.920840, 0.478174, -0.315904],
[-0.189720, -0.542432, -0.471642, 1.506206, -1.506439, 0.301714]],
columns=['IBM', 'MSFT', 'APPL', 'ORCL','FB','TWTR'],
index=pd.date_range('2000', freq='D', periods=4))
eps = pd.DataFrame([[-1.91, 1.63, 0.51, -.32, -0.84, 0.37],
[-0.56, 0.02, 0.56, 1.77, 0.99, 0.97],
[-1.67, -0.41, -0.98, 1.20, 0.74, -0.04],
[-0.80, -0.43, -0.12, 1.06, 1.59, 0.34]],
columns=['IBM', 'MSFT', 'APPL', 'ORCL','FB','TWTR'],
index=pd.date_range('2000', freq='D', periods=4))
price_fixed = reshape_df(price, 'price')
eps_fixed = reshape_df(eps, 'eps')
merged = price_fixed.merge(eps_fixed, on=['date', 'stock'], how='outer')
# Optional
merged_pivoted = merged.pivot(columns='stock', index='date')
merged_pivoted_fixed_header = merged_pivoted.copy()
merged_pivoted_fixed_header.columns = ['-'.join(col).strip() for col in merged_pivoted_fixed_header.columns.values]
We'll first change how the data is represented, converting it to the three-column format [date, stock, DF_NAME], using this function:

def reshape_df(df, value_name):
    """Reshape the dataframe by resetting the index and melting."""
    df = df.reset_index()
    df = df.melt(id_vars=['index'])
    df.columns = ['date', 'stock', value_name]
    return df
which, when called on price, for example:
price_fixed = reshape_df(price, 'price')
would give you
date stock price
0 2000-01-01 IBM -1.332298
1 2000-01-02 IBM -0.222567
2 2000-01-03 IBM -1.139867
3 2000-01-04 IBM -0.189720
4 2000-01-01 MSFT 0.396217
5 2000-01-02 MSFT 0.281202
6 2000-01-03 MSFT -0.458111
7 2000-01-04 MSFT -0.542432
8 2000-01-01 APPL 0.574269
9 2000-01-02 APPL -0.505856
10 2000-01-03 APPL -0.999498
11 2000-01-04 APPL -0.471642
12 2000-01-01 ORCL -0.679972
13 2000-01-02 ORCL -1.392477
14 2000-01-03 ORCL 1.920840
15 2000-01-04 ORCL 1.506206
16 2000-01-01 FB -0.470584
17 2000-01-02 FB 0.941539
18 2000-01-03 FB 0.478174
19 2000-01-04 FB -1.506439
20 2000-01-01 TWTR 0.234379
21 2000-01-02 TWTR 0.974867
22 2000-01-03 TWTR -0.315904
23 2000-01-04 TWTR 0.301714
We do the same for eps with

eps_fixed = reshape_df(eps, 'eps')

Then we merge them with

merged = price_fixed.merge(eps_fixed, on=['date', 'stock'], how='outer')
which gives us
date stock price eps
0 2000-01-01 IBM -1.332298 -1.91
1 2000-01-02 IBM -0.222567 -0.56
2 2000-01-03 IBM -1.139867 -1.67
3 2000-01-04 IBM -0.189720 -0.80
4 2000-01-01 MSFT 0.396217 1.63
5 2000-01-02 MSFT 0.281202 0.02
6 2000-01-03 MSFT -0.458111 -0.41
7 2000-01-04 MSFT -0.542432 -0.43
8 2000-01-01 APPL 0.574269 0.51
9 2000-01-02 APPL -0.505856 0.56
10 2000-01-03 APPL -0.999498 -0.98
11 2000-01-04 APPL -0.471642 -0.12
12 2000-01-01 ORCL -0.679972 -0.32
13 2000-01-02 ORCL -1.392477 1.77
14 2000-01-03 ORCL 1.920840 1.20
15 2000-01-04 ORCL 1.506206 1.06
16 2000-01-01 FB -0.470584 -0.84
17 2000-01-02 FB 0.941539 0.99
18 2000-01-03 FB 0.478174 0.74
19 2000-01-04 FB -1.506439 1.59
20 2000-01-01 TWTR 0.234379 0.37
21 2000-01-02 TWTR 0.974867 0.97
22 2000-01-03 TWTR -0.315904 -0.04
23 2000-01-04 TWTR 0.301714 0.34
If you'd like to have it in the same representation you had before, this is a job for pivot, which can be done with
merged_pivoted = merged.pivot(columns='stock', index='date')
giving you
price eps
stock APPL FB IBM MSFT ORCL TWTR APPL FB IBM MSFT ORCL TWTR
date
2000-01-01 0.574269 -0.470584 -1.332298 0.396217 -0.679972 0.234379 0.51 -0.84 -1.91 1.63 -0.32 0.37
2000-01-02 -0.505856 0.941539 -0.222567 0.281202 -1.392477 0.974867 0.56 0.99 -0.56 0.02 1.77 0.97
2000-01-03 -0.999498 0.478174 -1.139867 -0.458111 1.920840 -0.315904 -0.98 0.74 -1.67 -0.41 1.20 -0.04
2000-01-04 -0.471642 -1.506439 -0.189720 -0.542432 1.506206 0.301714 -0.12 1.59 -0.80 -0.43 1.06 0.34
And since you mentioned that you don't want to work with a MultiIndex, you can squash the headers like this:
merged_pivoted_fixed_header = merged_pivoted.copy()
merged_pivoted_fixed_header.columns = ['-'.join(col).strip() for col in merged_pivoted_fixed_header.columns.values]
giving us
price-APPL price-FB price-IBM price-MSFT price-ORCL price-TWTR eps-APPL eps-FB eps-IBM eps-MSFT eps-ORCL eps-TWTR
date
2000-01-01 0.574269 -0.470584 -1.332298 0.396217 -0.679972 0.234379 0.51 -0.84 -1.91 1.63 -0.32 0.37
2000-01-02 -0.505856 0.941539 -0.222567 0.281202 -1.392477 0.974867 0.56 0.99 -0.56 0.02 1.77 0.97
2000-01-03 -0.999498 0.478174 -1.139867 -0.458111 1.920840 -0.315904 -0.98 0.74 -1.67 -0.41 1.20 -0.04
2000-01-04 -0.471642 -1.506439 -0.189720 -0.542432 1.506206 0.301714 -0.12 1.59 -0.80 -0.43 1.06 0.34
Upvotes: 1
Reputation: 1408
You are on the right track with the object-oriented design. From my experience, practices like this help when dealing with pandas dataframes in production settings.
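One concrete example of such a practice, as a hedged sketch (my own addition, not from the original post): validate on construction that the bundled series actually line up, so mismatches surface early:

class Stock:
    """One stock's related series; checks index alignment up front (illustrative)."""

    def __init__(self, price_series, eps_series):
        if not price_series.index.equals(eps_series.index):
            raise ValueError("price and eps must share the same date index")
        self.price = price_series
        self.eps = eps_series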
Upvotes: 3
Reputation: 3001
This example has 3 measures, so I would create a pandas Series with a 3-level MultiIndex (metric, date, ticker):
First, create the eps and price data frames as per the original post:
import pandas as pd
price = pd.DataFrame([[-1.332298, 0.396217, 0.574269, -0.679972, -0.470584, 0.234379],
[-0.222567, 0.281202, -0.505856, -1.392477, 0.941539, 0.974867],
[-1.139867, -0.458111, -0.999498, 1.920840, 0.478174, -0.315904],
[-0.189720, -0.542432, -0.471642, 1.506206, -1.506439, 0.301714]],
columns=['IBM', 'MSFT', 'APPL', 'ORCL','FB','TWTR'],
index=pd.date_range('2000', freq='D', periods=4))
eps = pd.DataFrame([[-1.91, 1.63, 0.51, -.32, -0.84, 0.37],
[-0.56, 0.02, 0.56, 1.77, 0.99, 0.97],
[-1.67, -0.41, -0.98, 1.20, 0.74, -0.04],
[-0.80, -0.43, -0.12, 1.06, 1.59, 0.34]],
columns=['IBM', 'MSFT', 'APPL', 'ORCL','FB','TWTR'],
index=pd.date_range('2000', freq='D', periods=4))
Second, combine these to create the new stock table (with multi-index):
# re-shape `eps` data frame
eps.index.name = 'date'
eps.columns.name = 'ticker'
eps = (eps.assign(metric='eps')
.set_index('metric', append=True)
.stack()
.swaplevel('metric', 'date')
.sort_index()
)
# re-shape `price` data frame
price.index.name = 'date'
price.columns.name = 'ticker'
price = (price.assign(metric='price')
.set_index('metric', append=True)
.stack()
.swaplevel('metric', 'date')
.sort_index())
# you could put, say, `volume` data frame here...
# concatenate
stock_data = pd.concat([eps, price]).rename('value')
# display
print(stock_data.head(8))
metric date ticker
eps 2000-01-01 APPL 0.51
FB -0.84
IBM -1.91
MSFT 1.63
ORCL -0.32
TWTR 0.37
2000-01-02 APPL 0.56
FB 0.99
Name: value, dtype: float64
The pandas MultiIndex is powerful, but non-intuitive for DataFrames; it's more straightforward for Series. Everything is specified with .loc[::]. Then we can use .unstack() to re-shape for further downstream processing (e.g., create a DataFrame with dates on the rows and tickers on the columns, and create plots with Matplotlib):
# index level 0, scalar
t0 = stock_data.loc['eps']
# index level 1, range
t1 = stock_data.loc[:, '2000-01-02':'2000-01-03']
# index level 2, list (note the 'APPL' spelling used in the sample data)
t2 = stock_data.loc[:, :, ['APPL', 'MSFT', 'TWTR']]
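As a quick sketch of the .unstack() step mentioned above (assuming the stock_data Series built earlier):

# re-shape the 'price' slice to a wide frame: dates on rows, tickers on columns
price_wide = stock_data.loc['price'].unstack('ticker')
price_wide.plot()  # e.g., one Matplotlib line per ticker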
Upvotes: 4
Reputation: 9207
You could do it like this:
import pandas as pd
df = pd.DataFrame.from_dict({'index': {0: '2000-01-01', 1: '2000-01-01', 2: '2000-01-01', 3: '2000-01-01'},
                             'company': {0: 'IBM', 1: 'IBM', 2: 'MSFT', 3: 'MSFT'},
                             'category': {0: 'price', 1: 'eps', 2: 'price', 3: 'eps'},
                             'value': {0: -1.332298, 1: -1.91, 2: 0.396217, 3: 1.63}}).set_index("index")
df[['company', 'category']] = df[['company', 'category']].astype('category')
From there it is just slicing:
df_company = df[df['company']=="IBM"]
df_company_price = df_company[df_company["category"]=="price"]
You can set the column types already when initializing the dataframe if that is an issue. The pandas categorical type should improve performance for the repeated values in the company and category columns. After that you should be able to retrieve data rather fast, since most pandas functions are already vectorized. To improve performance further, you could look into dask as well.
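A minimal sketch of setting the types at construction time, using the same toy data (the dtype mapping via astype is standard pandas):

import pandas as pd

df = pd.DataFrame({
    'company': ['IBM', 'IBM', 'MSFT', 'MSFT'],
    'category': ['price', 'eps', 'price', 'eps'],
    'value': [-1.332298, -1.91, 0.396217, 1.63],
}, index=pd.to_datetime(['2000-01-01'] * 4)).astype(
    {'company': 'category', 'category': 'category'})

print(df.dtypes)  # company and category are now categorical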
Upvotes: 0