quantguy

Reputation: 237

What is a good design pattern to combine datasets that are related but stored in different dataframes?

Suppose we want to construct a stock portfolio. To decide which stocks to include in the portfolio and what weight to assign to them, we use different metrics such as price, earnings-per-share (eps), dividend yield, etc. All these metrics are stored in individual pandas dataframes, where rows specify a certain point in time and columns are associated with a specific stock (e.g., IBM, MSFT, ...):

import pandas as pd

price = pd.DataFrame([[-1.332298,  0.396217,  0.574269, -0.679972, -0.470584,  0.234379],
                      [-0.222567,  0.281202, -0.505856, -1.392477,  0.941539,  0.974867],
                      [-1.139867, -0.458111, -0.999498,  1.920840,  0.478174, -0.315904],
                      [-0.189720, -0.542432, -0.471642,  1.506206, -1.506439,  0.301714]],
                     columns=['IBM', 'MSFT', 'APPL', 'ORCL','FB','TWTR'], 
                     index=pd.date_range('2000', freq='D', periods=4))

eps = pd.DataFrame([[-1.91,  1.63,  0.51, -0.32, -0.84,  0.37],
                    [-0.56,  0.02,  0.56,  1.77,  0.99,  0.97],
                    [-1.67, -0.41, -0.98,  1.20,  0.74, -0.04],
                    [-0.80, -0.43, -0.12,  1.06,  1.59,  0.34]],
                   columns=['IBM', 'MSFT', 'APPL', 'ORCL','FB','TWTR'],
                   index=pd.date_range('2000', freq='D', periods=4))


price

    IBM MSFT    APPL    ORCL    FB  TWTR
2000-01-01  -1.332298   0.396217    0.574269    -0.679972   -0.470584   0.234379
2000-01-02  -0.222567   0.281202    -0.505856   -1.392477   0.941539    0.974867
2000-01-03  -1.139867   -0.458111   -0.999498   1.920840    0.478174    -0.315904
2000-01-04  -0.189720   -0.542432   -0.471642   1.506206    -1.506439   0.301714


eps

    IBM MSFT    APPL    ORCL    FB  TWTR
2000-01-01  -1.91   1.63    0.51    -0.32   -0.84   0.37
2000-01-02  -0.56   0.02    0.56    1.77    0.99    0.97
2000-01-03  -1.67   -0.41   -0.98   1.20    0.74    -0.04
2000-01-04  -0.80   -0.43   -0.12   1.06    1.59    0.34

The different dataframes are obviously closely connected. However, they are all stored in separate variables. In a large application, it can become difficult to keep track of which variables belong together and form a coherent unit. What is a good design paradigm for arranging these kinds of related datasets?

Using an object-oriented design pattern, I would construct something like a StockList() object that stores individual Stock() objects, which in turn store the information (time series) that corresponds to a specific stock.

class Stock:
    def __init__(self, price_series, eps_series, div_yield_series):
        self.price = price_series
        self.eps = eps_series
        self.div_yield = div_yield_series


class StockList:
    def __init__(self, stock_list):
        self.stock_list = stock_list

    def append(self, stock):
        self.stock_list.append(stock)
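
For example, populating these would mean slicing each metric dataframe per ticker (a sketch, assuming a div_yield dataframe shaped like price and eps):

stocks = StockList([Stock(price[t], eps[t], div_yield[t]) for t in price.columns])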


But is this a viable option when working with dataframes? I think taking the time series apart and merging them back together when queried leads to a considerable loss in performance and a superfluous set of operations.

Alternatively, the StockList() could store the dataframes directly, without constructing single Stock() objects (serving more or less as a data structure). However, is this an appropriate compromise?
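
For concreteness, a rough sketch of this alternative (again assuming a div_yield dataframe exists):

class StockList:
    """Bundles the related metric dataframes directly, one attribute per metric."""
    def __init__(self, price, eps, div_yield):
        self.price = price
        self.eps = eps
        self.div_yield = div_yield

portfolio_data = StockList(price, eps, div_yield)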

I generally wonder whether a separate object should be created at all or if these individual dataframes should just be left as separate variables. This most likely would increase performance, reduce memory usage, support parallel computing and foster a functional programming style.

But how can we then bundle data that belongs together?

Upvotes: 7

Views: 947

Answers (5)

karan

Reputation: 81

Pandas is very slow when you do append-like operations. When you're working with parallel computing, it is better not to use Stock() objects, because they would slow down the operation. Instead, if you want to keep track of price and eps, the simple solution is to rename the columns according to the dataframe they come from, combine both dataframes, and, when you want the originals back, revert by splitting on the column names.
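
A rough sketch of those steps, assuming the price and eps dataframes from the question (add_suffix/concat are one way to do this; the variable names are only illustrative):

import pandas as pd

# suffix each metric's columns with its dataframe name, e.g. IBM -> IBM_price
price_renamed = price.add_suffix('_price')
eps_renamed = eps.add_suffix('_eps')

# combine both dataframes side by side on the shared date index
combined = pd.concat([price_renamed, eps_renamed], axis=1)

# revert back to the original price dataframe: filter on the suffix and strip it
price_back = combined.filter(like='_price').rename(columns=lambda c: c.replace('_price', ''))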

Upvotes: 0

Jimmar

Reputation: 4459

If I understand your question correctly, you basically have 2 (or more) dataframes that are related and that you want to join together; let's try it out with 2.
I do realize that this is mainly a design-pattern question, but I'm showing you that you can easily keep them as separate dataframes and then efficiently combine them into one whenever needed.

tl;dr

We'll do these steps:

  • reshape each dataframe into the columns [date, stock, DF_NAME], where DF_NAME is the metric name (price, eps, etc.)
  • merge them together (an outer merge, to retain all data) on [date, stock].
  • [Optional] pivot back into a MultiIndex.
  • [Optional] squash the headers to end up with a single index.

import pandas as pd

def reshape_df(df, value_name):
    """reshapes the dataframe by resetting the index and melting"""
    df = df.reset_index()
    df = df.melt(id_vars=['index'])
    df.columns = ['date', 'stock', value_name]
    return df

price = pd.DataFrame([[-1.332298,  0.396217,  0.574269, -0.679972, -0.470584,  0.234379],
                      [-0.222567,  0.281202, -0.505856, -1.392477,  0.941539,  0.974867],
                      [-1.139867, -0.458111, -0.999498,  1.920840,  0.478174, -0.315904],
                      [-0.189720, -0.542432, -0.471642,  1.506206, -1.506439,  0.301714]],
                     columns=['IBM', 'MSFT', 'APPL', 'ORCL','FB','TWTR'], 
                     index=pd.date_range('2000', freq='D', periods=4))

eps = pd.DataFrame([[-1.91,  1.63,  0.51, -0.32, -0.84,  0.37],
                    [-0.56,  0.02,  0.56,  1.77,  0.99,  0.97],
                    [-1.67, -0.41, -0.98,  1.20,  0.74, -0.04],
                    [-0.80, -0.43, -0.12,  1.06,  1.59,  0.34]],
                   columns=['IBM', 'MSFT', 'APPL', 'ORCL','FB','TWTR'],
                   index=pd.date_range('2000', freq='D', periods=4))

price_fixed = reshape_df(price, 'price')
eps_fixed = reshape_df(eps, 'eps')
merged = price_fixed.merge(eps_fixed, on=['date', 'stock'], how='outer')
# Optional
merged_pivoted = merged.pivot(columns='stock', index='date')
merged_pivoted_fixed_header = merged_pivoted.copy()
merged_pivoted_fixed_header.columns = ['-'.join(col).strip() for col in merged_pivoted_fixed_header.columns.values]

Breakdown

We'll first change how the data is represented, moving each dataframe to a 3-column representation [date, stock, DF_NAME], using this function:

def reshape_df(df, value_name):
    """reshapes the dataframe by resetting the index and melting"""
    df = df.reset_index()
    df = df.melt(id_vars=['index'])
    df.columns = ['date', 'stock', value_name]
    return df

which, when called on price for example,

price_fixed = reshape_df(price, 'price')

would give you

    date        stock   price
0   2000-01-01  IBM     -1.332298
1   2000-01-02  IBM     -0.222567
2   2000-01-03  IBM     -1.139867
3   2000-01-04  IBM     -0.189720
4   2000-01-01  MSFT    0.396217
5   2000-01-02  MSFT    0.281202
6   2000-01-03  MSFT    -0.458111
7   2000-01-04  MSFT    -0.542432
8   2000-01-01  APPL    0.574269
9   2000-01-02  APPL    -0.505856
10  2000-01-03  APPL    -0.999498
11  2000-01-04  APPL    -0.471642
12  2000-01-01  ORCL    -0.679972
13  2000-01-02  ORCL    -1.392477
14  2000-01-03  ORCL    1.920840
15  2000-01-04  ORCL    1.506206
16  2000-01-01  FB      -0.470584
17  2000-01-02  FB      0.941539
18  2000-01-03  FB      0.478174
19  2000-01-04  FB      -1.506439
20  2000-01-01  TWTR    0.234379
21  2000-01-02  TWTR    0.974867
22  2000-01-03  TWTR    -0.315904
23  2000-01-04  TWTR    0.301714

and we do the same for eps with

eps_fixed = reshape_df(eps, 'eps')

we merge them with

merged = price_fixed.merge(eps_fixed, on=['date', 'stock'], how='outer')

which gives us

    date        stock   price       eps
0   2000-01-01  IBM     -1.332298   -1.91
1   2000-01-02  IBM     -0.222567   -0.56
2   2000-01-03  IBM     -1.139867   -1.67
3   2000-01-04  IBM     -0.189720   -0.80
4   2000-01-01  MSFT    0.396217    1.63
5   2000-01-02  MSFT    0.281202    0.02
6   2000-01-03  MSFT    -0.458111   -0.41
7   2000-01-04  MSFT    -0.542432   -0.43
8   2000-01-01  APPL    0.574269    0.51
9   2000-01-02  APPL    -0.505856   0.56
10  2000-01-03  APPL    -0.999498   -0.98
11  2000-01-04  APPL    -0.471642   -0.12
12  2000-01-01  ORCL    -0.679972   -0.32
13  2000-01-02  ORCL    -1.392477   1.77
14  2000-01-03  ORCL    1.920840    1.20
15  2000-01-04  ORCL    1.506206    1.06
16  2000-01-01  FB      -0.470584   -0.84
17  2000-01-02  FB      0.941539    0.99
18  2000-01-03  FB      0.478174    0.74
19  2000-01-04  FB      -1.506439   1.59
20  2000-01-01  TWTR    0.234379    0.37
21  2000-01-02  TWTR    0.974867    0.97
22  2000-01-03  TWTR    -0.315904   -0.04
23  2000-01-04  TWTR    0.301714    0.34

[Optional] Change back to the previous representation

If you'd like to have it in the same representation you had before, this is a job for pivot, which can be done with

merged_pivoted = merged.pivot(columns='stock', index='date')

giving you

            price                                                                   eps
stock       APPL        FB          IBM         MSFT        ORCL        TWTR        APPL        FB  IBM     MSFT    ORCL    TWTR
date                                                
2000-01-01  0.574269    -0.470584   -1.332298   0.396217    -0.679972   0.234379    0.51    -0.84   -1.91   1.63    -0.32   0.37
2000-01-02  -0.505856   0.941539    -0.222567   0.281202    -1.392477   0.974867    0.56    0.99    -0.56   0.02    1.77    0.97
2000-01-03  -0.999498   0.478174    -1.139867   -0.458111   1.920840    -0.315904   -0.98   0.74    -1.67   -0.41   1.20    -0.04
2000-01-04  -0.471642   -1.506439   -0.189720   -0.542432   1.506206    0.301714    -0.12   1.59    -0.80   -0.43   1.06    0.34

And since you mentioned that you don't want to work with a MultiIndex, you can squash the headers like this:

merged_pivoted_fixed_header = merged_pivoted.copy()
merged_pivoted_fixed_header.columns = ['-'.join(col).strip() for col in merged_pivoted_fixed_header.columns.values]

giving us

            price-APPL  price-FB    price-IBM   price-MSFT  price-ORCL  price-TWTR  eps-APPL    eps-FB  eps-IBM eps-MSFT    eps-ORCL    eps-TWTR
date                                                
2000-01-01  0.574269    -0.470584   -1.332298   0.396217    -0.679972   0.234379    0.51        -0.84   -1.91   1.63        -0.32       0.37
2000-01-02  -0.505856   0.941539    -0.222567   0.281202    -1.392477   0.974867    0.56        0.99    -0.56   0.02        1.77        0.97
2000-01-03  -0.999498   0.478174    -1.139867   -0.458111   1.920840    -0.315904   -0.98       0.74    -1.67   -0.41       1.20        -0.04
2000-01-04  -0.471642   -1.506439   -0.189720   -0.542432   1.506206    0.301714    -0.12       1.59    -0.80   -0.43       1.06        0.34

Upvotes: 1

Gecko

Reputation: 1408

You are on the right track with the object-oriented design. From my experience, these practices help when dealing with pandas dataframes in production settings (a short sketch follows the list).

  • Wrap your dataframes with an object that can handle data loading and validation. Be sure to check for data types and expected columns.
  • Collect all data of the same type in one dataframe for efficient querying. In your case, you would have 3 dataframes.
  • When you need to pivot/merge/filter the data, put that logic into a class method.
  • Use a caching decorator on your methods so you never repeat the same operation twice.
  • In those methods, instead of returning raw dataframes, return new objects that wrap the resulting dataframe, or use subclassed dataframes. This helps with type annotations and simplifies documentation.
  • You may end up with a lot of classes, so use dataclasses or attrs to define your classes in a concise way.
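
A minimal sketch of a few of these ideas, assuming just the price and eps dataframes from the question (the class and attribute names are only illustrative):

from dataclasses import dataclass
from functools import cached_property

import pandas as pd

@dataclass
class MarketData:
    """Bundles the related metric dataframes and validates them on construction."""
    price: pd.DataFrame
    eps: pd.DataFrame

    def __post_init__(self):
        # validation: every metric must cover the same tickers and dates
        if list(self.price.columns) != list(self.eps.columns):
            raise ValueError("price and eps must cover the same tickers")
        if not self.price.index.equals(self.eps.index):
            raise ValueError("price and eps must cover the same dates")

    @cached_property
    def pe_ratio(self) -> pd.DataFrame:
        # derived data lives in a cached attribute instead of a loose variable
        return self.price / self.eps

MarketData(price, eps).pe_ratio then computes the ratio once per instance and caches it.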

Upvotes: 3

jsmart

Reputation: 3001

This example's data varies along three dimensions, so I would create a pandas Series with a 3-level MultiIndex:

  • metric (eps, price)
  • date (2000-01-01, 2000-01-02, ...)
  • ticker ('APPL', 'FB', ...)

First, create the eps and price data frames as per the original post:

import pandas as pd

price = pd.DataFrame([[-1.332298,  0.396217,  0.574269, -0.679972, -0.470584,  0.234379],
                      [-0.222567,  0.281202, -0.505856, -1.392477,  0.941539,  0.974867],
                      [-1.139867, -0.458111, -0.999498,  1.920840,  0.478174, -0.315904],
                      [-0.189720, -0.542432, -0.471642,  1.506206, -1.506439,  0.301714]],
                     columns=['IBM', 'MSFT', 'APPL', 'ORCL','FB','TWTR'], 
                     index=pd.date_range('2000', freq='D', periods=4))

eps = pd.DataFrame([[-1.91,  1.63,  0.51, -0.32, -0.84,  0.37],
                    [-0.56,  0.02,  0.56,  1.77,  0.99,  0.97],
                    [-1.67, -0.41, -0.98,  1.20,  0.74, -0.04],
                    [-0.80, -0.43, -0.12,  1.06,  1.59,  0.34]],
                   columns=['IBM', 'MSFT', 'APPL', 'ORCL','FB','TWTR'],
                   index=pd.date_range('2000', freq='D', periods=4))

Second, combine these to create the new stock table (with multi-index):

# re-shape `eps` data frame
eps.index.name = 'date'
eps.columns.name = 'ticker'
eps = (eps.assign(metric='eps')
       .set_index('metric', append=True)
       .stack()
       .swaplevel('metric', 'date')
       .sort_index()
      )

# re-shape `price` data frame
price.index.name = 'date'
price.columns.name = 'ticker'
price = (price.assign(metric='price')
         .set_index('metric', append=True)
         .stack()
         .swaplevel('metric', 'date')
         .sort_index())

# you could put, say, `volume` data frame here...

# concatenate
stock_data = pd.concat([eps, price]).rename('value')

# display
print(stock_data.head(8))

metric  date        ticker
eps     2000-01-01  APPL      0.51
                    FB       -0.84
                    IBM      -1.91
                    MSFT      1.63
                    ORCL     -0.32
                    TWTR      0.37
        2000-01-02  APPL      0.56
                    FB        0.99
Name: value, dtype: float64

The pandas MultiIndex is powerful, but non-intuitive for DataFrames; it's more straightforward for Series. Everything is selected with .loc[...]. Then we can use .unstack() to re-shape for further downstream processing (e.g., create a DataFrame with dates on the rows and tickers on the columns, and create plots with Matplotlib).

# index level 0 (metric), scalar
t0 = stock_data.loc['eps']

# index level 1 (date), range
t1 = stock_data.loc[:, '2000-01-02':'2000-01-03']

# index level 2 (ticker), list (matching the 'APPL' spelling used in the data)
t2 = stock_data.loc[:, :, ['APPL', 'MSFT', 'TWTR']]
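
And as one example of re-shaping with .unstack(), this gets us back to a wide table (a small sketch):

# wide price table again: dates on the rows, tickers on the columns
price_wide = stock_data.loc['price'].unstack('ticker')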

Upvotes: 4

Andreas

Reputation: 9207

You could do it like this:

import pandas as pd
df = pd.DataFrame.from_dict({'index': {0: '2000-01-01', 1: '2000-01-01', 2: '2000-01-01', 3: '2000-01-01'},
 'company': {0: 'IBM', 1: 'IBM', 2: 'MSFT', 3: 'MSFT'},
 'category': {0: 'price', 1: 'eps', 2: 'price', 3: 'eps'},
 'value': {0: -1.332298, 1: -1.91, 2: 0.396217, 3: 1.63}}).set_index("index")
df[['company', 'category']] = df[['company', 'category']].astype('category')

From there it is just slicing:

df_company = df[df['company']=="IBM"]
df_company_price = df_company[df_company["category"]=="price"]

You can set the column type already when initializing the dataframe if that is an issue. The pandas categorical type should improve performance for the repeating values in the company and category columns. After that you should be able to retrieve data rather fast, since most pandas functions are already vectorized. To improve performance further, you could look into dask as well.
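
For instance, a minimal sketch of setting the categorical dtype at construction time (values taken from the example above):

import pandas as pd

df = pd.DataFrame({
    'company': pd.Categorical(['IBM', 'IBM', 'MSFT', 'MSFT']),
    'category': pd.Categorical(['price', 'eps', 'price', 'eps']),
    'value': [-1.332298, -1.91, 0.396217, 1.63],
})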

Upvotes: 0
