Teja C

Reputation: 11

How to perform linear correlation on a data set and return the column names with the highest correlation?

I am working on a data set which has the closing prices of several stocks.

prices = {
    'GOOG' : [
        742.66, 738.40, 738.22, 741.16,
        739.98, 747.28, 746.22, 741.80,
        745.33, 741.29, 742.83, 750.50
    ],
    'FB' : [
        108.40, 107.92, 109.64, 112.22,
        109.57, 113.82, 114.03, 112.24,
        114.68, 112.92, 113.28, 115.40
    ],
    'MSFT' : [
        55.40, 54.63, 54.98, 55.88,
        54.12, 59.16, 58.14, 55.97,
        61.20, 57.14, 56.62, 59.25
    ],
    'AAPL' : [
        106.00, 104.66, 104.87, 105.69,
        104.22, 110.16, 109.84, 108.86,
        110.14, 107.66, 108.08, 109.90
    ]
}

These are the closing prices for the last 12 days. I need to determine which pair of stocks had the most highly correlated percentage change in daily closing prices and return the two tickers as an array.

import pandas as pd
import numpy as np

class StockPrices:
    # param prices: dict of string to list. A dictionary containing the tickers of the stocks and each ticker's daily prices.
    # returns: list of strings. A list containing the tickers of the two most correlated stocks.
    @staticmethod
    def most_corr(prices):
        return


#For example, with the parameters below the function should return ['FB', 'MSFT'].
prices = {
    'GOOG' : [
        742.66, 738.40, 738.22, 741.16,
        739.98, 747.28, 746.22, 741.80,
        745.33, 741.29, 742.83, 750.50
    ],
    'FB' : [
        108.40, 107.92, 109.64, 112.22,
        109.57, 113.82, 114.03, 112.24,
        114.68, 112.92, 113.28, 115.40
    ],
    'MSFT' : [
        55.40, 54.63, 54.98, 55.88,
        54.12, 59.16, 58.14, 55.97,
        61.20, 57.14, 56.62, 59.25
    ],
    'AAPL' : [
        106.00, 104.66, 104.87, 105.69,
        104.22, 110.16, 109.84, 108.86,
        110.14, 107.66, 108.08, 109.90
    ]
}

print(StockPrices.most_corr(prices))

I have gone through numpy's correlation function, but how can I use it to determine which two of these vectors have the maximum correlation?
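
For reference, this is what I understand so far: np.corrcoef on two of the price series returns a 2x2 matrix whose off-diagonal entry is the Pearson correlation of that pair, e.g.

import numpy as np

fb = prices['FB']
msft = prices['MSFT']

# np.corrcoef returns a 2x2 matrix; the [0, 1] (or [1, 0]) entry
# is the correlation coefficient of the two series
print(np.corrcoef(fb, msft)[0, 1])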

Upvotes: 0

Views: 2028

Answers (7)

ROSHAN JAMTHE

Reputation: 1

This will get you the fastest answer...

def most_corr(prices):
    """
    :param prices: (pandas.DataFrame) A dataframe containing each ticker's 
                   daily closing prices.
    :returns: (container of strings) A container, containing the two tickers that 
              are the most highly (linearly) correlated by daily percentage change.
    """
    # Get the daily percentage change
    df = prices.pct_change().dropna(how="any")
    # Get the correlation matrix and replace the 1s on the diagonal with 0
    corr_df = df.corr().replace(1, 0)
    # Sort each ticker's maximum off-diagonal correlation (ascending by default)
    max_corr = corr_df.max().sort_values()
    # Return the last two tickers, which are the most highly correlated pair
    return max_corr.index.tolist()[-2:]

Upvotes: 0

claudio coppola

Reputation: 1

Here is my solution :)

import pandas as pd
import numpy as np

def most_corr(prices):
    """
    :param prices: (pandas.DataFrame) A dataframe containing each ticker's 
                   daily closing prices.
    :returns: (container of strings) A container, containing the two tickers that 
              are the most highly (linearly) correlated by daily percentage change.
    """
    
    # Correlation matrix of the daily percentage changes
    corr_df = prices.pct_change().dropna(how='any').corr()
    # For each column, look only at the entries below the diagonal and keep
    # the best-matching ticker together with its correlation value
    pairs = [
        ((col, corr_df[col].iloc[i+1:].idxmax()),
         corr_df[col].iloc[i+1:].max())
        for i, col in enumerate(prices.columns)
        if i + 1 < len(prices.columns)]
    # Return the pair with the highest correlation
    return max(pairs, key=lambda x: x[1])[0]

#For example, the code below should print: ('FB', 'MSFT')
print(most_corr(pd.DataFrame.from_dict({
    'GOOG' : [
        742.66, 738.40, 738.22, 741.16,
        739.98, 747.28, 746.22, 741.80,
        745.33, 741.29, 742.83, 750.50
    ],
    'FB' : [
        108.40, 107.92, 109.64, 112.22,
        109.57, 113.82, 114.03, 112.24,
        114.68, 112.92, 113.28, 115.40
    ],
    'MSFT' : [
        55.40, 54.63, 54.98, 55.88,
        54.12, 59.16, 58.14, 55.97,
        61.20, 57.14, 56.62, 59.25
    ],
    'AAPL' : [
        106.00, 104.66, 104.87, 105.69,
        104.22, 110.16, 109.84, 108.86,
        110.14, 107.66, 108.08, 109.90
    ]
})))

Upvotes: 0

IMANE ZRIAA

Reputation: 11

import pandas as pd
import numpy as np

def most_corr(prices):
    """
    :param prices: (pandas.DataFrame) A dataframe containing each ticker's 
                   daily closing prices.
    :returns: (container of strings) A container, containing the two tickers that 
              are the most highly (linearly) correlated by daily percentage change.
    """
    l = list()
    # Correlation matrix of the daily percentage changes
    price = prices.pct_change().dropna(how="any")
    df = price.corr()
    # For each ticker, keep its highest off-diagonal correlation
    # (the largest value in each column is always the 1.0 on the diagonal)
    for col in df.columns:
        l.append(sorted(df[col].values)[-2])
    # The two tickers whose columns contain the overall maximum form the pair
    val = df[df.isin([max(l)]).any()].reset_index()['index'].unique()
    return val
    



#For example, the code below should print: ('FB', 'MSFT')
print(most_corr(pd.DataFrame.from_dict({
    'GOOG' : [
        742.66, 738.40, 738.22, 741.16,
        739.98, 747.28, 746.22, 741.80,
        745.33, 741.29, 742.83, 750.50
    ],
    'FB' : [
        108.40, 107.92, 109.64, 112.22,
        109.57, 113.82, 114.03, 112.24,
        114.68, 112.92, 113.28, 115.40
    ],
    'MSFT' : [
        55.40, 54.63, 54.98, 55.88,
        54.12, 59.16, 58.14, 55.97,
        61.20, 57.14, 56.62, 59.25
    ],
    'AAPL' : [
        106.00, 104.66, 104.87, 105.69,
        104.22, 110.16, 109.84, 108.86,
        110.14, 107.66, 108.08, 109.90
    ]
})))

Upvotes: 1

RyanAbnavi

Reputation: 378

Here is my solution which passes all tests:

import pandas as pd
import numpy as np

def most_corr(prices):
    """
    :param prices: (pandas.DataFrame) A dataframe containing each ticker's 
                   daily closing prices.
    :returns: (container of strings) A container, containing the two tickers that 
              are the most highly (linearly) correlated by daily percentage change.
    """
    n_cols = prices.shape[1]
    # Correlation matrix of the daily percentage changes
    df = prices.pct_change().dropna(how="any")
    cor = df.corr()
    # Scan the upper triangle for the largest absolute correlation
    mx, row, col = 0, 0, 0
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            if abs(cor.iloc[i, j]) > mx:
                mx = abs(cor.iloc[i, j])
                row = i
                col = j
    return [prices.columns[row], prices.columns[col]]

Upvotes: 0

f0xdx

Reputation: 1469

As pointed out above, you can use the built-in computation of Pearson's r on a dataframe by calling the corr() function:

df = pd.DataFrame(prices)
df = df.pct_change()
df.corr()

Note that what you are most likely interested in is the correlation of the daily returns of a stock, i.e., the daily percentage changes of each symbol. If you compute the correlation of the actual price values, you might see distortion effects due to the different price levels. Daily returns can be computed with pandas' pct_change() function.

The maximum correlation for a given symbol can then be obtained by calling, e.g., df.corr()['AAPL'].nlargest(2) (note that df.corr().max() trivially returns the 1.0 correlation of each symbol with itself). In many cases, though, you might instead want to select the values above a certain threshold, e.g.,

df.corr() > 0.85
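
If you want the single most correlated pair rather than a per-symbol view, one possible sketch is to mask the diagonal and take the index labels of the largest remaining entry (the helper name most_corr_pair is just for illustration):

import numpy as np
import pandas as pd

def most_corr_pair(prices):
    # Correlation matrix of the daily percentage changes
    corr = pd.DataFrame(prices).pct_change().corr()
    # Blank out the trivial 1.0 self-correlations on the diagonal
    masked = corr.where(~np.eye(len(corr), dtype=bool))
    # stack() flattens the matrix into a Series indexed by (row, column) labels;
    # idxmax() then returns the labels of the largest off-diagonal entry
    return list(masked.stack().idxmax())

# With the question's data this prints ['FB', 'MSFT']
print(most_corr_pair(prices))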

Upvotes: 1

Simon

Reputation: 333

If you don't want to go the pandas route, you could do it yourself using standard Python tools:

import itertools
import operator

import numpy as np

# All unordered pairs of tickers
tuples = list(itertools.combinations(prices.keys(), 2))

# Pearson correlation coefficient for each pair
correlations = {}
for pair in tuples:
    correlations.update({pair: np.corrcoef(prices[pair[0]], prices[pair[1]])[1, 0]})

# Pair with the highest correlation
max(correlations.keys(), key=(lambda key: correlations[key]))

The first step creates all pairwise combinations. The loop then builds a dictionary mapping each pair to its correlation coefficient, and the final line returns the pair with the maximum.

The pandas answer is nice, but you then need to parse that dataframe to find the right value, which is also a fine way of doing things :)
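
Note that the snippet above correlates the raw prices; the task asks for the correlation of the daily percentage changes, so a possible variation (the helper pct_changes is purely illustrative) converts each series to returns first:

import itertools

import numpy as np

def pct_changes(series):
    # Daily percentage change between consecutive closing prices
    return [(b - a) / a for a, b in zip(series, series[1:])]

returns = {ticker: pct_changes(closes) for ticker, closes in prices.items()}

correlations = {
    pair: np.corrcoef(returns[pair[0]], returns[pair[1]])[1, 0]
    for pair in itertools.combinations(returns, 2)
}

# Pair of tickers with the most correlated daily returns
print(max(correlations, key=correlations.get))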

Upvotes: 1

Milton Arango G

Reputation: 805

You could use pandas' corr function by converting your dictionary into a dataframe. This function returns the correlation matrix for the numeric columns in your dataframe.

import pandas as pd

prices = {
    'GOOG' : [
        742.66, 738.40, 738.22, 741.16,
        739.98, 747.28, 746.22, 741.80,
        745.33, 741.29, 742.83, 750.50
    ],
    'FB' : [
        108.40, 107.92, 109.64, 112.22,
        109.57, 113.82, 114.03, 112.24,
        114.68, 112.92, 113.28, 115.40
    ],
    'MSFT' : [
        55.40, 54.63, 54.98, 55.88,
        54.12, 59.16, 58.14, 55.97,
        61.20, 57.14, 56.62, 59.25
    ],
    'AAPL' : [
        106.00, 104.66, 104.87, 105.69,
        104.22, 110.16, 109.84, 108.86,
        110.14, 107.66, 108.08, 109.90
    ]
}

df = pd.DataFrame.from_dict(prices)
print(df.corr())

Out:

          AAPL        FB      GOOG      MSFT
AAPL  1.000000  0.886750  0.853015  0.894846
FB    0.886750  1.000000  0.799421  0.858784
GOOG  0.853015  0.799421  1.000000  0.820544
MSFT  0.894846  0.858784  0.820544  1.000000

The Pearson correlation is calculated by default (which is the standard); if you need another method, kendall and spearman are also available.
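
Continuing from the df above, a different method can be requested through corr's method parameter; a minimal sketch:

# Rank-based alternatives to the default Pearson correlation
print(df.corr(method='spearman'))
print(df.corr(method='kendall'))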

Upvotes: 2
