Tiberius

Reputation: 145

Run a basic correlation between two columns of a dataframe

I am trying to produce a correlation matrix from specified columns of a pandas DataFrame.

Here is my csv data:

col0,col1,col2,col3,col4
122468.9071,1417464.203,3546600,151804924,10839476
14691.1139,170036.0407,103847,19208604,2365065

Here are the two dataframes I created:

df1 = pd.read_csv('c:/temp/test_1.csv', usecols=[0])
df2 = pd.read_csv('c:/temp/test_1.csv', usecols=[1])

I tried the corr and corrwith functions and got the following results:

Corr Function:

print df1.corr(df2)

Result: 

Error: Could not compare ['pearson'] with block values

Corrwith:

print df1.corrwith(df2)

Result:    

col0   NaN
col1   NaN
dtype: float64

As you can see, there are no null values in the data set and the float64 should be able to handle decimals.

Any help with a solution would be greatly appreciated.

Upvotes: 4

Views: 13476

Answers (1)

Josh Baker

Reputation: 608

If you are trying to create a correlation matrix between the two columns, I would suggest bringing them into the same dataframe, like so:

import pandas as pd

df = pd.read_csv('c:/temp/test_1.csv', usecols=[0, 1])
df.corr()

I loaded your data into a csv myself and got a 2x2 correlation matrix of all 1s, which is expected: with only two rows, any two non-constant columns are perfectly correlated.
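If you want to reproduce this without a file on disk, here is a minimal sketch. It assumes the two rows from the question are the whole data set and inlines them with io.StringIO; the names csv_text, corr_matrix and r are only illustrative.

```python
import io

import pandas as pd

# The two rows from the question, inlined so no file on disk is needed.
csv_text = """col0,col1,col2,col3,col4
122468.9071,1417464.203,3546600,151804924,10839476
14691.1139,170036.0407,103847,19208604,2365065
"""

# Read only the two columns of interest into a single DataFrame.
df = pd.read_csv(io.StringIO(csv_text), usecols=[0, 1])

# Pairwise Pearson correlation between every pair of columns.
corr_matrix = df.corr()
print(corr_matrix)

# A single coefficient between two columns, via Series.corr.
r = df["col0"].corr(df["col1"])
print(r)
```

df.corr() gives the full matrix, while Series.corr is handy when you only need the one number for a pair of columns.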

You can find documentation on the pandas correlation here: http://pandas.pydata.org/pandas-docs/stable/computation.html#correlation
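As for why the two original calls did not work, my understanding is this: the first positional argument of DataFrame.corr is the correlation method, not a second DataFrame, and DataFrame.corrwith matches columns by label, so a frame holding only col0 and a frame holding only col1 have nothing to pair up. The sketch below illustrates the corrwith case on the same two rows. The rename step is just one possible workaround I am assuming for illustration, not the only way to do it.

```python
import io

import pandas as pd

csv_text = """col0,col1,col2,col3,col4
122468.9071,1417464.203,3546600,151804924,10839476
14691.1139,170036.0407,103847,19208604,2365065
"""

# Same setup as in the question: one column per DataFrame.
df1 = pd.read_csv(io.StringIO(csv_text), usecols=[0])
df2 = pd.read_csv(io.StringIO(csv_text), usecols=[1])

# corrwith pairs columns by label; "col0" and "col1" never match,
# so every entry in the result is NaN.
unmatched = df1.corrwith(df2)
print(unmatched)

# Workaround (assumption for illustration): give both frames the same
# column label so corrwith has a pair to compare.
matched = df1.corrwith(df2.rename(columns={"col1": "col0"}))
print(matched)

# Alternatively, compare the two columns directly as Series.
r = df1["col0"].corr(df2["col1"])
print(r)
```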

Upvotes: 5
