NineWasps

Reputation: 2273

How to find the dependence between two columns in a DataFrame using Python

I have the following data:

city      inc       pop
New-York  29343,00  8683,00
Moscow    25896,00  17496,00
Boston    21785,00  15063,00
Berlin    20000,00  70453,00
London    44057,00  57398,00
Rome      24000,00  104831,00

I need to find how inc depends on pop. I tried plotting it with df.plot(x='inc', y='pop'), but the graph looks awful because I have 200 values. How can I do this better?

Upvotes: 0

Views: 2223

Answers (3)

Stefan

Reputation: 42905

As noted, you get the correlation via:

df['inc'].corr(df['pop'])
-0.0279628856838
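For reference, here is a minimal, self-contained sketch of that calculation. It assumes the six sample rows from the question, with the decimal commas rewritten as plain floats. The rank-based (Spearman) variant is my own addition, not something the question asked for; it may be worth a look if the relationship is monotonic but not linear.

```python
import pandas as pd

# Sample rows from the question (decimal commas rewritten as plain floats).
df = pd.DataFrame({
    "city": ["New-York", "Moscow", "Boston", "Berlin", "London", "Rome"],
    "inc": [29343.0, 25896.0, 21785.0, 20000.0, 44057.0, 24000.0],
    "pop": [8683.0, 17496.0, 15063.0, 70453.0, 57398.0, 104831.0],
})

# Pearson correlation: strength of the linear relationship.
# Use df["pop"], not df.pop, because DataFrame.pop is a method.
pearson = df["inc"].corr(df["pop"])

# Spearman correlation computed as the Pearson correlation of the ranks.
spearman = df["inc"].rank().corr(df["pop"].rank())

print(f"Pearson:  {pearson:.4f}")
print(f"Spearman: {spearman:.4f}")
```

Both values lie between -1 and 1; values close to 0, as here, suggest little linear (or monotonic) dependence in this small sample.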

If you want a linear regression, you can use OLS from statsmodels:

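import statsmodels.api as sm
# OLS does not add an intercept on its own, so add a constant column
df['const'] = 1
# regress inc (dependent) on pop (explanatory)
model = sm.OLS(df['inc'], df[['const', 'pop']])
results = model.fit()
print(results.summary())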

which yields:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    inc   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                 -0.249
Method:                 Least Squares   F-statistic:                  0.003130
Date:                Tue, 21 Jun 2016   Prob (F-statistic):              0.958
Time:                        07:29:55   Log-Likelihood:                -62.413
No. Observations:                   6   AIC:                             128.8
Df Residuals:                       4   BIC:                             128.4
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const        2.78e+04   6548.318      4.246      0.013      9623.205   4.6e+04
pop           -0.0064      0.114     -0.056      0.958        -0.322     0.310
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.613
Prob(Omnibus):                    nan   Jarque-Bera (JB):                1.721
Skew:                           1.302   Prob(JB):                        0.423
Kurtosis:                       3.330   Cond. No.                     9.46e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Finally, you can add a trendline to a scatter plot:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
ax = df.plot.scatter('inc', 'pop')
z = np.polyfit(df['inc'], df['pop'], 1)
p = np.poly1d(z)
df['trend'] = p(df.inc)
df.plot(x='inc', y='trend', ax=ax)
plt.show()

This gives a scatter plot with the fitted trend line drawn over it (it looks weird because I'm only using your 6 data points).

And also get the resulting line equation:

"y=%.6fx+(%.6f)" % (z[0], z[1])
y=-0.122779x+(49032.076720)
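Note that the line above models pop as a function of inc, while the OLS model earlier regresses inc on pop. If you want the trend line in the same direction as the regression, swap the arguments. A small sketch, again assuming the six sample rows from the question:

```python
import numpy as np

# Sample rows from the question (decimal commas rewritten as plain floats).
pop = np.array([8683.0, 17496.0, 15063.0, 70453.0, 57398.0, 104831.0])
inc = np.array([29343.0, 25896.0, 21785.0, 20000.0, 44057.0, 24000.0])

# Degree-1 least-squares fit of inc as a function of pop.
# np.polyfit returns the coefficients highest power first: [slope, intercept].
slope, intercept = np.polyfit(pop, inc, 1)

print("inc = %.6f * pop + (%.2f)" % (slope, intercept))
```

The slope and intercept should agree with the pop and const coefficients in the OLS summary, since both are ordinary least-squares fits of the same model.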

Upvotes: 1

viraptor

Reputation: 34205

You can do various things to make it more readable. With just that call you get a line plot, so the first step is to change it to a scatter plot.

If you're trying to show some correlation, you can overlay a regression line.

If that's too messy, you can play with colours and for example make the points light grey, but the regression line red.
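Something along these lines might work as a starting point. It is only a sketch: it uses the six sample rows from the question, plain matplotlib rather than df.plot, and an output file name (inc_vs_pop.png) that I made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

# Sample rows from the question (decimal commas rewritten as plain floats).
inc = np.array([29343.0, 25896.0, 21785.0, 20000.0, 44057.0, 24000.0])
pop = np.array([8683.0, 17496.0, 15063.0, 70453.0, 57398.0, 104831.0])

# Straight-line least-squares fit of pop against inc.
slope, intercept = np.polyfit(inc, pop, 1)
xs = np.linspace(inc.min(), inc.max(), 100)

fig, ax = plt.subplots()
# Light grey points keep a dense scatter from dominating the picture.
ax.scatter(inc, pop, color="lightgrey", label="cities")
# Red regression line drawn on top of the points.
ax.plot(xs, slope * xs + intercept, color="red", label="linear fit")
ax.set_xlabel("inc")
ax.set_ylabel("pop")
ax.legend()
fig.savefig("inc_vs_pop.png")  # hypothetical output file name
```

With 200 points, lowering the marker opacity (the alpha argument of ax.scatter) can also help where points overlap.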

Check out http://pandas.pydata.org/pandas-docs/stable/visualization.html for inspiration. Specifically check out the examples using GeomScatter - the bill/tips one seems to be close to what you could do.

Upvotes: 0

andrew

Reputation: 4089

By default, the kind parameter of df.plot is 'line'. For exploratory data analysis, it is often better to start with a scatter plot.

df.plot(x='inc', y='pop', kind='scatter')

Upvotes: 1
