Reputation: 2273
I have the following data:
city inc pop
New-York 29343,00 8683,00
Moscow 25896,00 17496,00
Boston 21785,00 15063,00
Berlin 20000,00 70453,00
London 44057,00 57398,00
Rome 24000,00 104831,00
I need to find out how inc depends on pop.
I tried plotting it with df.plot(x='inc', y='pop'), but the graph looks awful because I have 200 values.
How can I do this better?
Upvotes: 0
Views: 2223
Reputation: 42905
As noted, you get the correlation via:
df['inc'].corr(df['pop'])
-0.0279628856838
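For anyone who wants to reproduce that number, here is a minimal, self-contained sketch. It assumes the six sample rows from the question are available as whitespace-separated text with comma decimals; the variable names raw and r are only illustrative. Note that df.pop is a DataFrame method, so the column has to be accessed as df['pop'].

```python
import io

import pandas as pd

# The six sample rows from the question; the values use a comma as decimal mark.
raw = """city inc pop
New-York 29343,00 8683,00
Moscow 25896,00 17496,00
Boston 21785,00 15063,00
Berlin 20000,00 70453,00
London 44057,00 57398,00
Rome 24000,00 104831,00
"""

# decimal=',' makes pandas parse "29343,00" as the float 29343.0
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", decimal=",")

# Pearson correlation between the two numeric columns
r = df["inc"].corr(df["pop"])
print(r)
```

A value this close to zero suggests there is essentially no linear relationship in these six rows; the full 200-row data set may of course behave differently.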
If you want a linear regression, you can use the OLS class from statsmodels (sm.OLS):
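import statsmodels.api as sm
# sm.OLS does not add an intercept by itself, so add a constant column
df['const'] = 1
# Regress inc (dependent variable) on pop (independent variable)
model = sm.OLS(df['inc'], df[['const', 'pop']])
results = model.fit()
print(results.summary())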
which yields:
                            OLS Regression Results
==============================================================================
Dep. Variable:                    inc   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                 -0.249
Method:                 Least Squares   F-statistic:                  0.003130
Date:                Tue, 21 Jun 2016   Prob (F-statistic):              0.958
Time:                        07:29:55   Log-Likelihood:                -62.413
No. Observations:                   6   AIC:                             128.8
Df Residuals:                       4   BIC:                             128.4
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const        2.78e+04   6548.318      4.246      0.013      9623.205   4.6e+04
pop           -0.0064      0.114     -0.056      0.958        -0.322     0.310
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.613
Prob(Omnibus):                    nan   Jarque-Bera (JB):                1.721
Skew:                           1.302   Prob(JB):                        0.423
Kurtosis:                       3.330   Cond. No.                     9.46e+04
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
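If statsmodels is not installed, the coefficients and R-squared can be cross-checked with plain NumPy. This is only a sketch under the assumption that the six sample rows are typed in by hand; it swaps sm.OLS for np.linalg.lstsq and does not reproduce the standard errors or p-values.

```python
import numpy as np

# The six sample rows from the question, entered by hand
inc = np.array([29343.0, 25896.0, 21785.0, 20000.0, 44057.0, 24000.0])
pop = np.array([8683.0, 17496.0, 15063.0, 70453.0, 57398.0, 104831.0])

# Design matrix: a column of ones (the intercept) next to pop
X = np.column_stack([np.ones_like(pop), pop])

# Least-squares solution of X @ coef ~ inc
coef, _, _, _ = np.linalg.lstsq(X, inc, rcond=None)
intercept, slope = coef

# R-squared = 1 - residual sum of squares / total sum of squares
fitted = X @ coef
ss_res = np.sum((inc - fitted) ** 2)
ss_tot = np.sum((inc - inc.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"const={intercept:.1f}, pop={slope:.4f}, R^2={r_squared:.3f}")
```

The slope near zero and the tiny R-squared tell the same story as the summary table: on these six rows, pop explains almost none of the variation in inc.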
Finally, you can add a trendline to a scatter plot:
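import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
# Scatter plot with inc on the x-axis and pop on the y-axis
ax = df.plot.scatter('inc', 'pop')
# Fit a degree-1 polynomial (straight line): pop as a function of inc
z = np.polyfit(df['inc'], df['pop'], 1)
p = np.poly1d(z)
# Evaluate the fitted line at each inc value and draw it on the same axes
df['trend'] = p(df.inc)
df.plot(x='inc', y='trend', ax=ax)
plt.show()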
This produces a scatter plot with the fitted trend line (it looks weird because I'm only using your 6 sample data points).
You can also get the equation of the resulting line:
"y=%.6fx+(%.6f)" % (z[0], z[1])
y=-0.122779x+(49032.076720)
Upvotes: 1
Reputation: 34205
You can do various things to make it more readable. With just that line you get a line plot, so switch it to a scatter plot first.
If you're trying to show some correlation, you can overlay a regression line.
If that's too messy, you can play with colours, for example by making the points light grey and the regression line red.
Check out http://pandas.pydata.org/pandas-docs/stable/visualization.html for inspiration. Specifically, check out the examples using GeomScatter: the bill/tips one seems to be close to what you could do.
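A rough sketch of the grey-points-plus-red-line idea is below. The 200 rows are synthetic, generated only for illustration (the real data set is not shown in the question), and the output file name inc_vs_pop.png is made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the 200 real rows (assumption: made-up values)
rng = np.random.default_rng(0)
df = pd.DataFrame({"pop": rng.uniform(5_000, 110_000, size=200)})
df["inc"] = 25_000 + 0.05 * df["pop"] + rng.normal(0, 5_000, size=200)

# Light grey points keep the cloud of 200 values in the background
ax = df.plot.scatter(x="pop", y="inc", color="lightgrey")

# Straight-line fit of inc on pop, drawn in red on top of the points
slope, intercept = np.polyfit(df["pop"], df["inc"], 1)
xs = np.array([df["pop"].min(), df["pop"].max()])
ax.plot(xs, slope * xs + intercept, color="red", label="linear fit")
ax.legend()

ax.figure.savefig("inc_vs_pop.png")
```

Here pop is on the x-axis because the question asks how inc depends on pop; swap the two columns if you prefer the orientation of the original call.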
Upvotes: 0
Reputation: 4089
By default, the kind parameter of plot is 'line'. For exploratory data analysis, it is often better to start with scatter plots.
df.plot(x='inc', y='pop', kind='scatter')
Upvotes: 1