Finding the correlation between variables using Python

Question

I am trying to find the correlation between all the columns in this dataset, excluding quality, and then plot the frequency distribution of wine quality.

I am doing it the following way, but how do I exclude the quality column?

import pandas as pd
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv', sep=';')
df.corr()

It returns the correlation matrix for every column, including quality.

How can I graph the frequency distribution of wine quality with pandas?

I previously did this in R, where it worked fine, but I am using this dataset to learn pandas and Python. The R code was:

winecor = cor(wine[-12])
hist(wine$quality)

That gives me the output I want in R, and I am looking for the same in Python.

Arturo Sbr · Accepted Answer

1. Histogram

# Import plotting library
import matplotlib.pyplot as plt

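### Option 1 - histogram
# Edges 3..10 give one bin per integer score; with range(3, 10) the
# last bin would be [8, 9] and merge scores 8 and 9
plt.hist(df['quality'], bins=range(3, 11))
plt.show()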

### Option 2 - bar plot (looks nicer)
# Get frequency per quality group
x = df.groupby('quality').size()
# Plot
plt.bar(x.index, x.values)
plt.show()
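
Since the question asks for a pandas way, here is a minimal sketch that counts the scores with value_counts and plots them through the pandas plotting wrapper. The small DataFrame is made-up stand-in data so the snippet runs without the download; with the real data you would use the df from the question instead of sample.

```python
import pandas as pd

# Made-up stand-in for the wine data (assumption: only 'quality' matters here)
sample = pd.DataFrame({"quality": [6, 6, 6, 5, 7, 9]})

# Frequency of each quality score, ordered by score rather than by count
counts = sample["quality"].value_counts().sort_index()
print(counts)

# Bar chart via pandas (it uses matplotlib under the hood)
ax = counts.plot(kind="bar")
ax.set_xlabel("quality")
ax.set_ylabel("frequency")
# Call plt.show() from matplotlib.pyplot to display the figure in a script
```

sort_index() matters here: value_counts orders by frequency by default, which would shuffle the quality scores along the x-axis.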

2. Correlation matrix

In order to get the correlation matrix of features, excluding quality:

# Option 1 - very similar to R
df.iloc[:, :-1].corr()

# Option 2 - more Pythonic
df.drop('quality', axis=1).corr()
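
To check that both options give the same matrix, here is a small sketch on made-up data. The values are invented and only the column names mirror the dataset. Option 1 relies on quality being the last column, while Option 2 selects it by name.

```python
import pandas as pd

# Made-up stand-in for the wine data (values are illustrative only)
sample = pd.DataFrame({
    "fixed acidity": [7.0, 6.3, 8.1, 7.2, 6.2, 8.1],
    "alcohol": [8.8, 9.5, 10.1, 9.9, 9.6, 11.0],
    "quality": [6, 6, 6, 5, 7, 9],
})

# Option 1: drop the last column by position
by_position = sample.iloc[:, :-1].corr()

# Option 2: drop the column by name
by_name = sample.drop(columns="quality").corr()

print(by_name)
```

If the column order ever changes, only the name-based version keeps dropping the right column.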
