AHF

Reputation: 1072

Finding the correlation between variables using Python

I am trying to find the correlation between all the columns in this dataset, excluding quality, and then plot the frequency distribution of wine quality.

I am doing it the following way, but how do I remove quality?

import pandas as pd
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv', sep=';')
df.corr()

It returns this output:

[image: output of df.corr()]

How can I graph the frequency distribution of wine quality with pandas?

I previously did this in R, where it worked fine, but I am using this dataset to learn pandas and Python:

winecor = cor(wine[-12])
hist(wine$quality)

In R I get the following output, and I am looking for the same in Python.

[image: R output of cor(wine[-12])]

[image: R output of hist(wine$quality)]

Upvotes: 0

Views: 1010

Answers (2)

Arturo Sbr

Reputation: 6333

1. Histogram

# Import plotting library
import matplotlib.pyplot as plt

### Option 1 - histogram
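# Bin edges 3..10 give each score from 3 to 9 its own bin
# (with range(3, 10) the last bin would merge scores 8 and 9)
plt.hist(df['quality'], bins=range(3, 11))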
plt.show()

### Option 2 - bar plot (looks nicer)
# Get frequency per quality group
x = df.groupby('quality').size()
# Plot
plt.bar(x.index, x.values)
plt.show()
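
Since the question asks for a pandas way, here is a minimal sketch of the same frequency count using value_counts(). The small DataFrame is a hypothetical stand-in so the snippet runs on its own; with the real data you would use the df loaded by read_csv.

```python
import pandas as pd

# Hypothetical stand-in for the wine data (assumption, not the real file)
df = pd.DataFrame({"quality": [5, 6, 6, 7, 5, 6, 8, 6, 5, 7]})

# Count the rows per quality score, then order the result by score
counts = df["quality"].value_counts().sort_index()
print(counts)
```

Calling counts.plot(kind="bar") on the result should draw the same bar chart as Option 2, since pandas hands the plotting to matplotlib.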

2. Correlation matrix

In order to get the correlation matrix of features, excluding quality:

# Option 1 - very similar to R
df.iloc[:, :-1].corr()

# Option 2 - more Pythonic
df.drop('quality', axis=1).corr()
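
A quick way to check that quality is really gone from the result is to try it on a tiny table first. This is only a sketch: the values are made up, and only the column names mirror the wine data.

```python
import pandas as pd

# Hypothetical miniature of the wine table (made-up values)
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.30],
    "quality": [5, 5, 6, 6, 7],
})

# Drop the target column by name, then compute pairwise Pearson correlations
features_corr = df.drop(columns="quality").corr()
print(features_corr.round(2))
```

Dropping by name keeps working if the column order changes, whereas iloc[:, :-1] assumes quality is the last column.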

Upvotes: 2

BleakHeart

Reputation: 51

You can plot histograms with:

import matplotlib.pyplot as plt 

plt.hist(x=df['quality'], bins=30)
plt.show()

See the plt.hist() documentation for a fuller description of its parameters.
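
One thing to watch with integer scores: a large bins value leaves many bins empty. A possible alternative, sketched here with made-up scores, is to build the bin edges at half-integers so that each score gets exactly one bin.

```python
import numpy as np

# Hypothetical quality scores (assumption, not the real data)
scores = np.array([5, 6, 6, 7, 5, 6, 8, 6, 5, 7])

# Edges at 4.5, 5.5, ..., 8.5 place each integer score in its own bin
edges = np.arange(scores.min() - 0.5, scores.max() + 1.5)
counts, _ = np.histogram(scores, bins=edges)
print(edges, counts)
```

The same edges array can be passed as bins to plt.hist(), which then centers each bar on its score.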

Upvotes: 1
