Reputation: 799
I load my data from an SQL query into a pandas DataFrame. The data looks like this:
group phone_brand
0 M32-38 小米
1 M32-38 小米
2 M32-38 小米
3 M29-31 小米
4 M29-31 小米
5 F24-26 OPPO
6 M32-38 酷派
7 M32-38 小米
8 M32-38 vivo
9 F33-42 三星
10 M29-31 华为
11 F33-42 华为
12 F27-28 三星
13 M32-38 华为
14 M39+ 艾优尼
15 F27-28 华为
16 M32-38 小米
17 M32-38 小米
18 M39+ 魅族
19 M32-38 小米
20 F33-42 三星
21 M23-26 小米
22 M23-26 华为
23 M27-28 三星
24 M29-31 小米
25 M32-38 三星
26 M32-38 三星
27 F33-42 三星
28 M32-38 三星
29 M32-38 三星
... ... ...
74809 M27-28 华为
74810 M29-31 TCL
Now I want to find the correlation between these two columns and the frequency of their combinations, and visualize the result with Matplotlib. I tried something like:
DataFrame.plot(style='o')
plt.show()
Now how can I visualize this correlation in the simplest way?
Upvotes: 12
Views: 11742
Reputation: 93
Apart from the method piRSquared explained very clearly, you can use LabelEncoder, which transforms the values into numeric form so that the machine can interpret the features correctly.
# Import the label encoder
from sklearn.preprocessing import LabelEncoder
# Create the label_encoder object
le = LabelEncoder()
# Fit the label encoder and return the encoded labels
df['group'] = le.fit_transform(df['group'])
df['phone_brand'] = le.fit_transform(df['phone_brand'])
# Find the correlation
df.corr()
# Output for the first 10 rows
group phone_brand
group 1.00000 0.67391
phone_brand 0.67391 1.00000
After applying LabelEncoder, our DataFrame is converted from this:
group phone_brand
0 M32-38 小米
1 M32-38 小米
2 M32-38 小米
3 M29-31 小米
4 M29-31 小米
5 F24-26 OPPO
6 M32-38 酷派
7 M32-38 小米
8 M32-38 vivo
9 F33-42 三星
10 M29-31 华为
to this:
group phone_brand
0 3 4
1 3 4
2 3 4
3 2 4
4 2 4
5 0 0
6 3 5
7 3 4
8 3 1
9 1 2
10 2 3
For multiple columns, you can go through the answers.
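The snippet above reuses a single le for both columns, so after the second fit it only remembers the phone_brand labels. A minimal sketch of one way around that, assuming you want to map the codes back later: keep one encoder per column. The encoders dict and the encoded copy are names made up for this illustration, and the sample is the 11 rows shown above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# The 11 rows shown in the answer above, used as a small sample
df = pd.DataFrame({
    "group": ["M32-38", "M32-38", "M32-38", "M29-31", "M29-31", "F24-26",
              "M32-38", "M32-38", "M32-38", "F33-42", "M29-31"],
    "phone_brand": ["小米", "小米", "小米", "小米", "小米", "OPPO",
                    "酷派", "小米", "vivo", "三星", "华为"],
})

# One encoder per column (hypothetical dict), so each mapping stays available
encoders = {}
encoded = df.copy()
for col in ["group", "phone_brand"]:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(df[col])

# classes_ holds the sorted labels; a label's position is its integer code
print(list(encoders["group"].classes_))

# Map the integer codes back to the original labels
restored = encoders["group"].inverse_transform(encoded["group"])
print(list(restored[:3]))
```

The codes only reflect the sorted order of the labels, so any correlation computed from them depends on that ordering.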
Upvotes: 0
Reputation: 1
Use the pandas.factorize() method, which returns a numeric representation of an array by identifying its distinct values.
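As a rough illustration of what factorize returns, here is a small sketch on the first six rows of the question's data. The variable names codes and uniques are just chosen for this example.

```python
import pandas as pd

# First six rows of the question's data, used as a small sample
df = pd.DataFrame({
    "group": ["M32-38", "M32-38", "M32-38", "M29-31", "M29-31", "F24-26"],
    "phone_brand": ["小米", "小米", "小米", "小米", "小米", "OPPO"],
})

# factorize returns integer codes plus the distinct values they stand for
codes, uniques = pd.factorize(df["group"])

print(list(codes))    # [0, 0, 0, 1, 1, 2]
print(list(uniques))  # ['M32-38', 'M29-31', 'F24-26']
```

By default the codes follow the order of first appearance, not sorted order, so they will generally differ from what LabelEncoder produces for the same column.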
Upvotes: 0
Reputation: 294218
To quickly get a correlation:
df.apply(lambda x: x.factorize()[0]).corr()
group phone_brand
group 1.000000 0.427941
phone_brand 0.427941 1.000000
Heat map
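import pandas as pd
import seaborn as sns
# Count each (group, phone_brand) combination and draw the counts as a heat map
sns.heatmap(pd.crosstab(df.group, df.phone_brand))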
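Since the question asks for Matplotlib, the same crosstab can also be drawn without seaborn. This is only a sketch on the first 11 rows of the question's data. The output file name is an assumption for this example, and the Chinese brand names need a font with CJK glyphs to render properly.

```python
import matplotlib.pyplot as plt
import pandas as pd

# First 11 rows of the question's data, used as a small sample
df = pd.DataFrame({
    "group": ["M32-38", "M32-38", "M32-38", "M29-31", "M29-31", "F24-26",
              "M32-38", "M32-38", "M32-38", "F33-42", "M29-31"],
    "phone_brand": ["小米", "小米", "小米", "小米", "小米", "OPPO",
                    "酷派", "小米", "vivo", "三星", "华为"],
})

# Frequency table: rows are groups, columns are brands, cells are counts
counts = pd.crosstab(df["group"], df["phone_brand"])

fig, ax = plt.subplots()
im = ax.imshow(counts.to_numpy(), aspect="auto")

# Label the ticks with the category names from the crosstab
ax.set_xticks(range(counts.shape[1]))
ax.set_xticklabels(counts.columns)
ax.set_yticks(range(counts.shape[0]))
ax.set_yticklabels(counts.index)
ax.set_xlabel("phone_brand")
ax.set_ylabel("group")
fig.colorbar(im, ax=ax, label="count")

# Hypothetical file name; call plt.show() instead for an interactive window
fig.savefig("group_vs_phone_brand.png")
```

The counts table is itself the frequency answer; the image only colors its cells.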
Upvotes: 20