Dr.

Reputation: 79

sns.pairplot returns bad results for Kmeans cluster visualizations

#import libraries
import pandas as pd
import numpy as np
import random as rd
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('C:/Users/yehya/Desktop/cmps276/forestfires.csv')
data = pd.get_dummies(data)

#Visualise data points

sns.pairplot(data)
plt.show()  # call matplotlib directly; sns.plt is not available in current seaborn

I'm trying to draw a simple scatterplot matrix using sns.pairplot. My end goal is to apply K-means clustering to my data, but before applying anything I want to visualize the data with scatterplots. With the code above, the result I got was this: scatterplot result. The data consists of 13 columns and about 450 rows. I'm new to these data manipulation algorithms and visualizations, and I'm not sure I'm approaching this problem in the correct way. What might be a better way to visualize my data? The target column is area. The dataset (forestfires) can be found on Kaggle: https://www.kaggle.com/elikplim/forest-fires-data-set. Any help would be appreciated, thanks.

Upvotes: 1

Views: 660

Answers (1)

StupidWolf

Reputation: 46958

Some of your columns are categorical. Even after you one-hot encode them, plotting them in a scatterplot will not make much sense, because each encoded column only takes the values 0 and 1:

import pandas as pd
import numpy as np
import seaborn as sns

data = pd.read_csv('./forestfires.csv')
data.dtypes

X          int64
Y          int64
month     object
day       object
FFMC     float64
DMC      float64
DC       float64
ISI      float64
temp     float64
RH         int64
wind     float64
rain     float64
area     float64
dtype: object
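
To see why the one-hot encoded columns clutter the grid, here is a minimal sketch on a made-up four-row frame (the values and the name `toy` are hypothetical, not taken from forestfires.csv). It only shows how pd.get_dummies expands each object column into one indicator column per category, which is what multiplies the number of panels in a pairplot:

```python
import pandas as pd

# Hypothetical miniature frame with the same kinds of columns as the question's data
toy = pd.DataFrame({
    "month": ["aug", "sep", "mar", "aug"],
    "day": ["fri", "tue", "sat", "sun"],
    "temp": [8.2, 18.0, 14.6, 8.3],
    "area": [0.0, 0.0, 1.2, 5.4],
})

# Object columns are replaced by one indicator column per distinct category
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
# pairplot draws one panel per pair of columns, so the grid grows quadratically
print(len(toy.columns), "->", len(encoded.columns), "columns")
```

On the real data the two object columns would expand in the same way, one column per distinct month and day value.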

If you plot only the numerical columns first, the result is fine:

num_cols = data.select_dtypes('number').columns.to_list()

num_cols
['X', 'Y', 'FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain', 'area']

sns.pairplot(data[num_cols])


You can visualize the categorical values using the plots shown in the seaborn documentation.
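
As one example of such a plot, a box plot shows how a numeric column is distributed within each category. This is a minimal sketch on made-up values (the frame `toy` and the month order are assumptions for illustration). With the real data you would pass `data` instead:

```python
import pandas as pd
import seaborn as sns

# Hypothetical values; the real file has many more rows and months
toy = pd.DataFrame({
    "month": ["mar", "mar", "aug", "aug", "aug", "sep", "sep", "sep"],
    "area": [0.0, 1.2, 0.0, 5.4, 10.1, 0.5, 2.3, 0.0],
})

# One box per category; 'order' fixes the left-to-right order of the months
ax = sns.boxplot(data=toy, x="month", y="area", order=["mar", "aug", "sep"])
ax.set_title("area by month")
```

sns.countplot(data=toy, x="month") is a similar option when you only want to see how many rows fall into each category.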

Upvotes: 2
