Reputation: 79
#import libraries
import pandas as pd
import numpy as np
import random as rd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('C:/Users/yehya/Desktop/cmps276/forestfires.csv')
data = pd.get_dummies(data)
#Visualise data points
sns.pairplot(data)
# sns.plt is no longer available in current seaborn; call matplotlib's pyplot directly
plt.show()
I'm trying to draw a simple scatterplot matrix using sns.pairplot. My end goal is to apply K-means clustering to my data, but I want to visualize the data before applying anything. Using the code above, I got these results. The data consists of 13 columns and about 450 rows. I'm new to these data manipulation algorithms and visualizations, and I'm not sure I'm approaching this problem in the correct way. What might be a better way to visualize my data? The target column is area. I'll leave a link to the dataset (forestfires), which can be found on Kaggle: https://www.kaggle.com/elikplim/forest-fires-data-set. Help would be appreciated, thanks.
Upvotes: 1
Views: 660
Reputation: 46958
Some of your columns are categorical. Even after you one-hot encode them, they only take the values 0 and 1, so plotting them in a scatterplot will not make much sense:
import pandas as pd
import numpy as np
import seaborn as sns
data = pd.read_csv('./forestfires.csv')
data.dtypes
X          int64
Y          int64
month     object
day       object
FFMC     float64
DMC      float64
DC       float64
ISI      float64
temp     float64
RH         int64
wind     float64
rain     float64
area     float64
dtype: object
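To see why the one-hot encoded frame is awkward to pairplot, here is a minimal sketch on a made-up toy frame (the values below are invented for illustration and are not taken from forestfires.csv). It only shows how pd.get_dummies expands the two object columns into one indicator column per category. A pairplot draws one panel per pair of columns, so every extra indicator column adds a full row and column of panels.

```python
import pandas as pd

# Toy stand-in for the real data: two categorical columns and two
# numeric ones. All values are invented for illustration.
toy = pd.DataFrame({
    "month": ["aug", "sep", "mar", "aug"],
    "day": ["fri", "sat", "sun", "fri"],
    "temp": [22.1, 19.8, 11.2, 24.3],
    "area": [0.0, 1.5, 0.0, 10.2],
})

# get_dummies replaces each object column with one indicator column
# per distinct category; numeric columns are left unchanged.
encoded = pd.get_dummies(toy)

print(toy.shape, "->", encoded.shape)
print(encoded.columns.to_list())
```

On the real dataset the same expansion happens for every month and weekday present, which is what makes the pairplot grid so large.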
If you restrict the plot to the numerical columns first, the pairplot works fine:
num_cols = data.select_dtypes('number').columns.to_list()
num_cols
['X', 'Y', 'FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain', 'area']
sns.pairplot(data[num_cols])
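You can visualize the categorical columns (month and day) separately, using the categorical plots shown in the seaborn documentation, for example a boxplot or stripplot of a numeric column grouped by category.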
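As a rough sketch of that idea, the snippet below draws one box per month for a numeric column using sns.boxplot. The toy frame and the helper name plot_by_category are assumptions made for this example (the values are invented, not read from forestfires.csv). With the real data you would pass the loaded frame instead.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy stand-in for forestfires.csv; values are invented for illustration.
toy = pd.DataFrame({
    "month": ["aug", "aug", "sep", "sep", "mar", "mar"],
    "day": ["fri", "sat", "fri", "sun", "sat", "sun"],
    "temp": [22.1, 24.3, 19.8, 18.4, 11.2, 9.7],
})


def plot_by_category(df, cat_col, num_col):
    """Draw one box per category of cat_col showing the spread of num_col.

    Hypothetical helper for this example, not part of seaborn.
    """
    fig, ax = plt.subplots(figsize=(6, 4))
    sns.boxplot(data=df, x=cat_col, y=num_col, ax=ax)
    ax.set_title(f"{num_col} by {cat_col}")
    return ax


ax = plot_by_category(toy, "month", "temp")
plt.show()
```

Swapping sns.boxplot for sns.stripplot in the helper shows the individual points instead of summary boxes, which can be easier to read when a category has only a few rows.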
Upvotes: 2