Reputation: 30332
I am working through some statistics examples using scikit-learn (0.20.0), and trying to plot some things as I go with Seaborn (0.9.0). I keep encountering errors when I try to plot data sets I've combined using Pandas' concat() function.
Here is the most minimal example I could construct:
import numpy
import pandas
import seaborn
X = numpy.array([[-1, -1, "A"]])
P = numpy.array([[-0.8, -1]])
data_x = pandas.DataFrame(X, columns=('x','y','group'))
data_p = pandas.DataFrame(P, columns=('x','y'))
data_p['group'] = "B"
combined = pandas.concat([data_x, data_p], ignore_index=True, sort=True)
seaborn.scatterplot(data=combined, x='x', y='y')
This results in a traceback ending in:
TypeError: -0.8 is not a string
If I remove the 'A' and 'group' columns, there's no error. If I plot data_x or data_p separately, there's no error. But I'm using Seaborn to plot the results of supervised classification exercises, so having e.g. columns for the 2D data plus category columns for grouping (e.g. group is A or B, differentiated by hue) and whether something was known or predicted (e.g. kind is known or predicted, differentiated by style) is very useful.
Hence I don't want to drop category columns just to avoid the errors here.
What am I doing wrong?
Upvotes: 4
Views: 978
Reputation: 4233
When you construct a numpy array that contains a string, numpy coerces every other value in the array to a string as well, and pandas then stores those values as objects.
X = numpy.array([[-1, -1, "A"]])
print (X)
array([['-1', '-1', 'A']], dtype='<U11')
P = numpy.array([[-0.8, -1]])
array([[-0.8, -1. ]]) ## Remains as float.
So constructing a dataframe from array X results in a dataframe where all columns have object dtype, whereas the x and y columns of data_p remain float.
data_x = pandas.DataFrame(X, columns=('x','y','group'))
print (data_x.dtypes)
x object
y object ## object dtypes
group object
dtype: object
data_p = pandas.DataFrame(P, columns=('x','y'))
data_p['group'] = "B"
print (data_p.dtypes)
x float64
y float64 ## Here x and y remains as float.
group object
dtype: object
Now, when you concat both dataframes, the x and y columns, being object in one and float in the other, default to object dtype in combined.
combined = pandas.concat([data_x, data_p], ignore_index=True, sort=True)
print (combined.dtypes)
group object
x object
y object
dtype: object
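To see what the object dtype hides, here is a small sketch that rebuilds combined from the question and prints each value of the x column next to its Python type. On the versions discussed here, it should show a string from data_x sitting beside a float from data_p, which is the mix the error message complains about.

```python
import numpy
import pandas

# Rebuild the frames exactly as in the question
X = numpy.array([[-1, -1, "A"]])
P = numpy.array([[-0.8, -1]])
data_x = pandas.DataFrame(X, columns=('x', 'y', 'group'))
data_p = pandas.DataFrame(P, columns=('x', 'y'))
data_p['group'] = "B"
combined = pandas.concat([data_x, data_p], ignore_index=True, sort=True)

# Show every value in column x together with the name of its type
for value in combined['x']:
    print(repr(value), type(value).__name__)
```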
So the TypeError is caused by the resulting x and y columns being object dtype.
A scatter plot requires numeric columns for plotting.
combined = combined.apply(pandas.to_numeric, errors='ignore') ## Convert to numeric where possible
print (combined.dtypes)
group object
x float64
y float64
dtype: object
seaborn.scatterplot(data=combined, x='x', y='y')
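Recent pandas releases deprecate errors='ignore', so an alternative sketch is to convert only the coordinate columns and leave group alone. This assumes x and y are the only columns that need to be numeric, and it raises an error if either contains a value that cannot be parsed as a number.

```python
import numpy
import pandas

# Rebuild the frames exactly as in the question
X = numpy.array([[-1, -1, "A"]])
P = numpy.array([[-0.8, -1]])
data_x = pandas.DataFrame(X, columns=('x', 'y', 'group'))
data_p = pandas.DataFrame(P, columns=('x', 'y'))
data_p['group'] = "B"
combined = pandas.concat([data_x, data_p], ignore_index=True, sort=True)

# Convert only the coordinate columns; 'group' is left untouched
for column in ('x', 'y'):
    combined[column] = pandas.to_numeric(combined[column])

print(combined.dtypes)
```

After this, the same seaborn.scatterplot call from the question can be given combined, and hue='group' remains available because the category column was not touched.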
Upvotes: 3
Reputation: 212
When you create your data like that, all elements of the X array are treated as objects. You can see this when you call data_x.info().
To avoid it, make sure that x and y in your original DataFrames are of a numeric type when generating the data (I assume what you posted is just an example). This is the recommended solution.
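One way to do that, as a rough sketch: keep the coordinates in a purely numeric array and attach the labels as a separate column afterwards. It assumes every row of X belongs to group "A", as in the question's example.

```python
import numpy
import pandas

# Coordinates only, so numpy keeps a numeric dtype
X = numpy.array([[-1, -1]])
P = numpy.array([[-0.8, -1]])

# Labels are added as a separate column after the frame exists
data_x = pandas.DataFrame(X, columns=['x', 'y'])
data_x['group'] = "A"
data_p = pandas.DataFrame(P, columns=['x', 'y'])
data_p['group'] = "B"

combined = pandas.concat([data_x, data_p], ignore_index=True, sort=True)
print(combined.dtypes)
```

With x and y numeric from the start, no conversion step is needed before plotting.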
If for any reason that is impossible, you can convert the columns afterwards, e.g.
combined['x'] = combined['x'].astype('float')  # float rather than int, so -0.8 keeps its fractional part
combined['y'] = combined['y'].astype('float')
Upvotes: 1