Mr. Jibz
Mr. Jibz

Reputation: 521

How to remove NaN values from corr() function output

EDITED TO SHOW EXAMPLE OF ORIGINAL DATAFRAME:

df.head(4)

            shop   category  subcategory     season
date                
2013-09-04  abc    weddings  shoes           winter
2013-09-04  def    jewelry   watches         summer
2013-09-05  ghi    sports    sneakers        spring
2013-09-05  jkl    jewelry   necklaces       fall

I've successfully generated the following dataframe using get_dummies():

wedding_seasons = pd.get_dummies(df.loc[df['category']=='weddings',['category','season']],prefix = '', prefix_sep = '' )

wedding_seasons.head(3)

        weddings    winter  summer  spring  fall
71654   1.0         0.0     1.0     0.0     0.0
72168   1.0         0.0     1.0     0.0     0.0
72080   1.0         0.0     1.0     0.0     0.0

The goal of the above is to help assess frequency of weddings across seasons, so I've used corr() to generate the following result:

         weddings   fall       spring     summer       winter
weddings NaN        NaN        NaN        NaN          NaN
fall     NaN        1.000000   0.054019   -0.331866    -0.012122
spring   NaN        0.054019   1.000000   -0.857205    0.072420
summer   NaN        -0.331866  -0.857205  1.000000     -0.484578
winter   NaN        -0.012122  0.072420   -0.484578    1.000000

I'm unsure why the wedding column is generating NaN values, but my gut feeling is that it originates from how I originally created wedding_seasons. Any guidance would be greatly appreciated so that I can properly assess column correlations.

Upvotes: 2

Views: 3558

Answers (2)

Kevin Li
Kevin Li

Reputation: 231

I don't think what you're interested in seeing here is the "correlation".

All of the columns in the dataframe wedding_seasons contain floating point values; however, if my suspicions are correct, the rows in your original dataframe df contain something like transaction records, where each row corresponds to an individual.

Please tell me if I'm incorrect, but I'll proceed with my reasoning.

Correlation will measure, intuitively, the tendency for values vary together/against each other within the same observation (e.g. if X and Y are negatively correlated, then when we see X go above its mean, we'd expect Y to appear below its mean).

However, what you have here is data where, if one transaction is summer, then categorically it cannot possibly be winter at the same time. When you create wedding_seasons, Pandas is creating dummy variables that are treated as floating point values when computing your correlation matrix; since it's impossible for any row to contain two 1.0 entries at the same time, clearly your resulting correlation matrix is going to have negative entries everywhere.

Upvotes: 1

David Gaertner
David Gaertner

Reputation: 386

You could drop the weddings column before doing corr().

wedding_seasons.drop(columns = ['weddings'])

Upvotes: 0

Related Questions