Reputation: 521
EDITED TO SHOW EXAMPLE OF ORIGINAL DATAFRAME:
df.head(4)
shop category subcategory season
date
2013-09-04 abc weddings shoes winter
2013-09-04 def jewelry watches summer
2013-09-05 ghi sports sneakers spring
2013-09-05 jkl jewelry necklaces fall
I've successfully generated the following dataframe using get_dummies():
wedding_seasons = pd.get_dummies(df.loc[df['category'] == 'weddings', ['category', 'season']], prefix='', prefix_sep='')
wedding_seasons.head(3)
weddings winter summer spring fall
71654 1.0 0.0 1.0 0.0 0.0
72168 1.0 0.0 1.0 0.0 0.0
72080 1.0 0.0 1.0 0.0 0.0
The goal of the above is to help assess the frequency of weddings across seasons, so I've used corr() to generate the following result:
weddings fall spring summer winter
weddings NaN NaN NaN NaN NaN
fall NaN 1.000000 0.054019 -0.331866 -0.012122
spring NaN 0.054019 1.000000 -0.857205 0.072420
summer NaN -0.331866 -0.857205 1.000000 -0.484578
winter NaN -0.012122 0.072420 -0.484578 1.000000
I'm unsure why the weddings column is generating NaN values, but my gut feeling is that it originates from how I originally created wedding_seasons. Any guidance would be greatly appreciated so that I can properly assess column correlations.
Upvotes: 2
Views: 3558
Reputation: 231
I don't think what you're interested in seeing here is the "correlation".
All of the columns in the dataframe wedding_seasons contain floating point values; however, if my suspicions are correct, the rows in your original dataframe df contain something like transaction records, where each row corresponds to an individual.
Please tell me if I'm incorrect, but I'll proceed with my reasoning.
Correlation measures, intuitively, the tendency for values to vary together or against each other within the same observation (e.g. if X and Y are negatively correlated, then when we see X go above its mean, we'd expect Y to appear below its mean).
However, what you have here is data where, if one transaction is summer, then categorically it cannot possibly be winter at the same time. When you create wedding_seasons, Pandas creates dummy variables that are treated as floating point values when computing your correlation matrix; since it's impossible for any row to contain two 1.0 entries at the same time, the off-diagonal entries of your resulting correlation matrix are going to be pushed toward negative values.
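To make the point concrete, here is a minimal sketch using made-up season labels (hypothetical data, not the asker's). It assumes each row belongs to exactly one season, and shows that the dummy columns come out negatively correlated purely because they are mutually exclusive. If the goal is frequency per season, value_counts() is one way to get at it directly.

```python
import pandas as pd

# Hypothetical season labels for wedding rows; not the asker's real data.
seasons = pd.Series(
    ["summer", "summer", "summer", "spring", "spring", "fall", "winter", "summer"],
    name="season",
)

# One-hot encode; dtype=float mirrors the 1.0/0.0 values shown in the question.
dummies = pd.get_dummies(seasons, dtype=float)

# Each row holds exactly one 1.0, so every pair of distinct columns
# is negatively correlated by construction.
corr = dummies.corr()
print(corr)

# Frequency of each season: raw counts and shares of the total.
counts = seasons.value_counts()
shares = seasons.value_counts(normalize=True)
print(counts)
print(shares)
```

The counts and shares answer "how often does each season occur" without going through a correlation matrix at all.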
Upvotes: 1
Reputation: 386
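You could drop the weddings column before calling corr(). Note that drop() returns a new DataFrame rather than modifying wedding_seasons in place, so chain the call or assign the result:
wedding_seasons.drop(columns=['weddings']).corr()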
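As for why the NaN values appear in the first place: after filtering to category == 'weddings', the weddings dummy column is 1.0 in every row, so its variance is zero and the Pearson correlation, which divides by the standard deviation, is undefined. Below is a small sketch with invented data standing in for wedding_seasons, assuming that is indeed what happens in the asker's frame.

```python
import pandas as pd

# Invented stand-in for wedding_seasons: 'weddings' is constant because
# the rows were already filtered to category == 'weddings'.
wedding_seasons = pd.DataFrame({
    "weddings": [1.0, 1.0, 1.0, 1.0, 1.0],
    "fall":     [1.0, 0.0, 0.0, 0.0, 0.0],
    "spring":   [0.0, 1.0, 0.0, 0.0, 1.0],
    "summer":   [0.0, 0.0, 1.0, 0.0, 0.0],
    "winter":   [0.0, 0.0, 0.0, 1.0, 0.0],
})

# The constant column has zero variance, so its row and column are all NaN.
with_constant = wedding_seasons.corr()
print(with_constant)

# drop() returns a new DataFrame, so chain corr() onto the result.
without_constant = wedding_seasons.drop(columns=["weddings"]).corr()
print(without_constant)
```

Dropping the constant column removes the NaN row and column; the remaining season correlations are unchanged.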
Upvotes: 0