doom4

Reputation: 695

Seaborn pairplot: Lost data in map_lower when using hue

When I define hue to color my plot, map_lower calls its function more often and loses data compared to the equivalent call without hue. Is this a bug, or am I making a mistake?

Please see the code below:

import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats
import seaborn as sns


def corrfunc(x, y, **kws):
    r, _ = stats.pearsonr(x, y)
    print(x)
    print(y)
    print(r)

iris = sns.load_dataset("iris")
seax = sns.pairplot(iris, size=2, vars=["petal_width", "petal_length", "sepal_width"])
seax.map_lower(corrfunc)
plt.show()

If you change

seax = sns.pairplot(iris, size=2, vars=["petal_width", "petal_length", "sepal_width"])

to

seax = sns.pairplot(iris, hue="sepal_length", size=2, vars=["petal_width", "petal_length", "sepal_width"])

the code breaks, although the plot looks fine. Without hue, corrfunc is called 3 times, once for each of the 3 plots in the lower triangle. If I add hue="class" to color the plot by the field class, map_lower calls corrfunc 8 times or so. I don't understand why coloring with hue has an effect on map_lower.

Upvotes: 0

Views: 1185

Answers (2)

doom4

Reputation: 695

So maybe one day this will help somebody who wants to do what I had in mind. Here is my ugly but working solution:

#!/usr/bin/env python
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats
import seaborn as sns

# Global variables to keep track of the data chunks. When hue is
# set, map_lower passes the data in chunks, one per hue value.

dataLength = xName = yName = xData = yData = ''


# Function to group data pairs to plot their correlation
def assemble_data_subplot(x, y, **kwargs):
    global xName, yName, xData, yData, dataLength
    if xName == '' and yName == '':
        # First chunk of a new subplot
        xName = x.name
        yName = y.name
        xData = x
        yData = y
    elif xName == x.name and yName == y.name:
        # Further chunk of the same subplot. Series.append was removed
        # in pandas 2.0, so concatenate instead.
        xData = pd.concat([xData, x])
        yData = pd.concat([yData, y])

    if len(xData) == dataLength:
        # All chunks collected: annotate and reset for the next subplot
        correlate_data(xData, yData)
        xName = yName = xData = yData = ''


# Correlation function
def correlate_data(xData, yData):
    r, _ = stats.pearsonr(xData, yData)
    r = r**2
    sax = plt.gca()
    sax.annotate("$r^2$={:.2f}".format(r),
                 xy=(.02, .86),
                 xycoords=sax.transAxes)


# Main function to plot the pairwise correlation plot
def main():
    # Init global variable to set it later
    global dataLength

    # Load the example data set into a data frame
    df = sns.load_dataset("iris")

    # Example with hue
    g = sns.pairplot(df, size=2, hue="petal_width",
                     vars=["petal_width",
                           "petal_length",
                           "sepal_width"])

    # Get the number of data entries to check when the assembled data
    # is complete. Used in assemble_data_subplot
    dataLength = len(df)

    # Plot the r^2 value on the lower part of the pair plot
    g.map_lower(assemble_data_subplot)

    # Generate the output
    g.savefig("output.png")
    plt.show()


if __name__ == "__main__":
    main()
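A possible alternative that avoids the global state, shown only as a sketch: loop over the lower-triangle axes of the grid and compute r^2 on the full columns, so the per-hue chunking never comes into play. The helper name annotate_r2 and the synthetic data frame are made up for illustration (the synthetic data just keeps the example independent of the iris download), and the sketch assumes the columns contain no missing values.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats


def annotate_r2(g, data):
    """Annotate every lower-triangle axis of grid g with r^2 over all rows.

    Hypothetical helper, not part of seaborn. Returns the r^2 values
    keyed by (x variable, y variable).
    """
    results = {}
    # Indices strictly below the diagonal of the axes array
    for i, j in zip(*np.tril_indices_from(g.axes, k=-1)):
        x_var, y_var = g.x_vars[j], g.y_vars[i]
        r, _ = stats.pearsonr(data[x_var], data[y_var])
        g.axes[i, j].annotate("$r^2$={:.2f}".format(r ** 2),
                              xy=(.02, .86), xycoords="axes fraction")
        results[(x_var, y_var)] = r ** 2
    return results


# Synthetic stand-in data: c is an exact linear function of a
rng = np.random.default_rng(0)
a = rng.normal(size=30)
df = pd.DataFrame({
    "a": a,
    "b": rng.normal(size=30),
    "c": -2 * a + 1,
    "group": np.repeat(["g1", "g2", "g3"], 10),
})

g = sns.PairGrid(df, vars=["a", "b", "c"], hue="group")
g.map_offdiag(plt.scatter)
r2 = annotate_r2(g, df)
plt.close("all")
```

With a grid returned by sns.pairplot the call would be the same: annotate_r2(g, df).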

Upvotes: 1

error

Reputation: 2471

Looking at the code that defines map_lower, we see the following (abridged; the parts I left out are not relevant to the answer):

def map_lower(self, func, **kwargs):
    # irrelevant parts left out
    for k, label_k in enumerate(self.hue_names):

        # some more irrelevant parts (specifying colours and whatnot)

        func(data_k[x_var], data_k[y_var], label=label_k,
             color=color, **kwargs)

    return self

So, for every unique hue value that is present, the func passed to map_lower is run once per subplot (that is, once for each pair of variables), and each call receives only the rows with that hue value.

When no hue is given, func is run only once per subplot, on all the relevant data. Hence the difference in the number of calls to func between using hue and not using it.
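A small sketch to check this behaviour, assuming a reasonably recent seaborn. The data frame, the column names and the count_calls helper are made up for illustration; synthetic data is used so nothing has to be downloaded.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: three numeric columns and a hue column with 3 levels
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.normal(size=30),
    "b": rng.normal(size=30),
    "c": rng.normal(size=30),
    "group": np.repeat(["g1", "g2", "g3"], 10),
})


def count_calls(hue=None):
    """Return the number of rows passed to func on each map_lower call."""
    calls = []

    def record(x, y, **kwargs):
        # Record only how many rows this call received
        calls.append(len(x))

    g = sns.PairGrid(df, vars=["a", "b", "c"], hue=hue)
    g.map_lower(record)
    plt.close("all")
    return calls


no_hue = count_calls()
with_hue = count_calls(hue="group")
print(len(no_hue), no_hue)
print(len(with_hue), with_hue)
```

Without hue there should be one call per lower subplot, each with all 30 rows. With the 3-level hue column there should be three calls per subplot, each with the 10 rows of one group.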

Upvotes: 0
