Christine

Reputation: 35

Unable to plot a Pandas dataframe in Jupyter notebook

I am coding in a Jupyter notebook that I opened through a GCP cluster. I am reading data in from BigQuery using the Spark-BigQuery connector. I'm trying to take a subset of this data and plot it, but whenever I run the plotting command, the kernel disconnects/reconnects. This has happened before in places where I was doing something wrong and hadn't noticed (so I know it isn't just disconnecting at random), but in this case I really have no idea what I'm doing wrong. What I'm doing is very similar to the following tutorial on GitHub.

I read the data into a Spark DataFrame, convert it to a Pandas dataframe, and then try to plot it. This is where the error occurs. I've experimented with differently sized subsets, so I know this isn't happening because my dataset is too big. I've also tried creating a "test" dataframe with random numbers and plotting that - it works perfectly. So it has to be a problem with my dataset...I'm just not sure what. Code below:

Reading the data in:

import pandas as pd
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder \
  .appName('Jupyter BigQuery Storage')\
  .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar') \
  .getOrCreate()

table = "bigquery-public-data.ncaa_basketball.mbb_pbp_sr"
df = spark.read \
  .format("bigquery") \
  .option("table", table) \
  .load()
df.printSchema()

df.createOrReplaceTempView('df')

query_string = """
    SELECT event_type,
    season,
    type,
    team_alias,
    team_market,
    team_name,
    team_basket,
    event_id,
    event_coord_x,
    event_coord_y,
    three_point_shot,
    shot_made
    FROM df
    WHERE type = "fieldgoal"
        AND event_coord_x IS NOT NULL
        AND event_coord_y IS NOT NULL
    ORDER BY season
"""

df_shots = spark.sql(query_string)
df_shots.orderBy("season", "event_id").toPandas().head(5)

import matplotlib.pyplot as plt
%matplotlib inline

# Convert the Spark DataFrame to Pandas and plot the shot coordinates
df_test = df_shots.toPandas()

df_test.plot(x='event_coord_x', y='event_coord_y', kind='line', figsize=(12, 6))

The output for the last part is just:

<matplotlib.axes._subplots.AxesSubplot at 0x7f355a732950>

And then the kernel disconnects/reconnects. For reference, both event_coord_x and event_coord_y are of type float64. I don't see why that would cause any problems, but I even tried converting them to integers and plotting, and the issue still arises.
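For completeness, the random-number test and the integer conversion I mentioned were along these lines (a sketch from memory; the synthetic column names and sizes are just illustrative):

import numpy as np
import pandas as pd

# A purely synthetic dataframe of random numbers plots with no problem at all
df_random = pd.DataFrame({'x': np.random.rand(1000),
                          'y': np.random.rand(1000)})
df_random.plot(x='x', y='y', kind='line', figsize=(12, 6))

# Casting the coordinate columns to integers didn't help either
df_test['event_coord_x'] = df_test['event_coord_x'].astype(int)
df_test['event_coord_y'] = df_test['event_coord_y'].astype(int)
df_test.plot(x='event_coord_x', y='event_coord_y', kind='line', figsize=(12, 6))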

I have a feeling that this may be something really trivial, but right now I'm stumped. Sorry I don't have anything specific like an error message (because there isn't one). Any suggestions would be immensely helpful.

Upvotes: 1

Views: 263

Answers (1)

aga

Reputation: 3883

When using the Cloud Dataproc 1.5 image version, the kernel appears to die and restart while plotting the figure; this can be seen in the Jupyter logs. The problem is connected to Apache Knox, which is used by the Cloud Dataproc cluster.

Knox limits the websocket message size to its buffer size, which is insufficient for some Jupyter interactions, such as sending a large figure back to the browser. This should be fixed in the next image release.

For now, the workaround is to use the Cloud Dataproc 1.4 image version or to change the figsize parameter to smaller values.
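For example, re-running the plot with a smaller figure keeps the rendered output under Knox's websocket buffer limit (a sketch; the exact size that works may vary):

# Smaller figsize -> smaller figure payload sent back over the websocket
df_test.plot(x='event_coord_x', y='event_coord_y', kind='line', figsize=(6, 3))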

Upvotes: 1
