sopana

Reputation: 375

PySpark job on Dataproc gets stuck at stage 0

I have a Dataproc cluster with 2 worker nodes. My PySpark program is very simple:

1) Read 500 MB of data from BigQuery
2) Apply a few UDFs
3) Display results from a PySpark SQL DataFrame based on some condition

At the third step the job gets stuck at stage 0 and does nothing. I'm new to PySpark, but I don't think the data is large enough to make it hang. Please help me.
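For context, a minimal sketch of the three steps as they would typically look in PySpark on Dataproc; the table name and the filter condition are placeholders, since the question does not show them:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("bq-udf-filter").getOrCreate()

# 1) Read the source table through the spark-bigquery-connector
#    ("my-project.my_dataset.molecules" is a placeholder).
df = spark.read.format("bigquery") \
    .option("table", "my-project.my_dataset.molecules") \
    .load()

# 2) Apply the UDF(s) here (see the RDKit function below).

# 3) Filter on some condition and display; show() is an action,
#    so this is where Spark actually executes the whole pipeline.
df.filter(col("some_column") == "some_value").show()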

@Adam,

My UDF uses the RDKit library. Is it possible to make the UDF efficient enough that the output comes back in seconds?

from rdkit import Chem

# Query molecule built once from the user-supplied SMILES string.
user_smile_string = 'ONC(=O)c1ccc(I)cc1'
mol = Chem.MolFromSmiles(user_smile_string)

def Matched(smile_structure):
    # Returns True/False for a substructure match against the query molecule,
    # or None if the input SMILES string cannot be parsed.
    try:
        match = mol.HasSubstructMatch(Chem.MolFromSmiles(smile_structure))
    except Exception:
        return None
    return match
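In case it helps frame the question, this is roughly how the function gets wired up as a PySpark UDF and applied to the DataFrame; the column name "smiles" is a placeholder, not something from the original code:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType

# Wrap the plain Python function as a Spark SQL UDF returning a boolean.
matched_udf = udf(Matched, BooleanType())

# Add a match flag column and keep only the matching rows.
df = df.withColumn("is_match", matched_udf(col("smiles")))
df.filter(col("is_match")).show()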

Upvotes: 3

Views: 1456

Answers (1)

Igor Dvorzhak

Reputation: 4457

As mentioned in the comments, you need to troubleshoot your job to understand what's happening.

You can start by exploring the job driver output, the job logs and the Spark job DAG, all of which are accessible from the Google Cloud UI.

If this does not yield any useful information, then you need to enable debug logging in Spark and go from there.
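As a rough sketch, the log level can be raised from inside the job itself; the gcloud flag mentioned in the comment is an alternative at submit time, and the cluster and region names are placeholders:

# Raise the log level for the running Spark application.
spark.sparkContext.setLogLevel("DEBUG")

# Alternatively, request debug logging when submitting the job, e.g.:
#   gcloud dataproc jobs submit pyspark my_job.py \
#       --cluster=my-cluster --region=us-central1 \
#       --driver-log-levels root=DEBUG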

Upvotes: 1
