Reputation: 375
I have a Dataproc cluster with 2 worker nodes. My PySpark program is very simple:
1) Read 500MB of data from BigQuery 2) Apply a few UDFs 3) Display results from a PySpark SQL DataFrame based on some condition
At the third step the job gets stuck at stage 0 and does nothing. I'm new to PySpark, but I don't think the data is large enough for it to hang. Please help me.
@Adam,
My UDF uses the RDKit library. Is it possible to make the UDF efficient enough that the output comes back in seconds?
from rdkit import Chem

user_smile_string = 'ONC(=O)c1ccc(I)cc1'
mol = Chem.MolFromSmiles(user_smile_string)

def Matched(smile_structure):
    try:
        match = mol.HasSubstructMatch(Chem.MolFromSmiles(smile_structure))
    except Exception:
        # Unparseable SMILES: fall through and return None for this row
        pass
    else:
        return match
Upvotes: 3
Views: 1456
Reputation: 4457
As mentioned in the comments, you need to troubleshoot your job to understand what's happening.
You can start by exploring the job driver output, the job logs and the Spark job DAG, all of which are accessible from the Google Cloud UI.
If that does not yield any useful information, enable debug logging in Spark and go from there.
Upvotes: 1