Pandas UDF (PySpark) - Incorrect type Error

Question

I'm trying to do entity extraction with spaCy inside a Pandas UDF (PySpark), but I get an error.
A regular UDF works without errors but is slow. What am I doing wrong?

The model is loaded on every call to avoid this load error: Can't find model 'en_core_web_lg'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Working UDF:

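import spacy
from pyspark.sql import functions as F, types as T

def __get_entities(x):
    # Reload the model on every call (slow, but avoids the load error above)
    global nlp
    nlp = spacy.load("en_core_web_lg")
    ents = []

    doc = nlp(x)

    # Keep only the labels of PERSON and ORG entities
    for ent in doc.ents:
        if ent.label_ == 'PERSON' or ent.label_ == 'ORG':
            ents.append(ent.label_)

    return ents

get_entities_udf = F.udf(__get_entities, T.ArrayType(T.StringType()))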

Pandas UDF with error:

def __get_entities(x):

    global nlp
    nlp = spacy.load("en_core_web_lg")
    ents=[]

    doc = nlp(x)

    for ent in doc.ents:
        if ent.label_ == 'PERSON' or ent.label_ == 'ORG':
            ents.append(ent.label_)

    return pd.Series(ents)


get_entities_udf = F.pandas_udf(lambda x: __get_entities(x), "array<string>", F.PandasUDFType.SCALAR)

Error message:

TypeError: Argument 'string' has incorrect type (expected str, got Series)

Sample Spark DataFrame:

df = spark.createDataFrame([
  ['John Doe'],
  ['Jane Doe'],
  ['Microsoft Corporation'],
  ['Apple Inc.'],
]).toDF("name",)

New column:

df_new = df.withColumn('entity',get_entities_udf('name'))

tourist · Accepted Answer

A scalar Pandas UDF receives its input as a pd.Series (a batch of values), not a single value, so nlp(x) ends up being called on a whole Series.
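The type mismatch can be reproduced without Spark or spaCy. The sketch below is only an illustration: label_of is a hypothetical stand-in for nlp() that, like spaCy, accepts only a str, and get_entities_batch is a hypothetical name for a Series-in, Series-out function of the kind a scalar Pandas UDF expects.

```python
import pandas as pd

def label_of(text):
    # Hypothetical stand-in for nlp(): it accepts a single str and nothing else.
    if not isinstance(text, str):
        raise TypeError(f"expected str, got {type(text).__name__}")
    return ["ORG"] if text.endswith(("Inc.", "Corporation")) else ["PERSON"]

def get_entities_batch(names: pd.Series) -> pd.Series:
    # Series in, Series of the same length out: one result per input row.
    return names.apply(label_of)

batch = pd.Series(["John Doe", "Apple Inc."])

# Element-wise application works.
result = get_entities_batch(batch)

# Passing the whole Series to the per-string function fails,
# which is the same kind of mistake as in the question.
try:
    label_of(batch)
    error = None
except TypeError as exc:
    error = str(exc)
```

Here result holds one list of labels per row, while error captures the complaint about receiving a Series instead of a str.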

I got it working by refactoring the code a bit. Note the x.apply call: it is pandas-specific and applies the function to each element of the pd.Series.

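from pyspark.sql.functions import pandas_udf, PandasUDFType

def entities(x):
    # x is a single string here, so nlp(x) gets the type it expects
    global nlp
    import spacy
    nlp = spacy.load("en_core_web_lg")
    ents = []

    doc = nlp(x)

    for ent in doc.ents:
        if ent.label_ == 'PERSON' or ent.label_ == 'ORG':
            ents.append(ent.label_)
    return ents


def __get_entities(x):
    # x is a pd.Series; run entities on each element
    return x.apply(entities)

get_entities_udf = pandas_udf(lambda x: __get_entities(x), "array<string>", PandasUDFType.SCALAR)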

df_new = df.withColumn('entity',get_entities_udf('name'))

df_new.show()

+--------------------+--------+
|                name|  entity|
+--------------------+--------+
|            John Doe|[PERSON]|
|            Jane Doe|[PERSON]|
|Microsoft Corpora...|   [ORG]|
|          Apple Inc.|   [ORG]|
+--------------------+--------+
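
The version above still calls spacy.load for every row, which is expensive. One possible variation, sketched here on the assumption that en_core_web_lg is installed on every executor, is to load the model lazily once per Python worker and push the whole batch through nlp.pipe. The names get_nlp, entities_for_batch and WANTED are hypothetical and not part of the code above.

```python
import pandas as pd

WANTED = {"PERSON", "ORG"}  # entity labels to keep, as in the question
_NLP = None                 # per-process cache for the loaded model

def get_nlp():
    # Load the model on first use only; later calls reuse the cached object.
    # Assumes spaCy and en_core_web_lg are available on the worker.
    global _NLP
    if _NLP is None:
        import spacy
        _NLP = spacy.load("en_core_web_lg")
    return _NLP

def entities_for_batch(names: pd.Series, nlp=None) -> pd.Series:
    # nlp can be passed in explicitly (useful for testing); otherwise use the cache.
    if nlp is None:
        nlp = get_nlp()
    # nlp.pipe processes an iterable of texts and yields one Doc per text.
    docs = nlp.pipe(names.fillna("").tolist())
    labels = [[ent.label_ for ent in doc.ents if ent.label_ in WANTED] for doc in docs]
    # Keep the original index so the output lines up with the input rows.
    return pd.Series(labels, index=names.index)
```

It could then be registered the same way as before, for example pandas_udf(lambda x: entities_for_batch(x), "array<string>", PandasUDFType.SCALAR). Whether this is faster in practice depends on batch size and cluster setup, so it is worth measuring.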
