Reputation: 1815
I would like to apply spacy nlp on my pyspark dataframe. I am using map partitions concept on my pyspark dataframe to apply python logic that consists of spacy.
Spark version: 3.2.0
Below is the sample pyspark dataframe:
token id
0 [This, is, java, world] 0
1 [This, is, spark, world] 1
Below is the code where I am passing a data to the python function and returning a dictionary
def get_spacy_doc_parallel_map_func(partitionData):
import spacy
from tqdm import tqdm
import pandas as pd
nlp=spacy.load('en_core_web_sm')
nlp.tokenizer=nlp.tokenizer.tokens_from_list
from spacy.tokens import Doc
if not Doc.has_extension("text_id"):
Doc.set_extension("text_id", default=None)
columnNames = broadcasted_source_columns.value
partitionData = pd.DataFrame(partitionData, columns=columnNames)
'''
This function creates a mapper of review id and review spacy.doc.Doc type
'''
def get_spacy_doc_parallel(data):
text_tuples = []
dodo = data[['token','id']].drop_duplicates(['id'])
for _,i in dodo.iterrows():
text_tuples.append((i['token'],{'text_id':i['id']}))
doc_tuples = nlp.pipe(text_tuples, as_tuples=True,n_process=8,disable=['tagger','parser','ner'])
docsf = []
for doc, context in tqdm(doc_tuples,total=len(text_tuples)):
doc._.text_id = context["text_id"]
docsf.append(doc)
vv={}
for doc in docsf:
vv[doc._.text_id] = doc
return vv
id_spacy_doc_mapper = get_spacy_doc_parallel(partitionData)
partitionData['spacy_doc'] = id_spacy_doc_mapper
partitionData.reset_index(inplace=True)
partitionData_dict = partitionData.to_dict("index")
result = []
for key in partitionData_dict:
result.append(partitionData_dict[key])
return iter(result)
resultDF_tokens = data.rdd.mapPartitions(get_spacy_doc_parallel_map_func)
df = spark.createDataFrame(resultDF_tokens)
The issue I am getting here is that the length of dictionary values does not match with length of the dataframe. Below is the error
Error:
ValueError: Length of values (954) does not match length of index (1438)
Output:
{0: This is java word, 1: This is spark world }
The above dictionary is assigned as a column to the python dataframe after applying spacy (partitionData['spacy_doc'] = id_spacy_doc_mapper)
Upvotes: 0
Views: 560
Reputation: 687
I don't have enough experience with spacy to figure out what the intent is here and I'm very confused by the input and output because the input looks tokenized, but I'll take a stab at it and list my assumptions and the problems I ran into.
First off, I think Fugue can make this type of transformation significantly easier. It will use the underlying Spark UDF, pandas_udf, mapPartition, or mapInPandas depending what parameters you supply. The point is that Fugue will handle that boilerplate. For you, it seems you have Pandas DataFrame in (that part is clear), but the output is less clear. I think you are passing some iterable of list to make Spark happy, but I think Pandas DataFrame output might be simpler. I'm guessing here.
So first we set some stuff up. This is all native Python. The tokens_from_list
portion was removed from the original because it seems like the latest versions deprecated it. Shouldn't matter for the example.
import pandas as pd
from typing import List, Any, Iterable, Dict
import spacy
nlp=spacy.load('en_core_web_sm')
from spacy.tokens import Doc
if not Doc.has_extension("text_id"):
Doc.set_extension("text_id", default=None)
data = pd.DataFrame({"token": ["This is java world", "This is spark world"],
"id": [0, 1]})
and then you define your logic for one partition. I am assuming Pandas DataFrame in and Pandas DataFrame out, but Fugue can actually support many other types such as Pandas DataFrame in and Iterable[List] out. The important thing is just you annotate your logic so Fugue knows how to handle it. Note this code is still native Python. I edited the logic a bit to just get it to work. Again, I am pretty sure I butchered the logic somewhere, but the example can still work. I really couldn't find a way for the original to work (because I don't know spacy enough)
def get_spacy_doc(data: pd.DataFrame) -> pd.DataFrame:
text_tuples = []
dodo = data[['token','id']].drop_duplicates(['id'])
for _,i in dodo.iterrows():
text_tuples.append((i['token'],{'text_id':i['id']}))
doc_tuples = nlp.pipe(text_tuples, as_tuples=True,n_process=1,disable=['tagger','parser','ner'])
docsf = []
for doc, context in doc_tuples:
doc._.text_id = context["text_id"]
docsf.append(doc)
vv={}
for doc in docsf:
vv[doc._.text_id] = doc
id_spacy_doc_mapper = vv.copy()
data['space_doc'] = id_spacy_doc_mapper
return data
Now to bring this to Spark, all you have to do with Fugue is:
from fugue import transform
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(data)
sdf = transform(sdf, get_spacy_doc, schema="*, space_doc:int", engine=spark)
sdf.show()
and the Fugue transform will handle it. This is to run on Spark, but you can also run on Pandas if you don't supply an engine like this:
df = transform(data, get_spacy_doc, schema="*, space_doc:int")
This allows you to test the logic clearly without relying on Spark. It will then work when you bring it to Spark. Schema is a requirement because it is a requirement for Spark.
On partitioning
The Fugue transform can take partitioning strategy. For example:
transform(df, func, schema="*", partition={"by":"col1"}, engine=spark)
but for this case, I don't think you partition on anything so you can just use the default partitions, which is what will happen.
On parallelization
You have this code like:
nlp.pipe(text_tuples, as_tuples=True,n_process=8,disable=['tagger','parser','ner'])
This is two-stage parallelism. The first stage is Spark mapping over partitions, and the second stage is this pipe operation (I assume). Two stage parallelism is an anti-pattern in distributed computing because the first stage will already occupy all the available resources on the cluster. The parallelism should be done on the partition level. If you do something like this, it's very common to run into resource deadlocks when the 2nd stage tries to occupy resources also. I would recommend setting the n_process=1
.
On tqdm
I may be wrong on this one but I don't think tqdm plays well with Spark because I don't think you can get a real time progress bar for work that happens on worker machines. It can only work on the driver machine. The workers don't send logs to the driver for the functions it runs.
If the example is clearer, I can certainly help you port this logic to Spark. Feel free to reach out. I hope at least some bit of this was useful.
Upvotes: 1