So far, I have pre-processed text data using NumPy and built-in functions (such as the Keras tokenizer class, tf.keras.preprocessing.text.Tokenizer: https://keras.io/api/preprocessing/text/).
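For reference, my current Keras workflow looks roughly like this (a minimal sketch of the documented Tokenizer API; train_texts is a placeholder for my training corpus):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

train_texts = ["okay reason still not get background",
               "picture expand fill whole excited"]  # placeholder data

# Fit the tokenizer on the training corpus; token indices start at 1,
# and 0 is implicitly reserved for padding.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_texts)

# The fitted tokenizer can later transform validation/new data.
sequences = tokenizer.texts_to_sequences(train_texts)
padded = pad_sequences(sequences, maxlen=10, padding="post")
```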
And this is where I got stuck: since I am trying to scale up my model and data set, I am experimenting with Spark and Spark NLP (https://nlp.johnsnowlabs.com/docs/en/annotators#tokenizer). However, I could not yet find a similarly working tokenizer. The fitted tokenizer must later be available to transform validation/new data.
My output should represent each token as a unique integer value (starting from 1), something like:
[ 10,... , 64, 555]
[ 1,... , 264, 39]
[ 12,..., 1158, 1770]
Currently, I am able to use the Spark NLP tokenizer to obtain tokenized words (a pipeline sketch follows the examples):
[okay,..., reason, still, not, get, background]
[picture,..., expand, fill, whole, excited]
[not, worry,..., happy, well, depend, on, situation]
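The token output above comes from a pipeline along these lines (a sketch; the column names and the input DataFrame df are placeholders):

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer

# df is assumed to have a string column "text".
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

spark_tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# The Finisher converts annotation structs back to plain string arrays.
finisher = Finisher() \
    .setInputCols(["token"]) \
    .setOutputCols(["tokens"])

pipeline = Pipeline(stages=[document_assembler, spark_tokenizer, finisher])
tokens_df = pipeline.fit(df).transform(df)
```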
Does anyone have a solution that doesn't require copying the data out of the Spark environment?
UPDATE:
I created two CSVs to clarify my current issue. The first file was created through a pre-processing pipeline: 1. cleaned_delim_text
After that, the delimited words should be "translated" to integer values, and the sequences should be padded with zeros to the same length: 2. cleaned_tok_text
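In other words, I am after something like the following sketch (not a working solution: it assumes the tokens column from the pipeline above, an active SparkSession named spark, uses pyspark.ml.feature.CountVectorizer only to derive a frequency-ordered vocabulary, and picks an arbitrary max length of 50):

```python
from pyspark.ml.feature import CountVectorizer
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

# Build a frequency-ordered vocabulary from the token arrays;
# index 0 is reserved for padding, so real tokens start at 1.
cv_model = CountVectorizer(inputCol="tokens", outputCol="tf").fit(tokens_df)
vocab = {word: i + 1 for i, word in enumerate(cv_model.vocabulary)}
vocab_bc = spark.sparkContext.broadcast(vocab)

MAX_LEN = 50  # arbitrary fixed sequence length

@F.udf(ArrayType(IntegerType()))
def encode_and_pad(tokens):
    # Unknown tokens (not seen during fitting) fall back to 0 here.
    ids = [vocab_bc.value.get(t, 0) for t in tokens][:MAX_LEN]
    return ids + [0] * (MAX_LEN - len(ids))

encoded_df = tokens_df.withColumn("token_ids", encode_and_pad("tokens"))
```

The broadcast vocabulary would play the role of the fitted Keras tokenizer and could be reused on validation/new data without leaving Spark.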