Bennimi

Reputation: 516

Spark equivalent to Keras Tokenizer?

So far, I pre-process text data using NumPy and built-in functions (such as the Keras tokenizer class, tf.keras.preprocessing.text.Tokenizer: https://keras.io/api/preprocessing/text/).

And this is where I got stuck: since I am trying to scale up my model and data set, I am experimenting with Spark and Spark NLP (https://nlp.johnsnowlabs.com/docs/en/annotators#tokenizer)... however, I haven't yet found a tokenizer that works the same way. The fitted tokenizer must later be available to transform validation/new data.

My output should represent each token as a unique integer value (starting from 1), something like:

[ 10,... ,  64,  555]
[ 1,... , 264,   39]
[ 12,..., 1158, 1770]
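For reference, this is roughly how I do it with Keras today (the example texts and the padding length are just placeholders):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["okay reason still not get background",      # placeholder data
         "picture expand fill whole excited"]

tokenizer = Tokenizer()                    # word indices start at 1; 0 is reserved for padding
tokenizer.fit_on_texts(texts)              # builds the word -> integer mapping
sequences = tokenizer.texts_to_sequences(texts)               # e.g. [[1, 2, 3, ...], [...]]
padded = pad_sequences(sequences, maxlen=10, padding="post")  # zero-pad to equal length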

Currently, I was able to use the Spark NLP tokenizer to obtain tokenized words:

[okay,..., reason, still, not, get, background] 
[picture,..., expand, fill, whole, excited]                     
[not, worry,..., happy, well, depend, on, situation]
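The pipeline that produces these tokens looks roughly like this (the column names are my own choices, and the cleaning steps are left out for brevity):

import sparknlp
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer
from pyspark.ml import Pipeline

spark = sparknlp.start()

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
finisher = Finisher().setInputCols(["token"]).setOutputCols(["tokens"])  # plain array<string> column

pipeline = Pipeline(stages=[document_assembler, tokenizer, finisher])
tokens_df = pipeline.fit(df).transform(df)  # df has a string column "text"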

Does anyone have a solution which doesn't require copying the data out of the Spark environment?

UPDATE:

I created two CSVs to clarify my current issue. The first file was created through a pre-processing pipeline: 1. cleaned_delim_text

After that, the delimited words should be "translated" to integer values and the sequences should be padded with zeros to a common length: 2. cleaned_tok_text
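In plain Python, the step I am after would look roughly like the toy example below (vocabulary built on the fly, padding length chosen arbitrarily), but I would like to do the same without collecting the rows out of Spark:

# toy illustration of the desired word -> integer mapping plus zero-padding
rows = [["okay", "reason", "still", "not", "get", "background"],
        ["picture", "expand", "fill", "whole", "excited"]]

vocab = {}                                   # 0 is reserved for padding
for row in rows:
    for word in row:
        vocab.setdefault(word, len(vocab) + 1)

max_len = 8                                  # arbitrary target length
padded = [[vocab[w] for w in row] + [0] * (max_len - len(row)) for row in rows]
# -> [[1, 2, 3, 4, 5, 6, 0, 0], [7, 8, 9, 10, 11, 0, 0, 0]]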

Upvotes: 1

Views: 546

Answers (1)

Som

Reputation: 6338

Please try the combination below:

1. Use a tokenizer to convert the statements into words, and then

2. Use word2vec to compute distributed vector representations of those words, roughly as sketched below.
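A minimal sketch with Spark ML (train_df, validation_df and the column names are placeholders):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, Word2Vec

# 1. split each statement into words
tokenizer = Tokenizer(inputCol="text", outputCol="words")

# 2. learn distributed vector representations of those words
word2vec = Word2Vec(vectorSize=100, minCount=1, inputCol="words", outputCol="features")

pipeline = Pipeline(stages=[tokenizer, word2vec])
model = pipeline.fit(train_df)            # the fitted model can be reused
result = model.transform(validation_df)   # applies the learned vectors to new data

Note that the transform step yields one fixed-length dense vector per row (the average of its word vectors), which a downstream model can consume directly.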

Upvotes: 0
