Pablosss

Reputation: 1

PySpark: Mapping words by using Tokenizer

I am starting my journey with PySpark and I am stuck at one point. For example, I have this code (taken from https://spark.apache.org/docs/2.1.0/ml-features.html):

from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

sentenceDataFrame = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Logistic,regression,models,are,neat")
], ["id", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")

regexTokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="\\W")
# alternatively: pattern="\\w+", gaps=False

countTokens = udf(lambda words: len(words), IntegerType())

tokenized = tokenizer.transform(sentenceDataFrame)
tokenized.select("sentence", "words")\
    .withColumn("tokens", countTokens(col("words"))).show(truncate=False)

regexTokenized = regexTokenizer.transform(sentenceDataFrame)
regexTokenized.select("sentence", "words") \
    .withColumn("tokens", countTokens(col("words"))).show(truncate=False)

And I am adding something like this:

test = spark.createDataFrame([
    (0, "spark"),
    (1, "java"),
    (2, "i")
], ["id", "word"])

Output is:

+---+-----------------------------------+------------------------------------------+------+
|id |sentence                           |words                                     |tokens|
+---+-----------------------------------+------------------------------------------+------+
|0  |Hi I heard about Spark             |[hi, i, heard, about, spark]              |5     |
|1  |I wish Java could use case classes |[i, wish, java, could, use, case, classes]|7     |
|2  |Logistic,regression,models,are,neat|[logistic, regression, models, are, neat] |5     |
+---+-----------------------------------+------------------------------------------+------+

Is it possible to achieve something like this: [id from 'test', id from 'regexTokenized']?

2, 0
2, 1
1, 1
0, 1

In other words, using the list of words from 'test', can I grab the IDs from 'regexTokenized' wherever the tokenized 'words' match between the two datasets? Or should another approach be taken?

Thanks in advance for any help :)

Upvotes: 0

Views: 2306

Answers (1)

Alper t. Turker

Reputation: 35249

Use explode and join:

from pyspark.sql.functions import explode

# Explode each tokenized DataFrame into one row per (id, word),
# then join the two on the word column.
(testTokenized.alias("test")
    .select("id", explode("words").alias("word"))
    .join(
        trainTokenized.select("id", explode("words").alias("word")).alias("train"),
        "word"))

Upvotes: 1
