Igor K.

Reputation: 945

Spark DataFrame transformation - remove words of less than 3 letters

I am using RegexTokenizer and StopWordsRemover to tokenize my data set for model building. At the same time I want to remove words of less than 3 letters, as well as the tokens http and https. How can I do that? Here is my code:

    val trainDF = sqlContext.read.jdbc(url, table, prop)

    // Tokenize
    val tokenizer = new RegexTokenizer()
      .setGaps(false)
      .setPattern("\\p{L}+")
      .setInputCol("posttext")
      .setOutputCol("words")
    val tokenizedDF = tokenizer.transform(trainDF)

    // Remove stop words
    val filterer = new StopWordsRemover()
      .setCaseSensitive(false)
      .setInputCol("words")
      .setOutputCol("tokens")
    val filteredDF = filterer.transform(tokenizedDF)

Upvotes: 2

Views: 863

Answers (1)

Igor K.

Reputation: 945

Found setMinTokenLength(3) in RegexTokenizer; it drops tokens shorter than the specified minimum length.
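A minimal sketch of how the original pipeline could use it; appending "http" and "https" to the default English stop-word list is one possible way to drop those tokens as well (that part is an assumption, not covered by the answer above):

    import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover}

    // Tokenize, keeping only tokens of at least 3 characters
    val tokenizer = new RegexTokenizer()
      .setGaps(false)
      .setPattern("\\p{L}+")
      .setMinTokenLength(3)   // discards tokens shorter than 3 characters
      .setInputCol("posttext")
      .setOutputCol("words")
    val tokenizedDF = tokenizer.transform(trainDF)

    // Remove stop words; "http" and "https" are appended as extra stop words
    // (an assumption -- the answer itself only mentions setMinTokenLength)
    val filterer = new StopWordsRemover()
      .setCaseSensitive(false)
      .setStopWords(StopWordsRemover.loadDefaultStopWords("english") ++ Array("http", "https"))
      .setInputCol("words")
      .setOutputCol("tokens")
    val filteredDF = filterer.transform(tokenizedDF)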

Upvotes: 1
