I am using `RegexTokenizer` and `StopWordsRemover` to tokenize my data set for model building. At the same time, I want to remove words shorter than 3 letters, as well as `http` and `https`. How can I do that? Here is my code:
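```scala
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover}

// Load the training data over JDBC (url, table and prop are defined elsewhere)
val trainDF = sqlContext.read.jdbc(url, table, prop)

// Tokenize: with gaps = false the pattern matches the tokens themselves (runs of letters)
val tokenizer = new RegexTokenizer()
  .setGaps(false)
  .setPattern("\\p{L}+")
  .setInputCol("posttext")
  .setOutputCol("words")
val tokenizedDF = tokenizer.transform(trainDF)

// Remove stop words, ignoring case
val filterer = new StopWordsRemover()
  .setCaseSensitive(false)
  .setInputCol("words")
  .setOutputCol("tokens")
val filteredDF = filterer.transform(tokenizedDF)
```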
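For reference, here is a minimal sketch of one way this might be done. It assumes a Spark version in which `RegexTokenizer` exposes `setMinTokenLength` and `StopWordsRemover` exposes `getStopWords`/`setStopWords`. The names `lengthFilteringTokenizer` and `customRemover` are hypothetical and used only for illustration. The idea is to let the tokenizer drop short tokens and to append `http` and `https` to the default stop word list.

```scala
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover}

// Same tokenizer settings as in the question, plus a minimum token length of 3,
// so tokens with fewer than 3 characters are discarded during tokenization.
val lengthFilteringTokenizer = new RegexTokenizer()
  .setGaps(false)
  .setPattern("\\p{L}+")
  .setMinTokenLength(3)
  .setInputCol("posttext")
  .setOutputCol("words")

// Same remover settings as in the question.
val customRemover = new StopWordsRemover()
  .setCaseSensitive(false)
  .setInputCol("words")
  .setOutputCol("tokens")

// Extend the default stop word list with the two extra words (assumed addition).
customRemover.setStopWords(customRemover.getStopWords ++ Array("http", "https"))
```

These two stages could then be applied as in the question, for example `customRemover.transform(lengthFilteringTokenizer.transform(trainDF))`. Note that with this pattern a URL such as `http://example.com` is split into `http`, `example` and `com`, so only the `http` part would be removed; the remaining pieces stay in the output.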
Upvotes: 2
Views: 863