LP-AC-J

Reputation: 33

Regular Expression - Spark scala DataSet

I want to get tokens from tweets.

To achieve this I use Spark 2.0's RegexTokenizer in Scala. My problem is getting the pattern right.

I have these tweets:

0) "#oscars https://w.r/123f5"
1) "#oscars! go leo!"
2) "#oscars: did it!"

And I want to have the tokens:

0) (#oscars, https://w.r/123f5)
1) (#oscars, go, leo)
2) (#oscars, did, it)

That is, if the tweet contains the word "#oscars.", "#oscars!", or "#oscars:", I want the token to be "#oscars". At the same time, if the tweet contains the word "leo!" or "it", I want the token to be "leo" or "it".

I don't want to break up URLs!

I tried:

import org.apache.spark.ml.feature.RegexTokenizer

val sentenceDataFrame = spark.createDataFrame(Seq(
  (0, "#oscars https://w.r/123f5"),
  (1, "#oscars! go leo!"),
  (2, "#oscars: he did it! ")
)).toDF("label", "sentence")

val regexTokenizer = new RegexTokenizer()
  .setGaps(false)
  .setPattern("\\p{L}+")
  .setInputCol("sentence")  // the column created above is "sentence"
  .setOutputCol("words")

val regexTokenized = regexTokenizer.transform(sentenceDataFrame)

But it doesn't work well. I get:

(oscars, https, w, r, 123f5)
(oscars, go, leo)
(oscars, he, did, it)

Upvotes: 0

Views: 755

Answers (1)

Wiktor Stribiżew

Reputation: 626689

Inside setPattern, use

"(?U)\\bhttps?://\\S*|#?\\b\\w+\\b


Details: the regex matches URLs with \\bhttps?://\\S*, and hashtags or plain words with #?\\b\\w+\\b.

  • (?U) - makes \b and \w Unicode-aware
  • \\b - a leading word boundary
  • https? - http or https
  • :// - the literal character sequence ://
  • \\S* - zero or more non-whitespace characters
  • | - or
  • #? - one or zero # characters
  • \\b\\w+\\b - a whole word: one or more word characters (Unicode-aware) between word boundaries
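
For completeness, here is a minimal sketch of the tokenizer with this pattern plugged in, reusing the question's DataFrame and column names. It assumes an active SparkSession named spark, as in the question's snippet; treat it as untested.

import org.apache.spark.ml.feature.RegexTokenizer

// Sample tweets from the question.
val sentenceDataFrame = spark.createDataFrame(Seq(
  (0, "#oscars https://w.r/123f5"),
  (1, "#oscars! go leo!"),
  (2, "#oscars: he did it! ")
)).toDF("label", "sentence")

// gaps = false makes the tokenizer collect regex matches instead of
// splitting on the pattern, so URLs stay whole and hashtags keep their '#'.
val regexTokenizer = new RegexTokenizer()
  .setGaps(false)
  .setPattern("(?U)\\bhttps?://\\S*|#?\\b\\w+\\b")
  .setInputCol("sentence")
  .setOutputCol("words")

val regexTokenized = regexTokenizer.transform(sentenceDataFrame)
regexTokenized.select("sentence", "words").show(truncate = false)

With this pattern, the URL in the first row should surface as a single token and the hashtags should keep their leading #.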

Upvotes: 1
