Esmaeil zahedi

Reputation: 381

Skipping some lines based on their length in Scala and Spark

I have a file that contains a lot of documents. How can I skip the lines that have length <= 2 (counted in words) and then process only the lines with length > 2? For example:

fit perfectly clie .
purchased not
instructions install helpful . improvement battery life not hoped .
product.
cable good not work . cable extremely hot not recognize devices .

After skipping those lines:

fit perfectly clie .
instructions install helpful . improvement battery life not hoped .
cable good not work . cable extremely hot not recognize devices .

My code:

val Bi = text.map(sen => sen.split(" ").sliding(2))

Is there any solution for this?

Upvotes: 4

Views: 983

Answers (2)

Justin Pihony

Reputation: 67115

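How about flatMap? Returning Some for the lines you want to keep and None for the rest drops the short lines in the same pass: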

text.flatMap(line=>{
  val tokenized = line.split(" ")
  if(tokenized.length > 2) Some(tokenized.sliding(2))
  else None
})
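To see what flatMap does with the Option values, here is a minimal sketch that uses a plain Scala Seq in place of the RDD (an assumption, so that it runs without Spark); the same lambda should be usable with RDD.flatMap. The names FlatMapSkip and bigramsIfLong are made up for illustration, and the bigrams are joined into strings and materialized with toList because sliding returns a one-shot Iterator.

```scala
object FlatMapSkip {
  // Hypothetical helper: Some(bigrams) for lines with more than two tokens,
  // None for shorter lines.
  def bigramsIfLong(line: String): Option[List[String]] = {
    val tokenized = line.split(" ")
    if (tokenized.length > 2)
      // sliding(2) yields an Iterator; join each pair and materialize it
      Some(tokenized.sliding(2).map(_.mkString(" ")).toList)
    else None
  }

  // A plain Seq stands in for the RDD (assumption for this sketch).
  val lines = Seq("fit perfectly clie .", "purchased not", "product.")

  // flatMap discards every None and unwraps every Some.
  val kept: Seq[List[String]] = lines.flatMap(line => bigramsIfLong(line))

  def main(args: Array[String]): Unit = kept.foreach(println)
}
```

Only the first sample line has more than two tokens, so it is the only one that survives.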

Upvotes: 2

mattsilver

Reputation: 4396

I'd use filter:

> val text = sc.parallelize(Array("fit perfectly clie .",
                                "purchased not",
                                "instructions install helpful . improvement battery life not hoped .",
                                "product.",
                                "cable good not work . cable extremely hot not recognize devices ."))

> val result = text.filter{_.split(" ").size > 2}
> result.collect.foreach{println}

fit perfectly clie .
instructions install helpful . improvement battery life not hoped .
cable good not work . cable extremely hot not recognize devices .

After filtering, you can keep working with the data in its original (untokenized) form. If you'd prefer to tokenize first, you can do this:

text.map{_.split(" ")}.filter{_.size > 2}

So, finally, to tokenize, then filter, and then find bigrams with sliding, you'd use:

text.map{_.split(" ")}.filter{_.size > 2}.map{_.sliding(2)}
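As a rough, Spark-free sketch of that tokenize, filter, sliding chain, the following runs the same three steps on a plain Seq holding the sample lines from the question (using a Seq instead of an RDD is an assumption made so the example is self-contained). The object name BigramPipeline is hypothetical, and each bigram is turned into a tuple and collected into a List so the result can be inspected more than once.

```scala
object BigramPipeline {
  // Sample lines from the question; a Seq stands in for the RDD (assumption).
  val text = Seq(
    "fit perfectly clie .",
    "purchased not",
    "instructions install helpful . improvement battery life not hoped .",
    "product.",
    "cable good not work . cable extremely hot not recognize devices ."
  )

  // Tokenize, keep lines with more than two tokens, then pair adjacent tokens.
  val bigrams: Seq[List[(String, String)]] =
    text
      .map(_.split(" "))
      .filter(_.length > 2)
      .map(_.sliding(2).collect { case Array(a, b) => (a, b) }.toList)

  def main(args: Array[String]): Unit =
    bigrams.foreach(b => println(b.mkString(", ")))
}
```

A line with n tokens produces n - 1 bigrams, so the three surviving lines give lists of different lengths.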

Upvotes: 2
