Reputation: 381
I have a file that contains a lot of documents. How can I skip the lines that have two or fewer words (length <= 2) and then process only the lines with more than two words (length > 2)? For example:
fit perfectly clie .
purchased not
instructions install helpful . improvement battery life not hoped .
product.
cable good not work . cable extremely hot not recognize devices .
After skipping lines:
fit perfectly clie .
instructions install helpful . improvement battery life not hoped .
cable good not work . cable extremely hot not recognize devices .
My code:
val Bi = text.map(sen=> sen.split(" ").sliding(2))
Is there any solution for this?
Upvotes: 4
Views: 983
Reputation: 67115
How about flatMap? Returning None for the short lines drops them from the result:
text.flatMap { line =>
  val tokenized = line.split(" ")
  if (tokenized.length > 2) Some(tokenized.sliding(2))
  else None
}
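To see the effect without a cluster, here is a minimal sketch that runs the same idea on a plain Scala Seq instead of an RDD. The local Seq and the names FlatMapSkipDemo, lines and bigrams are assumptions for illustration only. flatMap unwraps each Some into one element and drops every None, so lines with two or fewer tokens disappear. The windows are converted to lists because sliding returns a single-pass Iterator.

```scala
object FlatMapSkipDemo {
  // Sample lines from the question; a local Seq stands in for the RDD (assumption).
  val lines: Seq[String] = Seq(
    "fit perfectly clie .",
    "purchased not",
    "instructions install helpful . improvement battery life not hoped .",
    "product.",
    "cable good not work . cable extremely hot not recognize devices ."
  )

  // Keep only lines with more than two tokens and emit their bigrams as lists.
  // Some(...) contributes one element to the result; None contributes nothing.
  val bigrams: Seq[List[List[String]]] = lines.flatMap { line =>
    val tokenized = line.split(" ")
    if (tokenized.length > 2) Some(tokenized.sliding(2).map(_.toList).toList)
    else None
  }

  def main(args: Array[String]): Unit =
    bigrams.foreach(println)
}
```

Only the three long lines survive; "purchased not" and "product." produce None and are skipped.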
Upvotes: 2
Reputation: 4396
I'd use filter:
> val text = sc.parallelize(Array("fit perfectly clie .",
"purchased not",
"instructions install helpful . improvement battery life not hoped .",
"product.",
"cable good not work . cable extremely hot not recognize devices ."))
> val result = text.filter{_.split(" ").size > 2}
> result.collect.foreach{println}
fit perfectly clie .
instructions install helpful . improvement battery life not hoped .
cable good not work . cable extremely hot not recognize devices .
From here, you can work with your data in its original form (i.e. not tokenized) after filtering. If you'd prefer to tokenize first, then you can do this:
text.map{_.split(" ")}.filter{_.size > 2}
So, finally, to tokenize, then filter, and then find bigrams with sliding, you'd use:
text.map{_.split(" ")}.filter{_.size > 2}.map{_.sliding(2)}
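One thing to keep in mind is that sliding(2) returns an Iterator, which can be traversed only once, so you may want to materialize the bigrams before using them further. Here is a minimal sketch in plain Scala; bigramsOf and BigramDemo are hypothetical names, not part of Spark or of the code above. It joins each two-token window into a string and returns an empty list for lines with two or fewer tokens.

```scala
object BigramDemo {
  // Hypothetical helper: bigrams of a line as strings, or an empty list
  // when the line has two or fewer space-separated tokens.
  def bigramsOf(line: String): List[String] = {
    val tokens = line.split(" ")
    if (tokens.length > 2) tokens.sliding(2).map(_.mkString(" ")).toList
    else Nil
  }

  def main(args: Array[String]): Unit = {
    println(bigramsOf("fit perfectly clie ."))
    println(bigramsOf("purchased not"))
  }
}
```

Assuming text is an RDD[String] as in the question, something like text.flatMap(BigramDemo.bigramsOf) should then give a flat collection of bigram strings, with the short lines contributing nothing.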
Upvotes: 2