Rami
Rami

Reputation: 8314

Spark: TreeAgregate at IDF is taking ages

I am using Spark 1.6.1, and I have a DataFrame as follow:

+-----------+---------------------+-------------------------------+
|ID         |dateTime             |title                          |
+-----------+---------------------+-------------------------------+
|809907895  |2017-01-21 23:00:01.0|                               |
|1889481973 |2017-01-21 23:00:06.0|Man charged with murder of ... |
|979847722  |2017-01-21 23:00:09.0|Munster cruise to home Cham... |
|18894819734|2017-01-21 23:00:11.0|Man charged with murder of ... |
|17508023137|2017-01-21 23:00:15.0|Emily Ratajkowski hits the ... |
|10321187627|2017-01-21 23:00:17.0|Gardai urge public to remai... |
|979847722  |2017-01-21 23:00:19.0|Sport                          |
|19338946129|2017-01-21 23:00:33.0|                               |
|979847722  |2017-01-21 23:00:35.0|Rassie Erasmus reveals the ... |
|1836742863 |2017-01-21 23:00:49.0|NAMA sold flats which could... |
+-----------+---------------------+-------------------------------+

I am doing the following operation:

val aggDF = df.groupBy($"ID")
              .agg(concat_ws(" ", collect_list($"title")) as "titlesText")

Then on aggDF DataFrame, I am fitting a pipeline that extracts TFIDF feature from titlesText column (by applying tokenizer, stopWordRemover, HashingTF then IDF).

When I call the pipline.fit(aggDF) the code reaches a stage treeAggregate at IDF.scala:54 (I can see that on the UI), and then it gets stuck there, without any progress, without any error, I wait very long time without any progress and no helpful information on UI.

Here is an example of what I see in the UI (nothing changes for very long time):

enter image description here

Upvotes: 3

Views: 378

Answers (1)

BenFradet
BenFradet

Reputation: 453

Did you specify a maximum number of features in your HashingTF?

Because the amount of data the IDF has to deal with will be proportional to the number of features produced by HashingTF and it will most likely have to spill on disk for very large amounts which wastes time.

Upvotes: 2

Related Questions