Spark: treeAggregate at IDF is taking ages

Question

I am using Spark 1.6.1, and I have a DataFrame as follows:

+-----------+---------------------+-------------------------------+
|ID         |dateTime             |title                          |
+-----------+---------------------+-------------------------------+
|809907895  |2017-01-21 23:00:01.0|                               |
|1889481973 |2017-01-21 23:00:06.0|Man charged with murder of ... |
|979847722  |2017-01-21 23:00:09.0|Munster cruise to home Cham... |
|18894819734|2017-01-21 23:00:11.0|Man charged with murder of ... |
|17508023137|2017-01-21 23:00:15.0|Emily Ratajkowski hits the ... |
|10321187627|2017-01-21 23:00:17.0|Gardai urge public to remai... |
|979847722  |2017-01-21 23:00:19.0|Sport                          |
|19338946129|2017-01-21 23:00:33.0|                               |
|979847722  |2017-01-21 23:00:35.0|Rassie Erasmus reveals the ... |
|1836742863 |2017-01-21 23:00:49.0|NAMA sold flats which could... |
+-----------+---------------------+-------------------------------+

I am doing the following operation:

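import org.apache.spark.sql.functions.{collect_list, concat_ws}

// Concatenate all titles that share an ID into one space-separated string
val aggDF = df.groupBy($"ID")
              .agg(concat_ws(" ", collect_list($"title")) as "titlesText")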

Then, on the aggDF DataFrame, I fit a pipeline that extracts TF-IDF features from the titlesText column (by applying Tokenizer, StopWordsRemover, HashingTF, then IDF).

When I call pipeline.fit(aggDF), the job reaches the stage treeAggregate at IDF.scala:54 (I can see it in the UI) and then gets stuck there: no progress, no error, and no helpful information in the UI, even after waiting a very long time.

Here is an example of what I see in the UI (nothing changes for a very long time):

What are the possible reasons for this?
How can I track and debug such problems?
Is there any other way to extract the same features?

BenFradet · Accepted Answer

Did you specify a maximum number of features in your HashingTF?

The amount of data the IDF has to deal with is proportional to the number of features produced by HashingTF, and for very large feature counts it will most likely have to spill to disk, which wastes time.
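As a rough sketch of what capping the feature count could look like with the spark.ml API: the intermediate column names ("words", "filtered", "rawFeatures", "features") and the value 1 << 14 are assumptions chosen for illustration, not taken from the question, and the right number of features depends on your vocabulary size and how many hash collisions you can tolerate.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, IDF, StopWordsRemover, Tokenizer}

// Split the concatenated titles into lowercase tokens.
// "words" is an assumed intermediate column name.
val tokenizer = new Tokenizer()
  .setInputCol("titlesText")
  .setOutputCol("words")

// Drop common stop words ("filtered" is an assumed column name).
val remover = new StopWordsRemover()
  .setInputCol("words")
  .setOutputCol("filtered")

// Cap the hashed feature space explicitly.
// 1 << 14 (16384) is an arbitrary starting point to tune, not a recommendation.
val hashingTF = new HashingTF()
  .setInputCol("filtered")
  .setOutputCol("rawFeatures")
  .setNumFeatures(1 << 14)

// IDF aggregates one document-frequency vector of size numFeatures,
// so a smaller feature space means less data in the treeAggregate stage.
val idf = new IDF()
  .setInputCol("rawFeatures")
  .setOutputCol("features")

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, remover, hashingTF, idf))

// aggDF is the aggregated DataFrame from the question:
// val model = pipeline.fit(aggDF)
```

If the stage becomes noticeably faster with a smaller value, you can increase it step by step until you find an acceptable trade-off between speed and hash collisions.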
