Reputation: 8314
I am using Spark 1.6.1, and I have a DataFrame as follow:
+-----------+---------------------+-------------------------------+
|ID |dateTime |title |
+-----------+---------------------+-------------------------------+
|809907895 |2017-01-21 23:00:01.0| |
|1889481973 |2017-01-21 23:00:06.0|Man charged with murder of ... |
|979847722 |2017-01-21 23:00:09.0|Munster cruise to home Cham... |
|18894819734|2017-01-21 23:00:11.0|Man charged with murder of ... |
|17508023137|2017-01-21 23:00:15.0|Emily Ratajkowski hits the ... |
|10321187627|2017-01-21 23:00:17.0|Gardai urge public to remai... |
|979847722 |2017-01-21 23:00:19.0|Sport |
|19338946129|2017-01-21 23:00:33.0| |
|979847722 |2017-01-21 23:00:35.0|Rassie Erasmus reveals the ... |
|1836742863 |2017-01-21 23:00:49.0|NAMA sold flats which could... |
+-----------+---------------------+-------------------------------+
I am doing the following operation:
val aggDF = df.groupBy($"ID")
.agg(concat_ws(" ", collect_list($"title")) as "titlesText")
Then on aggDF
DataFrame, I am fitting a pipeline that extracts TFIDF feature from titlesText
column (by applying tokenizer
, stopWordRemover
, HashingTF
then IDF
).
When I call the pipline.fit(aggDF)
the code reaches a stage treeAggregate at IDF.scala:54
(I can see that on the UI), and then it gets stuck there, without any progress, without any error, I wait very long time without any progress and no helpful information on UI.
Here is an example of what I see in the UI (nothing changes for very long time):
Upvotes: 3
Views: 378
Reputation: 453
Did you specify a maximum number of features in your HashingTF?
Because the amount of data the IDF has to deal with will be proportional to the number of features produced by HashingTF and it will most likely have to spill on disk for very large amounts which wastes time.
Upvotes: 2