Reputation: 13
I have a dataframe that looks like this:
|--------------------------|
| text |
|--------------------------|
| ["hello", "world"] |
|--------------------------|
| ["foo"] |
|--------------------------|
| ["world", "bar"] |
|--------------------------|
| ["kung", "foo", "world"]|
|--------------------------|
| ... |
And I need to count the occurrences of each word and sort it from greatest to least. I do not know all the words it may contain. How can I manipulate this to do that?
It would look something like this when done
|-----------------|-----------------|
| word | count |
|-----------------|-----------------|
| hello | 1 |
|-----------------|-----------------|
| world | 3 |
|-----------------|-----------------|
| bar | 1 |
|-----------------|-----------------|
| foo | 2 |
|-----------------|-----------------|
| ... | ... |
I've tried flattening the dataframe, maping over it, etc. etc. and I have super stuck. Any help would be appreciated!
Upvotes: 1
Views: 72
Reputation: 679
Presuming that has been stored as a list of string:
df.select(explode($"text").as("word")).groupBy("word").count.orderBy(desc("count"))
Upvotes: 4