Reputation: 2750
I have a data frame as below:
df = sqlContext.createDataFrame([("count","doc_3",3), ("count","doc_2",6), ("type","doc_1",9), ("type","doc_2",6), ("one","doc_2",10)]).withColumnRenamed("_1","word").withColumnRenamed("_2","document").withColumnRenamed("_3","occurences")
From this I need to create the matrix like below:
----------+-----+------+----+
|document |count| type |one |
+---------+-----+------|----+
|doc_1 | 0 | 9 | 0 |
|doc_2 | 6 | 6 | 10 |
|doc_3 | 3 | 0 | 0 |
So I tried
print df.crosstab("document").show()
which didn't give what I wanted .Any help is appreciated
Upvotes: 0
Views: 302
Reputation: 19550
You are looking for pivot:
df = sqlContext.createDataFrame([("count","doc_3",3), ("count","doc_2",6), ("type","doc_1",9), ("type","doc_2",6), ("one","doc_2",10)], ["word", "document","occurences"])
#document is the column you want to keep
#word is the columns which contains the rows which should become columns
#all other columns will be used as value for the new dataframe
#a function like max() is required as wants to know what it should do if
#it has two rows with the same value for document and word
df = df.groupby('document').pivot('word').max()
df = df.fillna(0)
df.show()
Output:
+--------+-----+---+----+
|document|count|one|type|
+--------+-----+---+----+
| doc_1| 0| 0| 9|
| doc_3| 3| 0| 0|
| doc_2| 6| 10| 6|
+--------+-----+---+----+
Upvotes: 1