Reputation: 3672
Hi, I have a DataFrame as shown below:
ID     X    Y
 1  1234  284
 1  1396  179
 2  8620  178
 3  1620  191
 3  8820  828
I want to split this DataFrame into multiple DataFrames based on ID, so for this example there would be 3 DataFrames. One way to achieve this is to run a filter operation in a loop, but I would like to know if it can be done in a more efficient way.
Upvotes: 11
Views: 25828
Reputation: 97
The answer of @James Tobin needs to be altered a tiny bit if you are working with Python 3.x, as dict.values returns a dict_values view object instead of a list. A quick workaround is just adding the list function:
listids = [list(x.asDict().values())[0]
for x in df.select("ID").distinct().collect()]
Posting as a separate answer as I do not have the reputation required to put a comment on his answer.
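For illustration (this snippet is a plain-Python sketch, with an ordinary dict standing in for the x.asDict() result), here is the behaviour difference that makes the list() wrapper necessary:

```python
# In Python 3, dict.values() returns a dict_values view object,
# which does not support indexing.
row_dict = {"ID": 1}        # stands in for x.asDict()

values = row_dict.values()  # dict_values([1]) in Python 3, a plain list in Python 2
# values[0] would raise TypeError: 'dict_values' object is not subscriptable

first = list(values)[0]     # wrapping in list() restores indexing
print(first)                # 1
```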
Upvotes: 6
Reputation: 3110
# initialize the Spark DataFrame
df = sc.parallelize([(1, 1234, 284), (1, 1396, 179), (2, 8620, 178), (3, 1620, 191), (3, 8820, 828)]).toDF(["ID", "X", "Y"])
# get the list of unique ID values; there's probably a better way to do this, but this was quick and easy
listids = [x.asDict().values()[0] for x in df.select("ID").distinct().collect()]
# create a list of DataFrames, one per ID
dfArray = [df.where(df.ID == x) for x in listids]
dfArray[0].show()
+---+----+---+
| ID|   X|  Y|
+---+----+---+
|  1|1234|284|
|  1|1396|179|
+---+----+---+
dfArray[1].show()
+---+----+---+
| ID|   X|  Y|
+---+----+---+
|  2|8620|178|
+---+----+---+
dfArray[2].show()
+---+----+---+
| ID|   X|  Y|
+---+----+---+
|  3|1620|191|
|  3|8820|828|
+---+----+---+
Upvotes: 10
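As an aside: if the data is small enough to fit on a single machine, the same split is a one-liner with a pandas groupby (this sketch assumes pandas rather than Spark, using the sample data from the question):

```python
import pandas as pd

# the sample data from the question, as a pandas DataFrame
df = pd.DataFrame({"ID": [1, 1, 2, 3, 3],
                   "X": [1234, 1396, 8620, 1620, 8820],
                   "Y": [284, 179, 178, 191, 828]})

# groupby yields one (key, sub-DataFrame) pair per distinct ID in a single pass,
# so there is no need to collect the distinct IDs and filter in a loop
dfDict = {key: group for key, group in df.groupby("ID")}

# dfDict[2] now holds just the rows where ID == 2
```

Note this returns a dict keyed by ID rather than a positional list, which is usually more convenient than remembering which index corresponds to which ID.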