Reputation: 91
When I use df.show() to view a PySpark DataFrame in a Jupyter notebook, it shows me this:
+---+-------+-------+-------+------+-----------+-----+-------------+-----+---------+----------+-----+-----------+-----------+--------+---------+-------+------------+---------+------------+---------+---------------+------------+---------------+---------+------------+
| Id|groupId|matchId|assists|boosts|damageDealt|DBNOs|headshotKills|heals|killPlace|killPoints|kills|killStreaks|longestKill|maxPlace|numGroups|revives|rideDistance|roadKills|swimDistance|teamKills|vehicleDestroys|walkDistance|weaponsAcquired|winPoints|winPlacePerc|
+---+-------+-------+-------+------+-----------+-----+-------------+-----+---------+----------+-----+-----------+-----------+--------+---------+-------+------------+---------+------------+---------+---------------+------------+---------------+---------+------------+
| 0| 24| 0| 0| 5| 247.3000| 2| 0| 4| 17| 1050| 2| 1| 65.3200| 29| 28| 1| 591.3000| 0| 0.0000| 0| 0| 782.4000| 4| 1458| 0.8571|
| 1| 440875| 1| 1| 0| 37.6500| 1| 1| 0| 45| 1072| 1| 1| 13.5500| 26| 23| 0| 0.0000| 0| 0.0000| 0| 0| 119.6000| 3| 1511| 0.0400|
| 2| 878242| 2| 0| 1| 93.7300| 1| 0| 2| 54| 1404| 0| 0| 0.0000| 28| 28| 1| 0.0000| 0| 0.0000| 0| 0| 3248.0000| 5| 1583| 0.7407|
| 3|1319841| 3| 0| 0| 95.8800| 0| 0| 0| 86| 1069| 0| 0| 0.0000| 97| 94| 0| 0.0000| 0| 0.0000| 0| 0| 21.4900| 1| 1489| 0.1146|
| 4|1757883| 4| 0| 1| 0.0000| 0| 0| 1| 58| 1034| 0| 0| 0.0000| 47|
How can I get a formatted DataFrame, like a pandas DataFrame, so I can view the data more easily?
Upvotes: 9
Views: 14327
Reputation: 27950
As @sat mentioned in their answer, you can use:
df.toPandas()
or, better, limit the rows first:
df.limit(10).toPandas()
# where 10 is the number of rows
to convert your DataFrame into a pandas DataFrame.
However, if you want to view your data in PySpark itself, you can use:
df.show(10, truncate=False)
If you want to see each row of your DataFrame individually, use:
df.show(10, vertical=True)
You can also find the total number of records with:
df.count()
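Putting these together, here is a minimal runnable sketch (assuming a local SparkSession; the toy data is made up for illustration, reusing a few column names from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# A small toy DataFrame standing in for the question's data.
df = spark.createDataFrame(
    [(0, 2, 0.8571), (1, 1, 0.0400), (2, 0, 0.7407)],
    ["Id", "kills", "winPlacePerc"],
)

df.show(10, truncate=False)  # plain-text table, columns not truncated
df.show(10, vertical=True)   # one field per line for each row
print(df.count())            # total number of rows
df.limit(10).toPandas()      # last expression in a cell renders as an HTML table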
Upvotes: 1
Reputation: 623
You can convert a PySpark DataFrame directly to a pandas DataFrame. The command for that is:
df.limit(10).toPandas()
This yields the result as a pandas DataFrame; you just need to have the pandas package installed.
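For example, in a notebook cell (a sketch assuming df is the PySpark DataFrame from the question):

pdf = df.limit(10).toPandas()  # pandas DataFrame holding the first 10 rows
pdf  # as the last expression in the cell, Jupyter renders it as an HTML table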
Upvotes: 10
Reputation: 347
You have to use the code below:
from IPython.display import display
import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
display(df)  # renders the DataFrame as an HTML table in Jupyter
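To apply the same idea to a PySpark DataFrame, convert a limited slice to pandas first and display that (a sketch; spark_df stands for the PySpark DataFrame from the question):

from IPython.display import display

# Convert only the first 10 rows, then render them as an HTML table.
display(spark_df.limit(10).toPandas())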
Upvotes: 0