Reputation: 41
i have json file that contain some data, i converted this json to pyspark dataframe(i chose some columns not all of them) this is my code:
import os
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SparkSession
import json
from pyspark.sql.functions import col
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
df=spark.read.json("/Users/deemaalomair/PycharmProj
ects/first/deema.json").select('full_text',
'retweet_count', 'favorite_count')
c=df.count()
print(c)
df.orderBy(["retweet_count", "favorite_count"], ascending=[0, 0]).show(10)
and this is the output:
+--------------------+-------------+--------------+
| full_text|retweet_count|favorite_count|
+--------------------+-------------+--------------+
|Check out this in...| 388| 785|
|Review – Apple Ai...| 337| 410|
|This #iPhone atta...| 159| 243|
|March is #Nationa...| 103| 133|
|📱
Amazing vide...| 87| 139|
|Business email wi...| 86| 160|
|#wallpapers #iPho...| 80| 385|
|#wallpapers #iPho...| 71| 352|
|#wallpapers #iPho...| 57| 297|
|Millions of #iPho...| 46| 52|
+--------------------+-------------+--------------+
only showing top 10 rows
Q1: now i need to order this dataframe by descending order for two columns at the same time ('retweet_count' , 'favorite_count')
i tried multiple functions like the one above and `
Cols = ['retweet_count','favorite_count']
df = df.OrderBy(cols,ascending=False).show(10)
but all of them just order the first columns and skip the second ! i do not what i am doing wrong. i know there are lot of same questions , but i tried everything before posting the question here!
Q2: dataframe output for the fulltext is shorten how can i print the whole text ?
Upvotes: 2
Views: 7673
Reputation: 173
If you are trying to see the descending values in two columns simultaneously, that is not going to happen as each column has it's own separate order.
In the above data frame you can see that both the retweet_count and favorite_count has it's own order. This is the case with your data.
>>> import os
>>> from pyspark import SparkContext
>>> from pyspark.streaming import StreamingContext
>>> from pyspark.sql import SparkSession
>>> sc = SparkContext.getOrCreate()
>>> spark = SparkSession(sc)
>>> df = spark.read.format('csv').option("header","true").load("/home/samba693/test.csv")
>>> df.show()
+---------+-------------+--------------+
|full_text|retweet_count|favorite_count|
+---------+-------------+--------------+
| abc| 45| 45|
| def| 50| 40|
| ghi| 50| 39|
| jkl| 50| 41|
+---------+-------------+--------------+
When we are applying order by based on two columns, what exactly is happening is, it is ordering by based on the first column, if there is a tie, it is taking the second column's value into consideration. But this might be not what you are seeing for. You are seeing for sorting both the columns based on their sum.
>>> df.orderBy(["retweet_count", "favorite_count"], ascending=False).show()
+---------+-------------+--------------+
|full_text|retweet_count|favorite_count|
+---------+-------------+--------------+
| jkl| 50| 41|
| def| 50| 40|
| ghi| 50| 39|
| abc| 45| 45|
+---------+-------------+--------------+
One way to work around this is adding a new column with sum of these both column and apply orderby on the new column and remove the new column after ordering.
>>> from pyspark.sql.functions import expr
>>> df1 = df.withColumn('total',expr("retweet_count+favorite_count"))
>>> df1.show()
+---------+-------------+--------------+-----+
|full_text|retweet_count|favorite_count|total|
+---------+-------------+--------------+-----+
| abc| 45| 45| 90.0|
| def| 50| 40| 90.0|
| ghi| 50| 39| 89.0|
| jkl| 50| 41| 91.0|
+---------+-------------+--------------+-----+
Ordering by using new column and removing it later
>>> df2 = df1.orderBy("total", ascending=False)
>>> df2.show()
+---------+-------------+--------------+-----+
|full_text|retweet_count|favorite_count|total|
+---------+-------------+--------------+-----+
| jkl| 50| 41| 91.0|
| abc| 45| 45| 90.0|
| def| 50| 40| 90.0|
| ghi| 50| 39| 89.0|
+---------+-------------+--------------+-----+
>>> df = df2.select("full_text","retweet_count","favorite_count")
>>> df.show()
+---------+-------------+--------------+
|full_text|retweet_count|favorite_count|
+---------+-------------+--------------+
| jkl| 50| 41|
| abc| 45| 45|
| def| 50| 40|
| ghi| 50| 39|
+---------+-------------+--------------+
Hope this helps!
Upvotes: 3