Murat

Reputation: 33

How can I obtain percentage frequencies in PySpark?

I am trying to get percentage frequencies in PySpark. I did this in Python (pandas) as follows:

Companies = df['Company'].value_counts(normalize = True)

Getting the frequencies is fairly straightforward:

# Companies in descending order of complaint frequency
df.createOrReplaceTempView('Comp')
CompDF = spark.sql("SELECT Company, count(*) as cnt \
                    FROM Comp \
                    GROUP BY Company \
                    ORDER BY cnt DESC")
CompDF.show()
+--------------------+----+
|             Company| cnt|
+--------------------+----+
|BANK OF AMERICA, ...|1387|
|       EQUIFAX, INC.|1285|
|WELLS FARGO & COM...|1119|
|Experian Informat...|1115|
|TRANSUNION INTERM...|1001|
|JPMORGAN CHASE & CO.| 905|
|      CITIBANK, N.A.| 772|
|OCWEN LOAN SERVIC...| 481|
+--------------------+----+

How do I get to percent frequencies from here? I tried a bunch of things with not much luck. Any help would be appreciated.

Upvotes: 3

Views: 4353

Answers (2)

desertnaut

Reputation: 60319

As Suresh implies in the comments, assuming that total_count is the number of rows in the original dataframe df, you can use withColumn to add a new column named percentage to CompDF:

total_count = df.count()

CompDF = CompDF.withColumn('percentage', CompDF.cnt/float(total_count))
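
Multiply by 100 if you want actual percentages rather than fractions. As a sketch, the same thing can also be done in one pass with an unpartitioned window, so no separate count() action is needed (note that a window without a partition pulls all rows into a single partition, which is fine for a small aggregated frame like this one):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# percentage of all complaints per company, computed in one pass
CompDF = CompDF.withColumn(
    'percentage',
    F.col('cnt') * 100 / F.sum('cnt').over(Window.partitionBy())
)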

Upvotes: 3

sp_user

Reputation: 11

Maybe modifying the SQL query will get you the result you want:

    "SELECT Company,cnt/(SELECT SUM(cnt) from (SELECT Company, count(*) as cnt 
    FROM Comp GROUP BY Company ORDER BY cnt DESC) temp_tab) sum_freq from 
    (SELECT Company, count(*) as cnt FROM Comp GROUP BY Company ORDER BY cnt 
    DESC)"

Upvotes: 0
