Reputation: 33
I am trying to get percentage frequencies in PySpark. In pandas I did this as follows:
Companies = df['Company'].value_counts(normalize = True)
Getting the frequencies is fairly straightforward:
# Companies in descending order of complaint frequency
df.createOrReplaceTempView('Comp')
CompDF = spark.sql("SELECT Company, count(*) as cnt \
FROM Comp \
GROUP BY Company \
ORDER BY cnt DESC")
CompDF.show()
+--------------------+----+
| Company| cnt|
+--------------------+----+
|BANK OF AMERICA, ...|1387|
| EQUIFAX, INC.|1285|
|WELLS FARGO & COM...|1119|
|Experian Informat...|1115|
|TRANSUNION INTERM...|1001|
|JPMORGAN CHASE & CO.| 905|
| CITIBANK, N.A.| 772|
|OCWEN LOAN SERVIC...| 481|
How do I get to percent frequencies from here? I tried a bunch of things with not much luck. Any help would be appreciated.
Upvotes: 3
Views: 4353
Reputation: 60319
As Suresh implies in the comments, assuming that total_count is the total number of rows in the original dataframe df, you can use withColumn to add a new column named percentage to CompDF:
total_count = df.count()  # total number of rows in the original dataframe
CompDF = CompDF.withColumn('percentage', CompDF.cnt / float(total_count))
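For reference, here is a minimal sketch of the same idea using the DataFrame API instead of the SQL query (assuming df is the original dataframe from the question, with a Company column):

from pyspark.sql import functions as F

total_count = df.count()  # total number of rows
CompDF = (df.groupBy('Company')
            .agg(F.count('*').alias('cnt'))
            .withColumn('percentage', F.col('cnt') / float(total_count))
            .orderBy(F.desc('cnt')))
CompDF.show()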
Upvotes: 3
Reputation: 11
Maybe modifying the SQL query will get you the result you want:
"SELECT Company,cnt/(SELECT SUM(cnt) from (SELECT Company, count(*) as cnt
FROM Comp GROUP BY Company ORDER BY cnt DESC) temp_tab) sum_freq from
(SELECT Company, count(*) as cnt FROM Comp GROUP BY Company ORDER BY cnt
DESC)"
Upvotes: 0