Shankar Panda

Reputation: 832

PySpark - How to calculate the min and max value of each field using PySpark?

I am trying to find the min and max of each field returned by the SQL statement and write them to a CSV file. I would like the result in the format shown below. Could you please help? I have already written this in Python, but now I am trying to convert it to PySpark so it can run directly on the Hadoop cluster.

[Screenshot of the desired output format]

from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import col, max, min, mean, stddev

sc = SparkContext()
hive_context = HiveContext(sc)

#bank = hive_context.table("cip_utilities.file_upload_temp")
data = hive_context.sql("select * from cip_utilities.cdm_variables_dict")

# Register the table's schema so the non-string columns can be picked out
hive_context.sql("describe cip_utilities.cdm_variables_dict").registerTempTable("schema_def")
temp_data = hive_context.sql("select * from schema_def")
temp_data.show()

data1 = hive_context.sql("select col_name from schema_def where data_type <> 'string'")
column_names_as_python_list_of_rows = data1.collect()
#data1.show()

for line in column_names_as_python_list_of_rows:
    # Here I need to calculate min, max, mean etc. for the field named in line.col_name
    pass

Upvotes: 3

Views: 31414

Answers (1)

Neeraj Bhadani

Reputation: 3110

There are different functions you can use to find min and max values. Here is one way to get these statistics for DataFrame columns, using the agg function.

from pyspark.sql.functions import col, max, min

# Aggregate min and max for the required columns in a single pass over the data
df = spark.table("HIVE_DB.HIVE_TABLE")
df.agg(min(col("col_1")), max(col("col_1")), min(col("col_2")), max(col("col_2"))).show()
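If, as in the question, you want these statistics for every non-string column without naming each one, you can build the aggregation list dynamically. This is a minimal sketch, assuming df is the DataFrame above:

from pyspark.sql.functions import col, max, min, mean

# Collect the names of all non-string columns from the DataFrame's schema
numeric_cols = [f.name for f in df.schema.fields if f.dataType.typeName() != "string"]
# One min/max/mean expression per column, all evaluated in a single pass
aggs = [fn(col(c)) for c in numeric_cols for fn in (min, max, mean)]
df.agg(*aggs).show()

The resulting single-row DataFrame can also be written out with df.agg(*aggs).write.csv(path), which covers the CSV requirement in the question.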

However, you can also explore the describe and summary (Spark 2.3 onwards) functions to get basic statistics for the various columns in your DataFrame.
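For example, a minimal sketch reusing the df from the snippet above (the column names are placeholders):

df.describe("col_1", "col_2").show()     # count, mean, stddev, min, max
df.summary("min", "max", "mean").show()  # Spark 2.3+; pick exactly the statistics you need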

Hope this helps.

Upvotes: 11
