lampShadesDrifter

Reputation: 4159

How to get datatype for specific field name from schema attribute of pyspark dataframe (from parquet files)?

I have a folder of parquet files that I am reading into a pyspark session. How can I inspect / parse the individual schema field types and other info (e.g. for the purpose of comparing schemas between dataframes to see exact type differences)?

I can see the parquet schema and specific field names with something like...

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
sparkSession = SparkSession.builder.appName("data_debugging").getOrCreate()

df = sparkSession.read.option("header", "true").parquet("hdfs://hw.co.local:8020/path/to/parquets")
df.schema # or df.printSchema()
df.schema.fieldNames()  # fieldNames() lives on the schema, not on the DataFrame itself

So I can see the schema

StructType(List(StructField(SOME_FIELD_001,StringType,true),StructField(SOME_FIELD_002,StringType,true),StructField(SOME_FIELD_003,StringType,true)))

but not sure how to get the values for specific fields, eg. something like...

df.schema.getType("SOME_FIELD_001")
or
df.schema.getData("SOME_FIELD_001")  #type: dict

Does anyone know how to do something like this?

Upvotes: 2

Views: 2212

Answers (2)

noufel13

Reputation: 663

Method 1:

You can use the df.dtypes attribute to get each field name along with its datatype, and the result can be converted to a dict object as shown below,

myschema = dict(df.dtypes)

Now you can look up the datatype for a specific field as shown below,

myschema.get('some_field_002')

Output:

'string'
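
Since the question is ultimately about comparing schemas across dataframes, here is a minimal sketch of how the dtypes dict can be used for that; df2 is assumed to be a second dataframe loaded the same way as df, and the names schema1, schema2, mismatched and the example field are illustrative only,

schema1 = dict(df.dtypes)   # {column name: type string} for the first dataframe
schema2 = dict(df2.dtypes)  # same for a second (assumed) dataframe df2

# columns present in both dataframes whose type strings differ
mismatched = {col: (schema1[col], schema2[col])
              for col in schema1.keys() & schema2.keys()
              if schema1[col] != schema2[col]}
print(mismatched)  # e.g. {'some_field_003': ('string', 'bigint')}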

Method 2:

Alternatively, if you want the datatypes as pyspark.sql.types objects, you can use the df.schema attribute and create a custom schema dictionary as shown below,

myschema = dict(map(lambda x: (x.name, x.dataType), df.schema.fields))

print(myschema.get('some_field_002'))

Output:

StringType
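
As a follow-up sketch (not part of the original answer): because the values in this dict are real pyspark.sql.types objects, they can be compared directly against type instances, or against the corresponding entry in a dict built the same way from another dataframe (df2 and myschema2 below are assumed names),

from pyspark.sql.types import StringType

myschema.get('some_field_002') == StringType()  # True if the column is a string

# compare against another dataframe's schema dict built the same way
myschema2 = dict(map(lambda x: (x.name, x.dataType), df2.schema.fields))
myschema.get('some_field_002') == myschema2.get('some_field_002')  # True/False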

Upvotes: 0

murtihash

Reputation: 8410

This function collects (name, type, nullability) in a dict and makes it easy to look up that info by column name of a dataframe.

If the dataframe is named df, the metadata dict will be called df.meta

def metadata(name):  # function for getting metadata in a dict
    types = [str(f.dataType) for f in name.schema.fields]  # datatype of each column
    null = [str(f.nullable) for f in name.schema.fields]   # nullability of each column
    both = [list(pair) for pair in zip(types, null)]       # combine type + nullability
    return dict(zip(name.columns, both))                   # {column name: [type, nullability]}

df.meta = metadata(df)    # final dict is called df.meta
                          # for a dataframe df2, metadata(df2) becomes df2.meta

Now you can compare column info across different dataframes.

example:

Input: df.meta
Output: {'col1': ['StringType', 'True'],
         'col2': ['StringType', 'True'],
         'col3': ['LongType', 'True'],
         'col4': ['StringType', 'True']}

#get column info
Input: df.meta['col1']
Output: ['StringType', 'True']

#compare column type + nullability
Input: df.meta['col1'] == df2.meta['col1']
Output: True/False


#compare only column type
Input: df.meta['col1'][0] == df2.meta['col1'][0]
Output: True/False

#compare only nullability
Input: df.meta['col1'][1] == df2.meta['col1'][1]
Output: True/False
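
Building on this, a minimal sketch for listing every column whose type or nullability differs between two dataframes, assuming df.meta and df2.meta were both built with the metadata function above (schema_diff and df2 are illustrative names, not from the original answer):

def schema_diff(meta1, meta2):  # compare two metadata dicts column by column
    diffs = {}
    for col in set(meta1) | set(meta2):
        if meta1.get(col) != meta2.get(col):  # .get() returns None if a column is missing
            diffs[col] = (meta1.get(col), meta2.get(col))
    return diffs

schema_diff(df.meta, df2.meta)
# e.g. {'col3': (['LongType', 'True'], ['StringType', 'True'])}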

Upvotes: 1
