USB

Reputation: 6139

AssertionError: dataType StringType() should be an instance of <class 'pyspark.sql.types.DataType'> in pyspark

I am trying to generalize the schema for creating empty tables in pyspark. My list holds the column name and datatype separated by a space.

Below is my code.

I could generalize the column name, but I am not able to cast the type.

from pyspark.sql.types import *
tblColumns = [  'emp_name StringType()'
              , 'confidence DoubleType()'
              , 'addressType StringType()'
              , 'reg StringType()'
              , 'inpindex IntegerType()'
              ]

def createEmptyTable(tblColumns):
  structCols = [StructField(colName.split(' ')[0], (colName.split(' ')[1]), True)
    for colName in tblColumns]
  print('Returning cols', structCols)
  return(structCols)
createEmptyTable(tblColumns)

It gives the error below.

AssertionError: dataType StringType() should be an instance of <class 'pyspark.sql.types.DataType'>

Is there a way to make the datatype generic?

Upvotes: 0

Views: 2425

Answers (1)

Benny Elgazar

Reputation: 361

Yes, it's throwing an error because the type is a string, not a DataType instance. Instead of passing (colName.split(' ')[1]) directly, you should map the string to the actual type through a mapping table, for example:

from pyspark.sql.types import *

# Map the type names used in tblColumns to their constructors;
# add an entry here for every type you need.
datatype = {
  'StringType': StringType,
  'DoubleType': DoubleType,
  'IntegerType': IntegerType,
}


def createEmptyTable(tblColumns):
  # Strip the trailing "()" so the type name matches a mapping key.
  structCols = [StructField(colName.split(' ')[0],
                            datatype[colName.split(' ')[1].rstrip('()')](), True)
                for colName in tblColumns]
  return structCols

This approach should work; be aware that you will have to declare a mapping entry for every type you use.
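
For completeness, a minimal usage sketch: it wraps the returned fields in a StructType and creates the empty table the question asks for. It assumes the datatype mapping and createEmptyTable from above, plus a running SparkSession (here obtained with getOrCreate()).

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build the schema from the generalized column list.
schema = StructType(createEmptyTable(tblColumns))

# Create an empty DataFrame (table) with that schema.
emptyDF = spark.createDataFrame([], schema)
emptyDF.printSchema()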

Upvotes: 1
