Reputation: 29
I have this code:
import csv
from pyspark.sql.types import StructType, StructField, DoubleType

customSchema = StructType([
    StructField("a", DoubleType(), True),
    StructField("b", DoubleType(), True),
    StructField("c", DoubleType(), True),
    StructField("d", DoubleType(), True)])
n_1 = sc.textFile("/path/*.txt")\
    .mapPartitions(lambda partition: csv.reader([line.replace('\0','') for line in partition], delimiter=';', quotechar='"'))\
    .filter(lambda line: len(line) > 1)\
    .toDF(customSchema)
which should create a DataFrame. The problem is that the rows coming out of '.mapPartitions' contain values of type <class 'str'> by default, and I need to cast them to DoubleType before converting the RDD into a DataFrame. Any ideas?
Sample data
[['0,01', '344,01', '0,00', '0,00']]
or just work with the RDD, without the DataFrame conversion:
n_1 = sc.textFile("/path/*.txt")\
    .mapPartitions(lambda partition: csv.reader([line.replace('\0','') for line in partition], delimiter=';', quotechar='"'))\
    .filter(lambda line: len(line) > 1)
Upvotes: 1
Views: 10993
Reputation: 29
First, it was necessary to collect all the elements and create a matrix (a list of lists) using the second option.
n_1 = sc.textFile("/path/*.txt")\
    .mapPartitions(lambda partition: csv.reader([line.replace('\0','') for line in partition], delimiter=';', quotechar='"'))\
    .filter(lambda line: len(line) > 1)
matrix = n_1.collect()
Once we have this, it is necessary to know what type of data the sublists contain (in my case it was 'str').
matrix = [[x.replace(',', '.') for x in i] for i in matrix]  # replace ',' with '.' so the strings can be parsed as floats
matrix = [[float(x) for x in i] for i in matrix]  # convert every sublist element to float
df = sc.parallelize(matrix).toDF()
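Note that collect() pulls every row onto the driver, which may not fit in memory for large inputs. A possible alternative, sketched here under the assumption that every field is a decimal-comma number like the ones in the sample data, is to convert each row inside the RDD and only then call toDF. The helper names parse_decimal_comma and to_float_row are hypothetical, not part of PySpark.

```python
def parse_decimal_comma(value):
    """Convert a decimal-comma string such as '344,01' to a float.

    Assumption: the value has no thousands separators and is never empty;
    anything else raises ValueError.
    """
    return float(value.replace(',', '.'))


def to_float_row(fields):
    """Convert every field of one parsed CSV row to a float."""
    return [parse_decimal_comma(x) for x in fields]


# Plain-Python check on the sample row from the question (no Spark needed)
sample = ['0,01', '344,01', '0,00', '0,00']
converted = to_float_row(sample)
```

With the n_1 RDD and customSchema from the question, this could presumably be applied as df = n_1.map(to_float_row).toDF(customSchema), so the conversion runs on the executors and nothing is collected to the driver.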
Upvotes: 0