Reputation: 557
I have a CSV data set, and I am using PySpark to parse it and then create a DataFrame with the code below:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions as f

def parseInput(line):
    fields = line.split(',')
    stationID=fields[0]
    entryType=fields[2]
    temperature= fields[3]*0.3
    return Row(stationID,entryType,temperature)

spark = SparkSession.builder.appName("MinTemperatures").getOrCreate()
lines = spark.sparkContext.textFile("data/1800.csv")
temperatures = lines.map(parseInput)
minTemps=temperatures.filter(lambda x:x[1]=='TMIN')
df = spark.createDataFrame(minTemps)
I got the error below:
TypeError: can't multiply sequence by non-int of type 'float'
Obviously, if I remove the *0.3 from temperature= fields[3]*0.3, creating the DataFrame works. How can I return the temperature as a float and apply some basic math operation to it?
Upvotes: 0
Views: 719
Reputation: 167
You can read the file without the multiplication first, then cast the column to type double, and do the multiplication last.
I assume your CSV file has a header.
The following code does the cast:
data = data.withColumn("COLUMN_NAME", data["COLUMN_NAME"].cast("double"))
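For context, the error most likely arises because line.split(',') returns strings, so fields[3] is a str, and Python cannot multiply a str by a float. As an alternative to casting afterwards, the value could be converted inside the parsing function itself. The following is a minimal sketch in plain Python (no Spark needed to try it); the name parse_input and the sample line are hypothetical, chosen only to match the field positions used in the question. In the original code the tuple would be wrapped in Row(...) as before.

```python
def parse_input(line):
    """Split one CSV line and convert the temperature field to a float."""
    fields = line.split(',')
    station_id = fields[0]
    entry_type = fields[2]
    # split() yields strings, so convert before doing arithmetic.
    # The 0.3 factor is taken unchanged from the question.
    temperature = float(fields[3]) * 0.3
    return (station_id, entry_type, temperature)


# Hypothetical sample line: station ID, date, entry type, raw value
sample = "STATION1,18000101,TMIN,-148"
parsed = parse_input(sample)
print(parsed)
```

If a field can be empty or non-numeric, float() raises ValueError, so real data may need a guard around the conversion.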
Upvotes: 1