Reputation: 77
I'm working in an Apache Spark (Python) workspace in Databricks. The idea is to read a CSV file and format each field.
So, the first step was to read the csv:
uber = sc.textFile("dbfs:/mnt/uber/201601/pec2/uber_curated.csv")
The next step was to convert each line to a list of values:
uber_parsed = uber.map(lambda lin:lin.split(","))
print (uber_parsed.first())
The result was:
[u'B02765', u'2015-05-08 19:05:00', u'B02764', u'262', u'Manhattan', u'Yorkville East']
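As an aside, a plain split(",") breaks if a field ever contains a quoted comma; Python's standard csv module parses such lines correctly. A minimal sketch (parse_line is a hypothetical helper, not part of the original code):

```python
import csv
from io import StringIO

# Hypothetical helper: parse one CSV line, honoring quoted fields,
# unlike a naive line.split(",").
def parse_line(line):
    return next(csv.reader(StringIO(line)))

print(parse_line('B02765,2015-05-08 19:05:00,B02764,262,Manhattan,Yorkville East'))
# ['B02765', '2015-05-08 19:05:00', 'B02764', '262', 'Manhattan', 'Yorkville East']
```

It could be applied with uber.map(parse_line) in place of the split above, though for well-behaved data the simple split works too.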
But now I need to convert each item of each row to the following types: String, Date, String, Integer, String, String.
[[u'B02765', u'2015-05-08 19:05:00', u'B02764', u'262', u'Manhattan', u'Yorkville East'],
[u'B02767', u'2015-05-08 19:05:00', u'B02789', u'400', u'New York', u'Yorkville East']]
Does anybody know how to do it?
Upvotes: 2
Views: 1162
Reputation: 1712
You can use the csv reader. In Spark 1.x you'll need an external dependency (spark-csv).
from pyspark.sql.types import *
sqlContext.read.format("csv").schema(StructType([
    StructField("_1", StringType()),
    StructField("_2", TimestampType()),
    StructField("_3", StringType()),
    StructField("_4", IntegerType()),
    StructField("_5", StringType()),
    StructField("_6", StringType()),
])).load("dbfs:/mnt/uber/201601/pec2/uber_curated.csv").rdd
or
sqlContext.read.format("csv").schema(StructType([
    StructField("_1", StringType()),
    StructField("_2", DateType()),
    StructField("_3", StringType()),
    StructField("_4", IntegerType()),
    StructField("_5", StringType()),
    StructField("_6", StringType()),
])).option("dateFormat", "yyyy-MM-dd HH:mm:ss").load(
    "dbfs:/mnt/uber/201601/pec2/uber_curated.csv"
).rdd
You can replace (_1, _2, ... _n) with descriptive field names.
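If you prefer to stay at the RDD level, the per-field conversion the schema performs can be sketched in plain Python and applied with a map. A minimal illustration (convert_row is a hypothetical helper, and the timestamp format is assumed from the sample data in the question):

```python
from datetime import datetime

# Hypothetical helper mirroring the schema above:
# String, Timestamp, String, Integer, String, String.
def convert_row(fields):
    return [
        fields[0],                                          # String
        datetime.strptime(fields[1], "%Y-%m-%d %H:%M:%S"),  # Timestamp
        fields[2],                                          # String
        int(fields[3]),                                     # Integer
        fields[4],                                          # String
        fields[5],                                          # String
    ]

row = ['B02765', '2015-05-08 19:05:00', 'B02764', '262', 'Manhattan', 'Yorkville East']
converted = convert_row(row)
print(converted[1], converted[3])  # 2015-05-08 19:05:00 262
```

This could be applied with uber_parsed.map(convert_row), though the DataFrame reader above is the more robust route since it also handles malformed records and quoting.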
Upvotes: 1