Reputation: 894
I have a CSV file that contains multiple fields. For a few of these fields the data comes in as byte-array literals (b'1234'). I want to remove the b'' wrapper, i.e. convert the byte array to a string.
I came to know that we can convert a byte array to a string in two ways:
>>> s1 = b'Hi'
>>> s2 = s1.decode('utf-8')
>>> print(s2)
Hi
>>> s1 = b'Hi'
>>> s2 = str(s1, 'utf-8')
>>> print(s2)
Hi
As there are many fields in the CSV and only a few of them contain byte arrays, I can't directly apply the conversion to each and every field. I have no idea which fields are byte arrays and which are string or int fields.
Any suggestions on how to convert these byte arrays to strings in the CSV file? I'm trying to do this in Spark.
My code snippet:
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferSchema='true') \
    .option("delimiter", ",") \
    .option("multiLine", "true") \
    .load("file.csv")
Input Data:
b'1234',b'123',Hi,"Hello",b'2FB',b'272',b'4943',b'786',2018-02-19,,Out,768,"Data",b'502',351,
As the schema changes dynamically, we have no way of knowing which fields are byte arrays and which are strings. I tried this solution, however it didn't work for me (it converted all the fields to nan).
Upvotes: 4
Views: 10018
Reputation: 41957
As you said, you have a csv file containing byte literals such as
b'1234',b'123',Hi,"Hello",b'2FB',b'272',b'4943',b'786',2018-02-19,,Out,768,"Data",b'502',351,
The straightforward solution I see is to replace the b' and ' strings with empty strings and parse the data to form a dataframe.
rdd = sc.textFile("path to your csv file")\
    .map(lambda x: x.replace("b'", "").replace("'", ""))
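To then form the dataframe from the cleaned rdd, here is a minimal sketch (assuming the first line is a header, as in the question's options, and that no quoted field contains a comma; all columns come back as strings):
header = rdd.first()
rows = rdd.filter(lambda line: line != header).map(lambda line: line.split(","))
df = sqlContext.createDataFrame(rows, header.split(","))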
Updated
As @ixaxaar commented, a better way is to do lambda x: x[2:-1]
So you can just do
rdd = sc.textFile("path to your csv file").map(lambda x: x[2:-1])
Upvotes: 1