Reputation: 3728
I have a CSV file with newlines inside a column. Here is my example:
"A","B","C"
1,"This is csv with
newline","This is another column"
"This is newline
and another line","apple","cat"
I can read the file in Spark, but the newline inside the column is treated as a record separator, so one field ends up split across two lines.
How can I read this as a CSV in which text enclosed in double quotes may contain newlines?
I have tried reading the file both with the Apache CSV plugin and with plain Apache Spark.
alarms = sc.textFile(r"D:\Dataset\oneday\oneday.csv")
This gives me an RDD:
alarms.take(5)
[u'A,B,C', u'1,"This is csv with ', u'newline",This is another column', u'"This is newline', u'and another line",apple,cat']
Spark Version : 1.4
Upvotes: 0
Views: 2577
Reputation: 3728
If you want to create a DataFrame from a CSV with newlines inside double-quoted fields without reinventing the wheel, use the spark-csv and commons-csv libraries:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)  # sc is the existing SparkContext
df = sqlContext.load(header="true", source="com.databricks.spark.csv", path="hdfs://analytics.com.np:8020/hdp/badcsv.csv")
Upvotes: 0
Reputation: 148880
The csv module from the Python standard library handles this out of the box (Python 2 shown):
>>> txt = '''"A","B","C"
1,"This is csv with
newline","This is another column"
"This is newline
and another line","apple","cat"'''
>>> import csv
>>> import io
>>> with io.BytesIO(txt) as fd:
...     rd = csv.reader(fd)
...     for row in rd:
...         print row
...
['A', 'B', 'C']
['1', 'This is csv with \nnewline', 'This is another column']
['This is newline\nand another line', 'apple', 'cat']
This can be used with binaryFiles (at a significant performance penalty compared to textFile):
>>> (sc.binaryFiles(path)
...     .values()
...     .flatMap(lambda x: csv.reader(io.BytesIO(x))))
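On Python 3, `csv.reader` expects text rather than bytes, so the same idea might look like the sketch below. The helper name `parse_csv_bytes` and the UTF-8 default are assumptions, not part of the answer above; the function takes the raw content of one file and returns its parsed rows.

```python
import csv
import io


def parse_csv_bytes(data, encoding="utf-8"):
    """Parse the raw bytes of one CSV file into a list of rows.

    Hypothetical helper; the UTF-8 default is an assumption about the input.
    """
    # newline="" leaves line breaks untouched so the csv module can handle
    # the ones that fall inside quoted fields
    text = io.StringIO(data.decode(encoding), newline="")
    return list(csv.reader(text))


# Same data as the question's example, as bytes
sample = (
    b'"A","B","C"\n'
    b'1,"This is csv with\nnewline","This is another column"\n'
    b'"This is newline\nand another line","apple","cat"\n'
)
rows = parse_csv_bytes(sample)
for row in rows:
    print(row)
```

With Spark, such a helper could presumably replace the lambda, as in `sc.binaryFiles(path).values().flatMap(parse_csv_bytes)`. Each file is still parsed as a whole, so the performance caveat is unchanged.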
Upvotes: 2
Reputation: 15204
You do not need to import anything. The solution proposed below creates a second file just for demonstration purposes; you can just as well use each line right after modifying it, without writing it anywhere.
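with open(r'C:\Users\evkouni\Desktop\test_in.csv', 'r') as fin:
    with open(r'C:\Users\evkouni\Desktop\test_out.csv', 'w') as fout:
        cont = fin.readlines()
        for line in cont:  # iterate over every line, including the last one
            # An odd quote count with no closing quote at the end of the line
            # means the line break falls inside a quoted field, so drop it
            if line.count('"') % 2 == 1 and '"\n' not in line:
                line = line.replace('\n', '')
            fout.write(line)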
#DEMO
#test_in.csv
#------------
#"A";"B";"C"
#1;"This is csv with
#newline";"This is another column"
#"This is newline
#test_out.csv
#------------
#"A";"B";"C"
#1;"This is csv with newline";"This is another column"
#"This is newline
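The quote-counting check above works line by line. A variant that keeps track of whether a quote is still open across several lines might look like the following sketch. The function name `join_quoted_lines` and the sample data (including the trailing spaces before the line breaks) are assumptions for illustration; it also assumes that quotes inside a field are escaped by doubling them.

```python
def join_quoted_lines(lines, quote='"'):
    """Yield logical records, merging physical lines while a quote is open.

    Hypothetical helper. Assumes quotes inside fields are escaped by
    doubling (""), which leaves the open/closed state unchanged.
    """
    buffer = []
    open_quote = False
    for line in lines:
        # An odd number of quote characters flips the open/closed state
        if line.count(quote) % 2 == 1:
            open_quote = not open_quote
        if open_quote:
            # Still inside a quoted field: drop the line break, keep reading
            buffer.append(line.rstrip("\n"))
        else:
            buffer.append(line)
            yield "".join(buffer)
            buffer = []
    if buffer:
        # Unterminated quote at end of input: emit what was collected
        yield "".join(buffer)


# Assumed sample, modelled on the demo file above
sample = [
    '"A";"B";"C"\n',
    '1;"This is csv with \n',
    'newline";"This is another column"\n',
    '"This is newline \n',
    'and another line";"apple";"cat"\n',
]
records = list(join_quoted_lines(sample))
for record in records:
    print(record, end="")
```

Unlike the per-line check, this also handles a quoted field that spans more than two physical lines, at the cost of carrying state between lines.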
If there is something that's not clear to you, please let me know.
Upvotes: 0