kinkajou

Reputation: 3728

Reading a CSV file with double-quoted fields that contain newlines

I have a CSV file with newlines inside a column. Here is an example:

"A","B","C"
1,"This is csv with 
newline","This is another column"
"This is newline
and another line","apple","cat"

I can read the file in Spark, but the newline inside a column is treated as the start of a new record.

How can I read this as a CSV file with text enclosed in double quotes?

I tried reading the file both with the Apache CSV plugin and with Apache Spark alone.

alarms = sc.textFile(r"D:\Dataset\oneday\oneday.csv")

This gives me the following RDD:

**alarms.take(5)**

[u'A,B,C', u'1,"This is csv with ', u'newline",This is another column', u'"This is newline', u'and another line",apple,cat']

Spark Version : 1.4

Upvotes: 0

Views: 2577

Answers (3)

kinkajou

Reputation: 3728

If you want to create a DataFrame from a CSV file whose double-quoted fields contain newlines, without reinventing the wheel, use the spark-csv and commons-csv libraries:

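from pyspark.sql import SQLContext
# sc is the existing SparkContext (for example, the one the pyspark shell provides)
sqlContext = SQLContext(sc)
df = sqlContext.load(header="true", source="com.databricks.spark.csv", path="hdfs://analytics.com.np:8020/hdp/badcsv.csv")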

Upvotes: 0

Serge Ballesta

Reputation: 148880

The csv module from the Python standard library handles this out of the box:

>>> txt = '''"A","B","C"
1,"This is csv with 
newline","This is another column"
"This is newline
and another line","apple","cat"'''
>>> import csv
>>> import io
>>> with io.BytesIO(txt) as fd:
    rd = csv.reader(fd)
    for row in rd:
        print row


['A', 'B', 'C']
['1', 'This is csv with \nnewline', 'This is another column']
['This is newline\nand another line', 'apple', 'cat']

This can be combined with binaryFiles, which reads each file as a whole, at a significant performance cost compared with textFile:

>>> (sc.binaryFiles(path)
        .values()
        .flatMap(lambda x: csv.reader(io.BytesIO(x))))
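
The parsing step can also be tried on its own, without Spark. The following is a minimal sketch, assuming Python 3 (where csv.reader expects text rather than bytes) and UTF-8 encoded files; the helper name parse_csv_bytes is hypothetical and not part of any library.

```python
import csv
import io


def parse_csv_bytes(content, encoding="utf-8"):
    """Parse the raw bytes of one whole CSV file into a list of rows.

    Newlines inside quoted fields stay inside the field.
    The UTF-8 default is an assumption about the input files.
    """
    text = content.decode(encoding)
    # newline="" keeps StringIO from translating line endings,
    # which is what the csv module documentation recommends.
    return list(csv.reader(io.StringIO(text, newline="")))


sample = (
    b'"A","B","C"\n'
    b'1,"This is csv with \nnewline","This is another column"\n'
    b'"This is newline\nand another line","apple","cat"\n'
)
rows = parse_csv_bytes(sample)
for row in rows:
    print(row)
```

Under the same assumptions, the helper could replace the lambda above: sc.binaryFiles(path).values().flatMap(parse_csv_bytes).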

Upvotes: 2

Ma0

Reputation: 15204

You do not need to import anything. The solution below writes a second file purely for demonstration purposes; you can just as well use each line right after modifying it, without writing it anywhere.

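with open(r'C:\Users\evkouni\Desktop\test_in.csv', 'r') as fin:
    with open(r'C:\Users\evkouni\Desktop\test_out.csv', 'w') as fout:
        cont = fin.readlines()
        # Note: cont[:-1] leaves out the last line of the input file
        for line in cont[:-1]:
            # An odd number of quotes and no quote directly before the newline
            # means the line ends inside a quoted field, so drop its newline
            if line.count('"') % 2 == 1 and '"\n' not in line:
                line = line.replace('\n', '')
            fout.write(line)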

#DEMO

#test_in.csv
#------------
#"A";"B";"C"
#1;"This is csv with 
#newline";"This is another column"
#"This is newline

#test_out.csv
#------------
#"A";"B";"C"
#1;"This is csv with newline";"This is another column"
#"This is newline
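
The same idea can be applied in memory, without a second file. The following is only a sketch of a variation, not the code above: it assumes Python 3 and tracks the quote parity of the accumulated record instead of testing each line separately. The name merge_quoted_lines is hypothetical. Like the code above, it drops the embedded newline instead of keeping it.

```python
def merge_quoted_lines(lines):
    """Yield logical CSV records, joining physical lines that fall inside quotes.

    A record is complete once the number of double quotes collected so far
    is even. Escaped quotes ("") come in pairs, so they do not change parity.
    """
    buffer = ""
    for line in lines:
        buffer += line
        if buffer.count('"') % 2 == 0:
            # All quotes are closed: the record is complete.
            yield buffer.rstrip("\n")
            buffer = ""
        else:
            # Still inside a quoted field: drop the newline and keep reading.
            buffer = buffer.rstrip("\n")
    if buffer:
        # Input ended inside an unterminated quoted field: emit what is left.
        yield buffer


sample = [
    '"A";"B";"C"\n',
    '1;"This is csv with \n',
    'newline";"This is another column"\n',
    '"This is newline\n',
    'and another line";"apple";"cat"\n',
]
records = list(merge_quoted_lines(sample))
for record in records:
    print(record)
```

Because the function accepts any iterable of lines, an open file object can be passed to it directly.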

If there is something that's not clear to you, please let me know.

Upvotes: 0
