Sumanta

Reputation: 47

Reading .csv data with inconsistent line breaks

I have a very large CSV file that I want to read with PySpark, but I am not able to parse it properly.

Sample CSV:

"keyvalue","rto","state","maker_model","veh_type","veh_class"
"hnjsnjncjssssmj", "OD", "ODISHA", "BAJAJ AUTO", "Private Vehicle", "Car"
"hnjsnjncjssssjj", "OD", "ODISHA", "BAJAJ AUTO
", "Private Vehicle", "Car"
"hnjsnjncjssssmm", "GO", "GOA", "TATA MOTORS", "Private Vehicle", "Bus"


I want to read it like this:

+---------------+-----+---------+--------------+------------------+---------+
|       keyvalue|  rto|    state|   maker_model|          veh_type|veh_class|
+---------------+-----+---------+--------------+------------------+---------+
|hnjsnjncjssssmj| "OD"| "ODISHA"|  "BAJAJ AUTO"| "Private Vehicle"|    "Car"|
|hnjsnjncjssssjj| "OD"| "ODISHA"|  "BAJAJ AUTO"| "Private Vehicle"|    "Car"|
|hnjsnjncjssssmm| "GO"|    "GOA"| "TATA MOTORS"| "Private Vehicle"|    "Bus"|
+---------------+-----+---------+--------------+------------------+---------+

but PySpark is unable to parse the second record properly because of the embedded newline, and it breaks it up like this:

+--------------------+------+---------+--------------+------------------+---------+
|            keyvalue|   rto|    state|   maker_model|          veh_type|veh_class|
+--------------------+------+---------+--------------+------------------+---------+
|     hnjsnjncjssssmj|  "OD"| "ODISHA"|  "BAJAJ AUTO"| "Private Vehicle"|    "Car"|
|     hnjsnjncjssssjj|  "OD"| "ODISHA"|   "BAJAJ AUTO|              null|     null|
|", "Private Vehicle"| "Car"|     null|          null|              null|     null|
|     hnjsnjncjssssmm|  "GO"|    "GOA"| "TATA MOTORS"| "Private Vehicle"|    "Bus"|
+--------------------+------+---------+--------------+------------------+---------+

I have tried various options of Spark's CSV reader, but so far nothing has worked. Please guide me.

Upvotes: 0

Views: 69

Answers (2)

Shubham Sharma

Reputation: 71707

Spark provides some useful options for reading CSV files. In your case we can use:

df = (
    spark.read
    .option('header', True)                    # first line holds the column names
    .option('multiLine', True)                 # allow quoted fields to span multiple lines
    .option('ignoreLeadingWhiteSpace', True)   # drop the space after each comma
    .csv('data.csv')
)

df.show()

+---------------+---+------+------------+---------------+---------+
|       keyvalue|rto| state| maker_model|       veh_type|veh_class|
+---------------+---+------+------------+---------------+---------+
|hnjsnjncjssssmj| OD|ODISHA|  BAJAJ AUTO|Private Vehicle|      Car|
|hnjsnjncjssssjj| OD|ODISHA|BAJAJ AUTO\n|Private Vehicle|      Car|
|hnjsnjncjssssmm| GO|   GOA| TATA MOTORS|Private Vehicle|      Bus|
+---------------+---+------+------------+---------------+---------+

Upvotes: 1

I. Rawlinson

Reputation: 66

If the file isn't too large, you could use a regex to repair the broken lines and then read the fixed file with Spark.

import re

# matches a newline that is NOT preceded by a double quote, i.e. a line
# break inside a quoted field rather than a real record terminator
pattern = r'(?<=[^"])\n'

with open('data.csv', 'r') as data_file:
    content = data_file.read()
    fixed = re.sub(pattern, '', content)

    with open('fixed.csv', 'w') as out_file:
        out_file.write(fixed)
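For illustration, here is the same idea applied directly to the broken record from the question (a minimal sketch using only Python's standard `re` module, with the sample data inlined instead of read from a file):

```python
import re

# A newline preceded by anything other than a double quote is a line
# break inside a quoted field, not a real end of record.
pattern = r'(?<=[^"])\n'

broken = (
    '"keyvalue","rto","state","maker_model","veh_type","veh_class"\n'
    '"hnjsnjncjssssjj", "OD", "ODISHA", "BAJAJ AUTO\n'
    '", "Private Vehicle", "Car"\n'
)

fixed = re.sub(pattern, '', broken)
print(fixed)
# The stray newline inside "BAJAJ AUTO" is removed, so the record
# collapses back onto a single line:
# "hnjsnjncjssssjj", "OD", "ODISHA", "BAJAJ AUTO", "Private Vehicle", "Car"
```

Note that the header line and any line ending in a closing quote are left untouched, because their trailing newline is preceded by `"`.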

Upvotes: 0
