Reputation: 637
I have a CSV file that I read with PySpark and load into PostgreSQL. One of its fields contains strings with commas and double quotes inside the quoted value, for example:
1. "RACER ""K"", P.L. 9"
2. "JENIS, B. S. ""N"" JENIS, F. T. ""B"" 5"
PySpark parses these values as shown below. The field is split at the embedded comma, so values end up in the wrong columns when I load the data into PostgreSQL, and the script fails.
1. '\"RACER \"\"K\"\"'
2. '\"JENIS, B. S. \"\"N\"\" JENIS'
I am using Spark 2.42. How can this situation be handled in PySpark? Basically, I want the program to ignore commas and double quotes that appear inside a double-quoted field.
Upvotes: 0
Views: 239
Reputation: 563
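You can try removing the commas and quotes with pandas before loading the data into PostgreSQL. Note that this changes the stored values.
You can use str.replace with a regular expression (pass regex=True explicitly, since recent pandas versions treat the pattern as a literal string by default):
df['column_name'] = df['column_name'].str.replace(r"[\"\',]", '', regex=True)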
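An alternative that keeps the original characters is to tell the CSV reader that a doubled quote ("") inside a quoted field is an escaped quote. Spark's CSV reader uses a backslash as its default escape character, which is likely why the values above are split at the embedded comma. The sketch below first checks the sample values with Python's standard csv module, then defines a helper that passes the quote and escape options to spark.read.csv. The helper name read_quoted_csv, the path argument, the extra id column in the sample data, and the header=False setting are assumptions made for illustration, not details from the question.

```python
import csv
import io

# Sample values from the question as they would appear in the raw file.
# Assumption: each row has an id column followed by the quoted name column.
raw = (
    '1,"RACER ""K"", P.L. 9"\n'
    '2,"JENIS, B. S. ""N"" JENIS, F. T. ""B"" 5"\n'
)

# Python's csv module treats "" inside a quoted field as one literal quote
# by default, so each line parses into exactly two fields.
rows = list(csv.reader(io.StringIO(raw)))


def read_quoted_csv(spark, path):
    """Read a CSV whose quoted fields escape quotes by doubling them.

    `spark` is an existing SparkSession; `path` is a hypothetical file location.
    """
    return spark.read.csv(
        path,
        header=False,  # assumption: the file has no header row
        quote='"',     # fields are wrapped in double quotes
        escape='"',    # "" inside a quoted field means a literal quote
    )
```

With an active session, the helper could be called as read_quoted_csv(spark, "/path/to/file.csv"). If the quoted fields can also contain line breaks, the reader's multiLine option may be needed as well.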
Upvotes: 1