arya.s

Reputation: 121

Unable to trim empty space in pyspark dataframe

While loading data from Oracle and writing to PostgreSQL, I am facing a weird issue. I am unable to write a string containing spaces to Postgres. I am getting the error below:

Caused by: java.sql.BatchUpdateException: Batch entry 0 INSERT INTO xyz("col1","col2") VALUES ('7643'::numeric,'xyz/xyz xyzxy xyz/xyz xyzxy ') was aborted: ERROR: invalid byte sequence for encoding "UTF8": 0x00  Call getNextException to see other errors in the batch

So I am trying to trim the column in the dataframe, but that is not working. The data is the same before and after trimming.

from pyspark.sql.functions import trim, col

data = data.withColumn("trimmed", trim(col("col2")))

I am very new to pyspark and data cleaning; any help is highly appreciated.

Upvotes: 0

Views: 1634

Answers (1)

danielsepulvedab

Reputation: 674

The trim is probably working just fine. The trim function only removes spaces from both ends of the string, and that isn't actually your problem. The issue is that Postgres doesn't accept the NULL character (i.e. 0x00) in text values, and it looks like you have some in your col2. Try cleaning those out first with something like:

from pyspark.sql.functions import regexp_replace, col
data = data.withColumn("col2", regexp_replace(col("col2"), "\u0000", ""))
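The same behavior can be seen in plain Python, which makes it clear why trimming alone doesn't help: stripping only removes whitespace from the ends, while the NUL character has to be replaced explicitly. A minimal sketch (plain str methods stand in here for Spark's trim and regexp_replace column functions):

```python
# A string with a trailing NUL character, like the bad values in col2
s = "xyz/xyz xyzxy \x00"

# strip() (the plain-Python analogue of Spark's trim) removes spaces
# and other whitespace from the ends, but NOT the NUL character,
# because "\x00" is not classified as whitespace
stripped = s.strip()
print("\x00" in stripped)   # True - the NUL byte survives trimming

# Removing it explicitly, as regexp_replace(col("col2"), "\u0000", "") does
cleaned = s.replace("\x00", "")
print("\x00" in cleaned)    # False
```

This is why the before/after comparison in the question showed no difference: the offending character was never whitespace to begin with.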

Upvotes: 1
