Reputation: 101
I am currently using Spark to read data from BigQuery tables and write it to a storage bucket as CSV. The issue I am facing is that null string values are not handled properly. Spark reads the null string values from BigQuery, but in the CSV it writes each of them as an empty string in double quotes (i.e. like this: "").
# Load data from BigQuery.
bqdf = spark.read.format('bigquery') \
    .option('table', <bq_dataset> + <bq_table>) \
    .load()
bqdf.createOrReplaceTempView('bqdf')
# Select required data into another df
bqdf2 = spark.sql(
    'SELECT * FROM bqdf')
# Write to GCS
bqdf2.write.csv(<gcs_data_path> + <bq_table> + '/', mode='overwrite', sep='|')
I have tried the emptyValue='' and nullValue options with df.write.csv() when writing the CSV, but they don't work.
I need a solution for this problem. If anyone else has faced this issue and could help, I would appreciate it. Thanks!
Upvotes: 0
Views: 1051
Reputation: 4051
I was able to reproduce your case and I found a solution that worked with a sample table I created in BigQuery. The data is as follows:
According to the PySpark documentation, in the class pyspark.sql.DataFrameWriter(df), there is an option called nullValue:
nullValue – sets the string representation of a null value. If None is set, it uses the default value, empty string.
This is the option you are looking for, so I set nullValue in the code below.
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext()
spark = SparkSession(sc)
# Read the data from BigQuery as a Spark DataFrame.
data = spark.read.format("bigquery") \
    .option("table", "dataset.table") \
    .load()
# Create a view so that Spark SQL queries can be run against the data.
data.createOrReplaceTempView("data_view")
# Select required data into another df
data_view2 = spark.sql(
    'SELECT * FROM data_view')
# Write to GCS. write.csv() returns None, so the result is not assigned.
data_view2.write.csv('gs://bucket/folder', header=True, nullValue='')
# Print the DataFrame to check that the null was read correctly.
data_view2.show()
Notice that I used data_view2.show() to print the view and check that it was read correctly. The output was:
+------+---+
|name |age|
+------+---+
|Robert| 25|
|null | 23|
+------+---+
So the null value was read correctly. I also checked the .csv file:
name,age
Robert,25
,23
As you can see, the null value is written as an empty field and not as an empty string in double quotes. As a final check, I created a load job from this .csv file into BigQuery. The table was created and the null value was interpreted correctly.
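If you want to repeat that check on your own output without opening the file by hand, something like the following rough sketch might help. It uses only the Python standard library. The function name classify_empty_fields and the two sample strings are made up for illustration, and it assumes that no field contains the separator or a newline inside quotes.

```python
def classify_empty_fields(csv_text, sep=","):
    """Count bare-empty fields and quoted-empty ("") fields in CSV text.

    Assumption: no field contains the separator or a newline inside quotes,
    so a plain split on the separator is good enough for this check.
    """
    bare = 0
    quoted = 0
    for line in csv_text.splitlines():
        if not line:
            continue  # skip blank lines
        for field in line.split(sep):
            if field == "":
                bare += 1      # written as nothing between separators
            elif field == '""':
                quoted += 1    # written as an empty string in double quotes
    return {"bare_empty": bare, "quoted_empty": quoted}


# Hypothetical samples: the first mirrors the file shown above,
# the second mirrors the symptom described in the question.
sample_ok = "name,age\nRobert,25\n,23\n"
sample_quoted = 'name|age\nRobert|25\n""|23\n'

print(classify_empty_fields(sample_ok))
print(classify_empty_fields(sample_quoted, sep="|"))
```

A non-zero quoted_empty count means the file still contains "" fields, which is the symptom from the question.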
Note: I ran the PySpark job from the Dataproc jobs console on a previously created Dataproc cluster. The cluster was in the same location as the BigQuery dataset.
Upvotes: 0