Reputation: 55
I currently have some Reddit data in Google BigQuery that I want to run a word count on, covering all the comments from a selection of subreddits. The data is about 90 GiB, so it isn't possible to load it into Datalab directly and turn it into a data frame. I've been advised to run a Hadoop or Spark job on Dataproc to do the word count, and to set up a connector so Dataproc can read the BigQuery data. How do I run this from Datalab?
Upvotes: 1
Views: 244
Reputation: 26458
Here is example PySpark code for a word count on the public BigQuery Shakespeare dataset:
#!/usr/bin/env python
"""BigQuery I/O PySpark example."""
import sys

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master('yarn') \
    .appName('spark-bigquery-demo') \
    .getOrCreate()

# Use the Cloud Storage bucket (passed as the first argument) for temporary
# BigQuery export data used by the connector.
bucket = sys.argv[1]
spark.conf.set('temporaryGcsBucket', bucket)

# Load data from BigQuery.
words = spark.read.format('bigquery') \
    .option('table', 'bigquery-public-data:samples.shakespeare') \
    .load()
words.createOrReplaceTempView('words')

# Perform the word count.
word_count = spark.sql(
    'SELECT word, SUM(word_count) AS word_count FROM words GROUP BY word')
word_count.show()
word_count.printSchema()

# Save the results back to BigQuery.
word_count.write.format('bigquery') \
    .option('table', 'wordcount_dataset.wordcount_output') \
    .save()
You can save the script locally or in a GCS bucket, then submit it to a Dataproc cluster with:
gcloud dataproc jobs submit pyspark --cluster=<cluster> <pyspark-script> -- <bucket>
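Since you are working in Datalab, you can also run that submission command from a notebook cell using the ! shell escape. The bucket, cluster, and region names below are placeholders, and depending on your cluster's image version you may also need to pass the spark-bigquery connector jar via --jars (the gs://spark-lib/bigquery/... path is an assumption based on the publicly hosted connector releases):
!gcloud dataproc jobs submit pyspark gs://your-bucket/wordcount.py \
    --cluster=your-cluster \
    --region=your-region \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar \
    -- your-bucket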
Also see the Dataproc documentation on using the BigQuery connector with Spark for more info.
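Once the job finishes, the aggregated word counts are far smaller than the original 90 GiB and can be pulled into Datalab as a data frame. A minimal sketch, assuming the output table name used in the script above and the google.datalab.bigquery module that ships with Datalab:
import google.datalab.bigquery as bq

# Query the (now much smaller) aggregated table produced by the Spark job.
query = bq.Query(
    'SELECT word, word_count '
    'FROM `wordcount_dataset.wordcount_output` '
    'ORDER BY word_count DESC LIMIT 100')
df = query.execute().result().to_dataframe()
df.head()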
Upvotes: 1