Chirag Giri

Reputation: 41

Dataflow pipeline reading CSV from GCS and writing to BigQuery with calls to Vision and NL API

I want to write a Dataflow program (Java and Maven implementation). Here are the steps I want to perform:

  1. Dataflow should read a CSV file from Google Cloud Storage. The CSV file is in the following format:

    Product Name, Image URL, Category, Description1, Description2

    Sakura 30062 6-Piece Pigma Micron Ink Pen Set, https://images-na.ssl-images-amazon.com/images/I/71CkvpG3FEL.SY355.jpg, Arts, Includes 1 of size: #005 (0.20mm)

    CCbetter Mini Hot Melt Glue Gun with 25pcs Glue Sticks High Temperature Melting Glue Gun Kit Flexible Trigger for DIY Small Craft Projects&Sealing and Quick Repairs(20-watt, Blue), https://images-na.ssl-images-amazon.com/images/I/61iFrMg4%2B3L.SY355.jpg, Safety and comfortable power switch with LED light mode. With detachable and flexible support to keep the gun stable and upright, With high quality and insulated nozzle there's no deforming of the gun even long-term use under 500℉.

    . . . .

  2. For each row in the CSV, I need to take the image URL, call the Vision API and keep the top 2 labels (e.g. labels L1 and L2 for the first product/row, and L3 and L4 for the second product/row).

  3. For each row in the CSV, I need to concatenate product name, category, description1 and description2 and pass the result to the NL API. From the NL API response I need to pick the top 2 entities in the Consumer Goods category (e.g. E1 and E2 for the first row, and E3 and E4 for the second row).

  4. I need to create the following structure from the retrieved responses:

    Product Name, Topic
    Sakura 30062 6-Piece Pigma Micron Ink Pen Set, L1
    Sakura 30062 6-Piece Pigma Micron Ink Pen Set, L2
    Sakura 30062 6-Piece Pigma Micron Ink Pen Set, E1
    Sakura 30062 6-Piece Pigma Micron Ink Pen Set, E2

    CCbetter Mini Hot Melt Glue Gun with 25pcs Glue Sticks High Temperature Melting Glue Gun Kit Flexible Trigger for DIY Small Craft Projects&Sealing and Quick Repairs(20-watt, Blue), L3
    CCbetter Mini Hot Melt Glue Gun with 25pcs Glue Sticks High Temperature Melting Glue Gun Kit Flexible Trigger for DIY Small Craft Projects&Sealing and Quick Repairs(20-watt, Blue), L4
    CCbetter Mini Hot Melt Glue Gun with 25pcs Glue Sticks High Temperature Melting Glue Gun Kit Flexible Trigger for DIY Small Craft Projects&Sealing and Quick Repairs(20-watt, Blue), E3
    CCbetter Mini Hot Melt Glue Gun with 25pcs Glue Sticks High Temperature Melting Glue Gun Kit Flexible Trigger for DIY Small Craft Projects&Sealing and Quick Repairs(20-watt, Blue), E4

    . . . .

  5. I want to write this grid (the structure in step 4) to a BigQuery table.

I am new to Dataflow, so any help, code snippet, complete source code or reference is highly appreciated.

Upvotes: 0

Views: 2650

Answers (1)

Ben Chambers

Reputation: 6130

You should start by reading one of the quick start guides, and taking a look at some of the example pipelines.

Based on your description, a high-level outline might be:

  1. Use TextIO.read to read content from GCS. Note that it doesn't support ignoring the header, so you'll likely need to detect and drop it yourself.
  2. Write a DoFn that uses the vision API on the URL from each row of the file. You could even separate this into multiple DoFns -- one to transform the row into a URL, then a DoFn to use the vision API, then a DoFn to extract the top two tags.
  3. Write another DoFn or series of DoFns that performs the concatenation and uses the NL API.
  4. Write another DoFn or series of DoFns that generate rows with your desired output format as TableRows.
  5. Use a BigQueryIO.write transform to write those to BigQuery.
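
To make steps 1, 2 and 4 a bit more concrete, here is a minimal sketch of the per-row logic in plain Java, with no Dataflow or API dependencies so it can be tried on its own. The class and method names (`ProductRows`, `isHeader`, `imageUrl`, `productName`, `toOutputRows`) are hypothetical, and the URL-based parsing is only a heuristic I am assuming because the sample product names contain unquoted commas. The Vision and NL API calls are left out entirely; the topics are assumed to arrive as a plain list of strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ProductRows {
    // Assumption: the header line starts with the first column name from the sample file.
    private static final String HEADER_START = "Product Name,";

    /** Step 1: every line of the file is read, so the header has to be recognised and skipped. */
    static boolean isHeader(String line) {
        return line.trim().startsWith(HEADER_START);
    }

    /**
     * Step 2 input: returns the first comma-separated field that looks like a URL, or null.
     * Heuristic only: product names may contain commas, so column positions are not reliable.
     */
    static String imageUrl(String line) {
        for (String field : line.split(",")) {
            String candidate = field.trim();
            if (candidate.startsWith("http://") || candidate.startsWith("https://")) {
                return candidate;
            }
        }
        return null;
    }

    /** Product name: everything before the URL field (same heuristic as imageUrl). */
    static String productName(String line) {
        String url = imageUrl(line);
        if (url == null) {
            return null;
        }
        String before = line.substring(0, line.indexOf(url)).trim();
        // Drop the comma that separated the name from the URL.
        return before.endsWith(",") ? before.substring(0, before.length() - 1).trim() : before;
    }

    /** Step 4: fan one product out into one (product name, topic) pair per label or entity. */
    static List<String[]> toOutputRows(String productName, List<String> topics) {
        List<String[]> rows = new ArrayList<>();
        for (String topic : topics) {
            rows.add(new String[] {productName, topic});
        }
        return rows;
    }

    public static void main(String[] args) {
        String row = "Glue Gun Kit (20-watt, Blue), https://example.com/a.jpg, Arts, Some text";
        // L1, L2, E1, E2 stand in for the API results described in the question.
        for (String[] out : toOutputRows(productName(row), Arrays.asList("L1", "L2", "E1", "E2"))) {
            System.out.println(out[0] + " | " + out[1]);
        }
    }
}
```

In a real pipeline these helpers would be called from inside the DoFns outlined above, and each pair from `toOutputRows` would be turned into a `TableRow` before the BigQuery write. If the file can be regenerated with properly quoted fields, a real CSV parser would be a safer choice than the URL heuristic.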

Upvotes: 4
