Reputation: 41
I want to write a Dataflow program (Java and Maven implementation). Here are the steps I want to perform:
1. Dataflow should read a CSV file from Google Cloud Storage. The CSV file is in the following format:

Product Name, Image URL, Category, Description1, Description2
Sakura 30062 6-Piece Pigma Micron Ink Pen Set, https://images-na.ssl-images-amazon.com/images/I/71CkvpG3FEL.SY355.jpg, Arts, Includes 1 of size: #005 (0.20mm)
CCbetter Mini Hot Melt Glue Gun with 25pcs Glue Sticks High Temperature Melting Glue Gun Kit Flexible Trigger for DIY Small Craft Projects&Sealing and Quick Repairs(20-watt, Blue), https://images-na.ssl-images-amazon.com/images/I/61iFrMg4%2B3L.SY355.jpg, Safety and comfortable power switch with LED light mode. With detachable and flexible support to keep the gun stable and upright, With high quality and insulated nozzle there's no deforming of the gun even long-term use under 500℉.
. . . .

2. For each row in the CSV, I need to pick the image URL, run the Vision API on it, and get the top 2 labels (e.g. we get labels L1 and L2 from the Vision API for the first product/row, and L3 and L4 for the second product/row).

3. For each row in the CSV, I need to concatenate product name, category, description1 and description2 and pass the result to the NL API. From the NL API response I need to pick the top 2 entities under the Consumer Goods category (e.g. we get E1 and E2 from the first row, and E3 and E4 from the second row).

4. I need to create the following structure from the retrieved responses:

Product Name, Topic
Sakura 30062 6-Piece Pigma Micron Ink Pen Set, L1
Sakura 30062 6-Piece Pigma Micron Ink Pen Set, L2
Sakura 30062 6-Piece Pigma Micron Ink Pen Set, E1
Sakura 30062 6-Piece Pigma Micron Ink Pen Set, E2
CCbetter Mini Hot Melt Glue Gun with 25pcs Glue Sticks High Temperature Melting Glue Gun Kit Flexible Trigger for DIY Small Craft Projects&Sealing and Quick Repairs(20-watt, Blue), L3
CCbetter Mini Hot Melt Glue Gun with 25pcs Glue Sticks High Temperature Melting Glue Gun Kit Flexible Trigger for DIY Small Craft Projects&Sealing and Quick Repairs(20-watt, Blue), L4
CCbetter Mini Hot Melt Glue Gun with 25pcs Glue Sticks High Temperature Melting Glue Gun Kit Flexible Trigger for DIY Small Craft Projects&Sealing and Quick Repairs(20-watt, Blue), E3
CCbetter Mini Hot Melt Glue Gun with 25pcs Glue Sticks High Temperature Melting Glue Gun Kit Flexible Trigger for DIY Small Craft Projects&Sealing and Quick Repairs(20-watt, Blue), E4
. . . .

5. I want to write this grid (the structure in step 4) to a BigQuery table.
I am new to Dataflow, so any help, code snippet, whole source code, or reference is highly appreciated.
Upvotes: 0
Views: 2650
Reputation: 6130
You should start by reading one of the quick start guides, and taking a look at some of the example pipelines.
Based on your description, a high-level outline might be:
1. TextIO.read to read content from GCS. Note that it doesn't support ignoring the header, so you'll likely need to detect and drop it yourself.
2. DoFn that uses the vision API on the URL from each row of the file. You could even separate this into multiple DoFns -- one to transform the row into a URL, then a DoFn to use the vision API, then a DoFn to extract the top two tags.
3. DoFn or series of DoFns that performs the concatenation and uses the NL API.
4. DoFn or series of DoFns that generate rows with your desired output format as TableRows.
5. BigQueryIO.write transform to write those to BigQuery.

Upvotes: 4
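The per-row logic in the outline above can be sketched in plain Java, independent of the Beam wiring: drop the header, split the row, and fan each row out into (product name, topic) pairs. This is only a rough sketch under several assumptions. The class and method names (RowFanOutSketch, isHeader, parseRow, fanOut, firstTwo) are hypothetical, the header is recognized by the "Product Name," prefix seen in the sample file, and the two Function parameters are stand-ins for the real Vision and NL API calls, which are not shown. The naive comma split also assumes fields contain no commas, which the sample product names and descriptions violate, so real data would likely need a proper CSV parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class RowFanOutSketch {
    // Assumption: the header line starts with this text, as in the sample file.
    static final String HEADER_PREFIX = "Product Name,";

    /** Returns true for the header line, which has to be detected and dropped by hand. */
    static boolean isHeader(String line) {
        return line.trim().startsWith(HEADER_PREFIX);
    }

    /** Naive split on commas, trimming each field. Assumes no commas inside fields. */
    static String[] parseRow(String line) {
        String[] fields = line.split(",", -1);
        for (int i = 0; i < fields.length; i++) {
            fields[i] = fields[i].trim();
        }
        return fields;
    }

    /** Keeps at most the first two entries of an already-ranked list. */
    static List<String> firstTwo(List<String> ranked) {
        return ranked.subList(0, Math.min(2, ranked.size()));
    }

    /**
     * Turns one CSV line into {productName, topic} pairs.
     * labelsForImage and entitiesForText are placeholders for the Vision and
     * NL API lookups; each is expected to return results ranked best-first.
     */
    static List<String[]> fanOut(String line,
                                 Function<String, List<String>> labelsForImage,
                                 Function<String, List<String>> entitiesForText) {
        List<String[]> out = new ArrayList<>();
        if (isHeader(line)) {
            return out; // header produces no output rows
        }
        String[] f = parseRow(line);
        if (f.length < 2) {
            return out; // malformed row: no image URL column
        }
        String name = f[0];
        String imageUrl = f[1];

        // Concatenate the name with every column after the image URL
        // (category and descriptions) to form the text sent to the NL lookup.
        StringBuilder text = new StringBuilder(name);
        for (int i = 2; i < f.length; i++) {
            text.append(' ').append(f[i]);
        }

        for (String label : firstTwo(labelsForImage.apply(imageUrl))) {
            out.add(new String[] {name, label});
        }
        for (String entity : firstTwo(entitiesForText.apply(text.toString()))) {
            out.add(new String[] {name, entity});
        }
        return out;
    }

    public static void main(String[] args) {
        // Fake lookups so the sketch runs without any cloud credentials.
        Function<String, List<String>> fakeVision = url -> Arrays.asList("L1", "L2", "L9");
        Function<String, List<String>> fakeNl = text -> Arrays.asList("E1", "E2");

        String row = "Pen Set, http://example.com/a.jpg, Arts, Fine tip";
        for (String[] pair : fanOut(row, fakeVision, fakeNl)) {
            System.out.println(pair[0] + ", " + pair[1]);
        }
    }
}
```

In an actual pipeline, each of these methods would presumably live inside the @ProcessElement method of a DoFn, with the two stub functions replaced by real API client calls and each pair converted to a TableRow before the BigQuery write.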