Reputation: 31
I'm pretty much new to Apache Beam, so I'm not 100% sure that what I want to do is possible.
Using the DataflowRunner, I'm trying to read data from a Pub/Sub subscription and from a BigQuery query, window both PCollections, and join them by key with CoGroupByKey.
My problem comes when I try to join both rows: after applying the CoGroupByKey transform, the data never seems to arrive at the same time, even though the windowing strategy is the same (30-second fixed windows, an end-of-window trigger and discarding fired panes).
Some relevant chunks of my code:
/* Getting the data and windowing */
PCollection<PubsubMessage> pubsub = p.apply("ReadPubSub sub",
    PubsubIO.readMessages().fromSubscription(SUB_ALIM_REC));

String query = /* The query */

PCollection<TableRow> bqData = p.apply("Reading BQ",
        BigQueryIO.readTableRows().fromQuery(query).usingStandardSql())
    .apply(Window.<TableRow>into(FixedWindows.of(Duration.standardSeconds(30)))
        .triggering(AfterWatermark.pastEndOfWindow())
        .withAllowedLateness(Duration.standardSeconds(0))
        .accumulatingFiredPanes());

PCollection<TableRow> tableRow = pubsub
    .apply(Window.<PubsubMessage>into(FixedWindows.of(Duration.standardSeconds(120)))
        .triggering(AfterWatermark.pastEndOfWindow())
        .withAllowedLateness(Duration.standardSeconds(0))
        .accumulatingFiredPanes())
    .apply("JSON to TableRow", ParDo.of(new ToTableRow()));
/* Join code */
PCollection<TableRow> finalResultCollection =
    kvpCollection.apply("Join TableRows", ParDo.of(
        new DoFn<KV<Long, CoGbkResult>, TableRow>() {
          private static final long serialVersionUID = 6627878974147676533L;

          @ProcessElement
          public void processElement(ProcessContext c) {
            KV<Long, CoGbkResult> e = c.element();
            Long idPaquete = e.getKey();
            Iterable<TableRow> itPaq = e.getValue().getAll(packTag);
            Iterable<TableRow> itAlimRec = e.getValue().getAll(alimTag);
            // Cross-join the rows from both sides that share this key.
            for (TableRow t1 : itPaq) {
              for (TableRow t2 : itAlimRec) {
                TableRow joinedRow = new TableRow();
                /* set the required fields from each collection */
                c.output(joinedRow);
              }
            }
          }
        }));
Also, for the past two days I've been getting this error:
java.io.IOException: Failed to start reading from source: org.apache.beam.runners.core.construction.UnboundedReadFromBoundedSource$BoundedToUnboundedSourceAdapter@2808d228
com.google.cloud.dataflow.worker.WorkerCustomSources$UnboundedReaderIterator.start(WorkerCustomSources.java:783)
com.google.cloud.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.start(ReadOperation.java:360)
com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:193)
com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:158)
com.google.cloud.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:75)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1227)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:135)
com.google.cloud.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:966)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.UnsupportedOperationException: BigQuery source must be split before being read
org.apache.beam.sdk.io.gcp.bigquery.BigQuerySourceBase.createReader(BigQuerySourceBase.java:153)
org.apache.beam.runners.core.construction.UnboundedReadFromBoundedSource$BoundedToUnboundedSourceAdapter$ResidualSource.advance(UnboundedReadFromBoundedSource.java:463)
org.apache.beam.runners.core.construction.UnboundedReadFromBoundedSource$BoundedToUnboundedSourceAdapter$ResidualSource.access$300(UnboundedReadFromBoundedSource.java:442)
org.apache.beam.runners.core.construction.UnboundedReadFromBoundedSource$BoundedToUnboundedSourceAdapter$Reader.advance(UnboundedReadFromBoundedSource.java:293)
org.apache.beam.runners.core.construction.UnboundedReadFromBoundedSource$BoundedToUnboundedSourceAdapter$Reader.start(UnboundedReadFromBoundedSource.java:286)
com.google.cloud.dataflow.worker.WorkerCustomSources$UnboundedReaderIterator.start(WorkerCustomSources.java:778)
com.google.cloud.dataflow.worker.util.common.worker.ReadOperation$SynchronizedReaderIterator.start(ReadOperation.java:360)
com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:193)
com.google.cloud.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:158)
com.google.cloud.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:75)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1227)
com.google.cloud.dataflow.worker.StreamingDataflowWorker.access$1000(StreamingDataflowWorker.java:135)
com.google.cloud.dataflow.worker.StreamingDataflowWorker$6.run(StreamingDataflowWorker.java:966)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
I'd really appreciate some guidance on whether what I'm trying to do is possible, or whether there's an alternative way to handle this scenario.
Upvotes: 3
Views: 1499
Reputation: 121
I have tried to do the same thing. As I understand from this question, it's currently impossible. I tried it on my own using PeriodicImpulse, following this example (though I didn't want a side input). I wrote something like the following code and got ValueError: BigQuery source is not currently available for use in streaming pipelines.
segments = (p
    | 'triggering segments fetch' >> PeriodicImpulse()
    | "loading segments" >> beam.io.Read(beam.io.BigQuerySource(
        use_standard_sql=True,
        query=f'''
            SELECT
              id,
              segment
            FROM `some_table`'''))
    | "windowing segments" >> beam.WindowInto(window.FixedWindows(5)))

info = (p
    | "reading info" >> beam.io.ReadFromPubSub(topic='my_test_topic')
    | "parsing info" >> beam.Map(message_to_json)
    | "mapping info" >> beam.Map(lambda x: (x['id'], x['username']))
    | "windowing info" >> beam.WindowInto(window.FixedWindows(5)))

results = ({'segments': segments, 'info': info}
    | beam.CoGroupByKey()
    | "printing" >> beam.Map(print_out))
I think that the best solution right now is to use external storage like Datastore. I have used that approach in another production pipeline and it works great. You can find the explanation here.
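If it helps, here is a rough Java sketch of that lookup pattern, matching the question's pipeline rather than my Python code; the Segment kind, the idPaquete key and the segment property are placeholders, not the actual schema I used:
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.Key;
import org.apache.beam.sdk.transforms.DoFn;

/* Rough sketch: enrich each streaming TableRow with a Datastore lookup.
   "Segment", "idPaquete" and "segment" are placeholder names. */
class EnrichFromDatastoreFn extends DoFn<TableRow, TableRow> {
  private transient Datastore datastore;

  @Setup
  public void setup() {
    // One client per DoFn instance, created on the worker.
    datastore = DatastoreOptions.getDefaultInstance().getService();
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    TableRow row = c.element().clone();
    Key key = datastore.newKeyFactory()
        .setKind("Segment")                             // placeholder kind
        .newKey((String) row.get("idPaquete"));         // placeholder key field
    Entity segment = datastore.get(key);
    if (segment != null) {
      row.set("segment", segment.getString("segment")); // placeholder property
    }
    c.output(row);
  }
}
The streaming side then just applies ParDo.of(new EnrichFromDatastoreFn()) to the Pub/Sub rows, so there is no need to window and join against a bounded BigQuery read.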
Upvotes: 0