moirK
moirK

Reputation: 681

Write repeated Strings to BigQuery using Apache Beam

I have a data stream containing Strings which look like JSONArrays. I want to parse those Strings and write to BigQuery table using Apache Beam but am getting an error while writing repeated Strings.

Here´s how I convert my string to TableRow:

    String dataString = "[{\"EMAIL\": [\"[email protected]\"]}]";

    JSONArray jsonArray = new JSONArray(dataString);
    TableRow tableRow = new TableRow();

    for (int i = 0; i < jsonArray.length(); i++) {
      JSONArray emailArray = new JSONArray(jsonArray.getJSONObject(i).get("EMAIL").toString());

      tableRow.set("EMAIL", emailArray); //Results in error
    }

Here´s what my BigQuery schema looks like:

[
  {
    "name": "EMAIL",
    "type": "STRING",
    "mode": "REPEATED"
  }
]

I have managed to write a similar repeated String to BigQuery table using Python but unable to do it using Apache Beam. I suppose I am not saving the right key-value pair in TableRow. The error I am getting now is:

java.io.IOException: Insert failed: [{"errors":[{"debugInfo":"","location":"email","message":"This field is not a record.","reason":"invalid"}],"index":0}]

I need help regarding how to save a similar repeated String to BigQuery without creating a record and would appreciate any advice or suggestions. Thanks in advance.

Upvotes: 3

Views: 865

Answers (1)

Jonny5
Jonny5

Reputation: 1499

It seems you want to create

  1. one row with a concatenated String of email addresses, or
  2. a row per email, or
  3. one row with a repeated field.

Note that is seems your ValidFrom field is of type STRING, not a repeated field, unless it is wrapped in a repeated field in a hierarchical schema.

In the example code you provided, you are creating a JSONArray and putting it into the STRING field, which I think cause issues as the types are incompatible. If you want to keep it as a plain STRING field, you can use Solution 1 below.

Also make sure that the name of your column in BigQuery matches the one in your code, I see you use both ValidFrom and EMAIL (might be a mistake in your posted code though).

Solution 1: One row with String field

In case you want to add one row with a concatenated String field in BigQuery, you can use the following:

// Initialize your final row
TableRow tableRow = new TableRow();

// Find email addresses
String [] emails = ... // your extraction logic

// Build a concatenated string of emails
String allEmails = String.join(";", emails);

// Add the string field to the row
tableRow.set('EMAILS', allEmails);

Solution 2: Multiple rows with String field

In case you want to insert multiple rows, you your create multiple table rows:

// Find email addresses
String [] emails = ... // your extraction logic

// Build a row per email
for(String email: emails) {
    // Initialize your final row
    TableRow tableRow = new TableRow();
    tableRow.set('EMAIL', email);
    
    // TODO: do something with the row (add to list, or ...)
}

Solution 3: One row with REPEATED field

In case you want to add one row with a REPEATED STRING field in BigQuery, you can use the following:

// Initialize your final row
TableRow tableRow = new TableRow();

// Find email addresses
String [] emails = ... // your extraction logic

// Build the repeated field
List<String> emailCells = new ArrayList<>();
for(String email: emails) {
    emailCells.add(email);
}

// Add the repeated field to the row
tableRow.set('EMAILS', emailCells);

If this is not what you're aiming for, please provide some more details.

Upvotes: 2

Related Questions