Reputation: 7146
dear all
I started to use BigQuery to analysis data in GAE datastore this month. Firstly, I export data via "Datastore Admin" page of GAE console to Google Cloud Storage. And then, I import the data from Google Cloud Storage to BigQuery. It works very smoothly excepted the repeated structured property. I expected the imported record should be in the format of:
parent:"James",
children: [{
name: "name1",
age: 5,
gender: "M"
}, {
name: "name2",
age: 50,
gender: "F"
}, {
name: "name3",
age: 33,
gender: "M"
},
]
I know how to flatten data in above format. But the actual data format in BigQuery seems in following format:
parent: "James",
children.name:["name1", "name2", "name3"],
children.age:[5, 50, 33],
children.gender:["M", "F", "M"],
I'm wondering if it's possible to flatten above data in BigQuery for further analysis. The ideal format of result table in my mind is:
parentName, children.name, children.age, children.gender
James, name1, 5, "M"
James, name2, 50, "F"
James, name3, 33, "M"
Cheers!
Upvotes: 4
Views: 954
Reputation: 173191
With recently introduced BigQuery Standard SQL - things are so much nicer!
Try below (make sure to uncheck Use Legacy SQL
checkbox under Show Options)
WITH parents AS (
SELECT
"James" AS parentName,
STRUCT(
["name1", "name2", "name3"] AS name,
[5, 50, 33] AS age,
["M", "F", "M"] AS gender
) AS children
)
SELECT
parentName, childrenName, childrenAge, childrenGender
FROM
parents,
UNNEST(children.name) AS childrenName WITH OFFSET AS pos_name,
UNNEST(children.age) AS childrenAge WITH OFFSET AS pos_age,
UNNEST(children.gender) AS childrenGender WITH OFFSET AS pos_gender
WHERE
pos_name = pos_age AND pos_name = pos_gender
Here - original table - parents
- has below data
with respective schema
as
[{
"parentName": "James",
"children": {
"name": ["name1", "name2", "name3"],
"age": ["5", "50", "33" ],
"gender": ["M", "F", "M"]
}
}]
and the output
is
Note: above is solely based on what I see in original question and most likely needs to be adjusted to whatever specific needs you have
Hope this helps in terms of direction to go and where to start with!
Added:
Above Query is using row based CROSS JOINS, meaning all the variations for same parent first assembled and than WHERE clause filters out "wrong" ones.
In contrast, below version, use INNER JOIN to eliminate this "side effect"
WITH parents AS (
SELECT
"James" AS parentName,
STRUCT(
["name1", "name2", "name3"] AS name,
[5, 50, 33] AS age,
["M", "F", "M"] AS gender
) AS children
)
SELECT
parentName, childrenName, childrenAge, childrenGender
FROM
parents, UNNEST(children.name) AS childrenName WITH OFFSET AS pos_name
JOIN UNNEST(children.age) AS childrenAge WITH OFFSET AS pos_age
ON pos_name = pos_age
JOIN UNNEST(children.gender) AS childrenGender WITH OFFSET AS pos_gender
ON pos_age = pos_gender
Intuitively, I would expect second version to be little more efficient for bigger table
Upvotes: 4
Reputation: 26637
You should be able to use the 'large query results' feature to generate a new flattened table. Unfortunately, the syntax is terrifying. The basic principle is that you want to flatten each of the fields and save off the position, then filter where the position is the same. Try something like:
SELECT parentName, children.name, children.age, children.gender,
position(children.name) as name_pos,
position(children.age) as age_pos,
position(children.gender) as gender_pos,
FROM table
SELECT
parent,
children.name,
children.age,
children.gender,
pos
FROM (
SELECT
parent,
children.name,
children.age,
children.gender,
gender_pos,
pos
FROM (
FLATTEN((
SELECT
parent,
children.name,
children.age,
children.gender,
pos,
POSITION(children.gender) as gender_pos
FROM (
SELECT
parent,
children.name,
children.age,
children.gender,
pos,
FROM (
FLATTEN((
SELECT
parent,
children.name,
children.age,
children.gender,
pos,
POSITION(children.age) AS age_pos
FROM (
FLATTEN((
SELECT
parent,
children.name,
children.age,
children.gender,
POSITION(children.name) AS pos
FROM table
),
children.name))),
children.age))
WHERE
age_pos = pos)),
children.gender)))
WHERE
gender_pos = pos;
To allow large results, if you are using the BigQuery UI, you should click the 'advanced options' button, specify a destination table, and check the 'allow large results' flag.
Note that if your data is stored as an entity that has a nested record that looks like {name, age, gender}, we should be transforming this into a nested record in bigquery instead of parallel arrays. I'll look into why this is happening.
Upvotes: 1