Reputation: 2116
Consider there is a column of array of strings in a table containing categorical data. Is there an easy way to convert this schema so there is number of categories
boolean columns representing binary encoding of that categorical column?
Example:
id type
-------------
1 [A, C]
2 [B, C]
being converted to :
id is_A is_B is_C
1 1 0 1
2 0 1 1
I know I can do this 'by hand', i.e. using:
WITH flat AS (SELECT * FROM t, unnest(type) type),
mid AS (SELECT id, (type='A') as is_A, (type='B') AS is_B, (type='C') as is_C)
SELECT id, SUM(is_A), SUM(is_B), SUM(is_C) FROM mid GROUP BY id
But I am looking for a solution that works when the number of categories is around 1-10K By the way I am using BigQuery SQL.
Upvotes: 0
Views: 3560
Reputation: 173046
looking for a solution that works when the number of categories is around 1-10K
Below is for BigQuery SQL
Step 1 - produce dynamically query (similar to one used in your question - but now it is built dynamically base on you table - yourTable
)
#standardSQL
WITH categories AS (SELECT DISTINCT cat FROM yourTable, UNNEST(type) AS cat)
SELECT CONCAT(
"WITH categories AS (SELECT DISTINCT cat FROM yourTable, UNNEST(type) AS cat), ",
"ids AS (SELECT DISTINCT id FROM yourTable), ",
"pairs AS (SELECT id, cat FROM ids CROSS JOIN categories), ",
"flat AS (SELECT id, cat FROM yourTable, UNNEST(type) cat), ",
"combinations AS ( ",
" SELECT p.id, p.cat AS col, IF(f.cat IS NULL, 0, 1) AS flag ",
" FROM pairs AS p LEFT JOIN flat AS f ",
" ON p.cat = f.cat AND p.id=f.id ",
") ",
"SELECT id, ",
STRING_AGG(CONCAT("SUM(IF(col = '", cat, "', flag, 0)) as is_", cat) ORDER BY cat),
" FROM combinations ",
"GROUP BY id ",
"ORDER BY id"
) as query
FROM categories
Step 2 - copy result of above query, paste it back to Web UI and run Query
I think you've got an idea. Yo can implement it as above purely in SQL or you can generate final query in any client of your choice
I had tried this approach of generating the query (but in Python) the problem is that query can easily reach the 256KB limit of query size in BigQuery
First, let’s see how “easily” it is to reach 256KB limit
Assuming you have 10 chars as average length of category – in this case you can cover about 4750 categories with this approach.
With 20 as average - coverage is about 3480 and for 30 – 2750
If you will "compress" sql a little by removing spaces and AS , etc. you can make it respectively: 5400, 3800, 2970 for respectively 10, 20, 30 chars
So, I would say – Yes/Agree – it most likely reach limit before 5K in real case
So, secondly, let’s see if this is actually a big of a problem!
Just as an example, assume you need 6K categories. Let’s see how you can split this to two batches (assuming that 3K scenario does work as per initial solution)
What we need to do is to split categories to two groups – just based on category names
So first group will be - BETWEEN ‘cat1’ AND ‘cat3000’
And second group will be – BETWEEN ‘cat3001’ AND ‘cat6000’
So, now run both groups with Step1 and Step2 with temp1 and temp2 tables as destination
In Step 1 – add (to the very bottom of query - after FROM categories
WHERE cat BETWEEN ‘cat1’ AND ‘cat3000’
for first batch, and
WHERE cat BETWEEN ‘cat3001’ AND ‘cat6000’
for second batch
Now, proceed to Step 3
Step 3 – Combining partial results
#standardSQL
SELECT * EXCEPT(id2)
FROM temp1 FULL JOIN (
SELECT id AS id2, * EXCEPT(id) FROM temp2
) ON id = id2
-- ORDER BY id
You can test last logic with below simple/dummy data
WITH temp1 AS (
SELECT 1 AS id, 1 AS is_A, 0 AS is_B UNION ALL
SELECT 2 AS id, 0 AS is_A, 1 AS is_B UNION ALL
SELECT 3 AS id, 1 AS is_A, 0 AS is_B
),
temp2 AS (
SELECT 1 AS id, 1 AS is_C, 0 AS is_D UNION ALL
SELECT 2 AS id, 1 AS is_C, 0 AS is_D UNION ALL
SELECT 3 AS id, 0 AS is_C, 1 AS is_D
)
Above can easily be extended to more than just two batches
Hope this helped
Upvotes: 3