S.Mohsen sh
S.Mohsen sh

Reputation: 2116

Convert a categorical column to binary representation in SQL

Consider there is a column of array of strings in a table containing categorical data. Is there an easy way to convert this schema so there is number of categories boolean columns representing binary encoding of that categorical column?

Example:

id      type
-------------
1       [A, C]
2       [B, C]

being converted to :

id    is_A     is_B    is_C
1     1        0       1
2     0        1       1

I know I can do this 'by hand', i.e. using:

WITH flat AS (SELECT * FROM t, unnest(type) type),
mid AS (SELECT id, (type='A') as is_A, (type='B') AS is_B, (type='C') as is_C)
SELECT id, SUM(is_A), SUM(is_B), SUM(is_C) FROM mid GROUP BY id

But I am looking for a solution that works when the number of categories is around 1-10K By the way I am using BigQuery SQL.

Upvotes: 0

Views: 3560

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173046

looking for a solution that works when the number of categories is around 1-10K

Below is for BigQuery SQL

Step 1 - produce dynamically query (similar to one used in your question - but now it is built dynamically base on you table - yourTable)

#standardSQL
WITH categories AS (SELECT DISTINCT cat FROM yourTable, UNNEST(type) AS cat)
SELECT CONCAT(
  "WITH categories AS (SELECT DISTINCT cat FROM yourTable, UNNEST(type) AS cat), ",
  "ids AS (SELECT DISTINCT id FROM yourTable), ",
  "pairs AS (SELECT id, cat FROM ids CROSS JOIN categories), ",
  "flat AS (SELECT id, cat FROM yourTable, UNNEST(type) cat), ",
  "combinations AS ( ",
  "  SELECT p.id, p.cat AS col, IF(f.cat IS NULL, 0, 1) AS flag ",
  "  FROM pairs AS p LEFT JOIN flat AS f ",
  "  ON p.cat = f.cat AND p.id=f.id ",
  ") ",
  "SELECT id, ",
  STRING_AGG(CONCAT("SUM(IF(col = '", cat, "', flag, 0)) as is_", cat) ORDER BY cat),
  " FROM combinations ",
  "GROUP BY id ",
  "ORDER BY id"
) as query
FROM categories  

Step 2 - copy result of above query, paste it back to Web UI and run Query

I think you've got an idea. Yo can implement it as above purely in SQL or you can generate final query in any client of your choice

I had tried this approach of generating the query (but in Python) the problem is that query can easily reach the 256KB limit of query size in BigQuery

First, let’s see how “easily” it is to reach 256KB limit
Assuming you have 10 chars as average length of category – in this case you can cover about 4750 categories with this approach.
With 20 as average - coverage is about 3480 and for 30 – 2750

If you will "compress" sql a little by removing spaces and AS , etc. you can make it respectively: 5400, 3800, 2970 for respectively 10, 20, 30 chars

So, I would say – Yes/Agree – it most likely reach limit before 5K in real case

So, secondly, let’s see if this is actually a big of a problem!
Just as an example, assume you need 6K categories. Let’s see how you can split this to two batches (assuming that 3K scenario does work as per initial solution)
What we need to do is to split categories to two groups – just based on category names
So first group will be - BETWEEN ‘cat1’ AND ‘cat3000’
And second group will be – BETWEEN ‘cat3001’ AND ‘cat6000’

So, now run both groups with Step1 and Step2 with temp1 and temp2 tables as destination
In Step 1 – add (to the very bottom of query - after FROM categories

WHERE cat BETWEEN ‘cat1’ AND ‘cat3000’   

for first batch, and

WHERE cat BETWEEN ‘cat3001’ AND ‘cat6000’   

for second batch

Now, proceed to Step 3

Step 3 – Combining partial results

#standardSQL
SELECT * EXCEPT(id2)
FROM temp1 FULL JOIN (
  SELECT id AS id2, * EXCEPT(id) FROM temp2
) ON id = id2
-- ORDER BY id

You can test last logic with below simple/dummy data

WITH temp1 AS (
  SELECT 1 AS id, 1 AS is_A, 0 AS is_B UNION ALL   
  SELECT 2 AS id, 0 AS is_A, 1 AS is_B UNION ALL   
  SELECT 3 AS id, 1 AS is_A, 0 AS is_B    
),
temp2 AS (
  SELECT 1 AS id, 1 AS is_C, 0 AS is_D UNION ALL   
  SELECT 2 AS id, 1 AS is_C, 0 AS is_D UNION ALL
  SELECT 3 AS id, 0 AS is_C, 1 AS is_D    
)

Above can easily be extended to more than just two batches

Hope this helped

Upvotes: 3

Related Questions