George Fields
George Fields

Reputation: 112

Snowflake: How do I update a column with values taken at random from another table?

I've been struggling with this for a while now. Imagine I have these two tables:

CREATE TEMPORARY TABLE tmp_target AS (
  SELECT * FROM VALUES
    ('John', 43, 'm', 17363)
  , ('Mark', 21, 'm', 16354)
  , ('Jean', 25, 'f', 74615)
  , ('Sara', 63, 'f', 26531)
  , ('Alyx', 32, 'f', 42365)
  AS target (name, age, gender, zip)
);

and

CREATE TEMPORARY TABLE tmp_source AS (
  SELECT * FROM VALUES
    ('Cory', 42, 'm', 15156)
  , ('Fred', 51, 'm', 71451)
  , ('Mimi', 22, 'f', 45624)
  , ('Matt', 61, 'm', 12734)
  , ('Olga', 19, 'f', 52462)
  , ('Cleo', 29, 'f', 23352)
  , ('Simm', 31, 'm', 62445)
  , ('Mona', 37, 'f', 23261)
  , ('Feng', 44, 'f', 64335)
  , ('King', 57, 'm', 12225)
  AS source (name, age, gender, zip)
);

I would like to update the tmp_target table by taking 5 rows at random from the tmp_source table for the column(s) I'm interested in. For example, maybe I want to replace all the names with 5 random names from tmp_source, or maybe I want to replace the names and the ages.

My first attempt was this:

UPDATE tmp_target t SET t.name = s.name FROM tmp_source s;

However, when I examine the target table, I notice that quite a few of the names are duplicated, usually in pairs. As well, Snowflake gives me number of rows updated: 5 as well as number of multi-joined rows updated: 5. I believe this is due to the non-deterministic nature of what's happening, possibly as noted in the Snowflake documentation on updates. Not to mention I get the nagging feeling that this is somehow horribly inefficient if the tables had many records.

Then I tried something to grab 5 random rows from the source table:

UPDATE tmp_target t SET t.name = cte.name
FROM (
  WITH upd AS (SELECT name FROM tmp_source SAMPLE ROW (5 ROWS))
  SELECT name FROM upd
) AS cte;

But I seem to run into the exact same issue, both when I examine the target table, and as reported by the number of multi-joined rows. I was wondering if I can use row numbering somehow, but while I can generate row numbers in the subquery, I don't know how to do that in the SET part of the outside query.

I want to add that neither table has any identifiers or indexes that can be used, and I'm looking for a solution that wouldn't require any.

I would very much appreciate it if anyone can provide solutions or ideas that are as clean and tidy as possible, with some consideration given to efficiency (imagine a target table of 100K rows and a source table of 10M rows). Thank you!

Upvotes: 2

Views: 4668

Answers (3)

Rich Murnane
Rich Murnane

Reputation: 2940

Two example as follows, the first uses a temporary table to house the joined data by a rownum, the second include everything in the one query, note I used UPPER and lower case strings to make sure the records were being updated the way I wanted.

CREATE OR REPLACE TEMPORARY TABLE tmp_target AS (
  SELECT * FROM VALUES
    ('John', 43, 'm', 17363)
  , ('Mark', 21, 'm', 16354)
  , ('Jean', 25, 'f', 74615)
  , ('Sara', 63, 'f', 26531)
  , ('Alyx', 32, 'f', 42365)
  AS target (name, age, gender, zip)
);

CREATE OR REPLACE TEMPORARY TABLE tmp_source AS (
  SELECT * FROM VALUES
    ('CORY', 42, 'M', 15156)
  , ('FRED', 51, 'M', 71451)
  , ('MIMI', 22, 'F', 45624)
  , ('MATT', 61, 'M', 12734)
  , ('OLGA', 19, 'F', 52462)
  , ('CLEO', 29, 'F', 23352)
  , ('SIMM', 31, 'M', 62445)
  , ('MONA', 37, 'F', 23261)
  , ('FENG', 44, 'F', 64335)
  , ('KING', 57, 'M', 12225)
  AS source (name, age, gender, zip)
);

CREATE OR REPLACE TEMPORARY TABLE t1 as (
with src as (
SELECT tmp_source.*, row_number() over (order by 1) tmp_id
FROM   tmp_source SAMPLE ROW (5 ROWS)),
     tgt as (
SELECT tmp_target.*, row_number() over (order by 1) tmp_id
FROM   tmp_target SAMPLE ROW (5 ROWS))
SELECT src.name as src_name,
       src.age as src_age,
       src.gender as src_gender,
       src.zip as src_zip,
       src.tmp_id as tmp_id,
       tgt.name as tgt_name,
       tgt.age as tgt_age,
       tgt.gender as tgt_gender,
       tgt.zip as tgt_zip
FROM src, tgt 
WHERE src.tmp_id = tgt.tmp_id);

UPDATE tmp_target a
SET a.name  = b.src_name,
    a.gender = b.src_gender
FROM (SELECT * FROM  t1) b
WHERE a.name = b.tgt_name 
AND   a.age = b.tgt_age
AND   a.gender = b.tgt_gender 
AND   a.zip = b.tgt_zip;




UPDATE tmp_target a
SET a.name  = b.src_name,
    a.gender = b.src_gender
FROM (
with src as (
SELECT tmp_source.*, row_number() over (order by 1) tmp_id
FROM   tmp_source SAMPLE ROW (5 ROWS)),
     tgt as (
SELECT tmp_target.*, row_number() over (order by 1) tmp_id
FROM   tmp_target SAMPLE ROW (5 ROWS))
SELECT src.name as src_name,
       src.age as src_age,
       src.gender as src_gender,
       src.zip as src_zip,
       src.tmp_id as tmp_id,
       tgt.name as tgt_name,
       tgt.age as tgt_age,
       tgt.gender as tgt_gender,
       tgt.zip as tgt_zip
FROM src, tgt 
WHERE src.tmp_id = tgt.tmp_id) b
WHERE a.name = b.tgt_name 
AND   a.age = b.tgt_age
AND   a.gender = b.tgt_gender 
AND   a.zip = b.tgt_zip;

Upvotes: 3

David Garrison
David Garrison

Reputation: 2880

At a first pass, this is all that came to mind. I'm not sure if it suits your example perfectly, since it involves reloading the table.

It should be comparably performant to any other solution that uses a generated rownum. At least to my knowledge, in Snowflake, an update is no more performant than an insert (at least in this case where you're touching every record, and every micropartition, regardless).

INSERT OVERWRITE INTO tmp_target
with target as (
  select 
      age,
      gender,
      zip,
      row_number() over (order by 1) rownum
  from tmp_target
)
,source as (
  select 
      name,
      row_number() over (order by 1) rownum
  from tmp_source
  SAMPLE ROW (5 ROWS)
)

SELECT
    s.name,
    t.age,
    t.gender,
    t.zip
from target t
join source s on t.rownum = s.rownum;

Upvotes: 2

Felipe Hoffa
Felipe Hoffa

Reputation: 59225

I like the two answers already provided, but let me give you a simple answer to solve the simple case:

UPDATE tmp_target t 
SET t.name = (
        select array_agg(s.name) possible_names
        from tmp_source s
    )[uniform(0, 9, random())]
;

The secret of this solution is building an array of possible values, and choosing one at random for each updated row.


Update: Now with a JavaScript UDF that will help us choose each name from source only once

create or replace function incremental_thing()
returns float 
language javascript
as
$$
if (typeof(inc) === "undefined") inc = 0;
return inc++;
$$
;

UPDATE tmp_target t 
SET t.name = (
        select array_agg(s.name) within group (order by random())
        from tmp_source s
    )[incremental_thing()::integer]
;

Note that the JS UDF returns an incremental value each time it’s called, and that helps me choose the next value from a sorted array to use on an update.

Since the value is incremented inside the JS UDF, this will work as long as there's only one JS env involved. To for single-node processing and avoid parallelism choose an XS warehouse and test.

Upvotes: 4

Related Questions