angsty_robot
angsty_robot

Reputation: 315

BigQuery Left join on list of strings returns null

I couldn't come up with a good title for this question. Sorry about that.

From a collection of emails, I want to check which emails are from a list of unknown users. However, when I do a left join like below, the join doesn't seem to match and I get NULL values in tb2.email instead.

#StandardSQL 
with unknown_addresses AS (
  select '[email protected]' AS email
  union all
  select '[email protected]' AS email
  union all
  select '[email protected]' AS email
  union all
  select '[email protected]' AS email
  union all
  select '[email protected]' AS email
),

all_emails_received as (
  SELECT
    SUBSTR(REGEXP_EXTRACT(mystring, r"[a-z.@\-]+"), 4) as email,
    content
  FROM
    foobar 
)

SELECT tb1.email, tb1.content, tb2.email AS unknown_email 
FROM all_emails_received tb1 
  LEFT JOIN unknown_addresses tb2 
    ON tb1.email = tb2.email

How do I make the join condition work?

I don't want to create a new table just to store unknown_addresses.

(Edit) Added sample of raw data

| email           | content     | 
| --------------- | ----------- | 
| [email protected] | Hello there | 
| [email protected]  | test test   | 
| [email protected]   | foobar      |  

What I'm currently getting

| email           | content     | unknown_email   | 
| --------------- | ----------- | --------------- |
| [email protected] | Hello there | null            |
| [email protected]  | test test   | null            |   
| [email protected]   | foobar      | null            | 

Desired outcome

| email           | content     | unknown_email   | 
| --------------- | ----------- | --------------- |
| [email protected] | Hello there | [email protected] | 
| [email protected]  | test test   | null            |   
| [email protected]   | foobar      | null            | 

Edit 2: Still not working. I added the substr and regexp filter in there. Not sure if that's what's affecting my result.

Upvotes: 0

Views: 90

Answers (1)

Alexandre Moraes
Alexandre Moraes

Reputation: 4042

I was able to reproduce you error with you raw data. I realized that the problem was in you REGEX EXP. Therefore, it should be:

REGEXP_EXTRACT(email, r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.+[a-zA-Z0-9-.]+$") as email

Changing the code with the above line, you should be able to retrieve the desired output.

I am glad I could help you.

Upvotes: 1

Related Questions