I'm trying to extract mail addresses after a token 'eaddr:'. So it would match the all occurrences in line entries, first consecutive string without spaces after that token: I tried: SELECT regexp_substr(tab.entry, 'eaddr:\(.*?\)',1,1,'e',1) from ( select 'String, email@domain.com' as entry union select 'eaddr:mail1@domain.com eaddr:mail2@domain.com sometext eaddr: mail3@domain.com some4354% text' as entry union select 'eaddr:mail5@domain.org' as entry union select 'Just a string' as entry ) tab ; but it does not work. The correct result set is: null mail1@domain.com mail2@domain.com mail3@domain.com mail5@domain.org null

First of all, I suggest using a better regex to verify the email format. I am inspired by Gordon's SPLIT_TO_TABLE + LATERAL approach, and wrote some sample queries to fetch those emails from the entries. If you want to get all the emails together, you can use this one: with t as ( select 'String, email@domain.com' as entry union select 'eaddr:mail1@domain.com eaddr:mail2@domain.com sometext eaddr: mail3@domain.com some4354% text' as entry union select 'eaddr:mail5@domain.org' as entry union select 'Just a string' as entry ) Select LISTAGG( regexp_substr( s.value, '[A-Z0-9a-z._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,64}' ) ,' ' ) emails from t, lateral SPLIT_TO_TABLE(t.entry, 'eaddr:') s where s.seq > 1; +---------------------------------------------------------------------+ | EMAILS | +---------------------------------------------------------------------+ | mail1@domain.com mail2@domain.com mail3@domain.com mail5@domain.org | +---------------------------------------------------------------------+ To get the exact result in your question, you can use the following query: with t as ( select 'String, email@domain.com' as entry union select 'eaddr:mail1@domain.com eaddr:mail2@domain.com sometext eaddr: mail3@domain.com some4354% text' as entry union select 'eaddr:mail5@domain.org' as entry union select 'Just a string' as entry ) select emails from ( Select t.entry, s.*, LISTAGG( regexp_substr( IFF(s.seq = 1, '', s.value ), '[A-Z0-9a-z._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,64}' ) ,' ' ) OVER ( PARTITION BY s.seq ) emails from t, lateral SPLIT_TO_TABLE(t.entry, ' ') s ) where index = 1; +----------------------------------------------------+ | EMAILS | +----------------------------------------------------+ | NULL | | mail1@domain.com mail2@domain.com mail3@domain.com | | NULL | | mail5@domain.org | +----------------------------------------------------+

sqlregexsnowflake-cloud-data-platform

jrara

Reputation: 16981

How to match a string after a token (using regex)?

I'm trying to extract mail addresses after a token 'eaddr:'. So it would match the all occurrences in line entries, first consecutive string without spaces after that token: I tried:

SELECT regexp_substr(tab.entry, 'eaddr:\(.*?\)',1,1,'e',1)
from (
select 'String, [email protected]' as entry
union
select 'eaddr:[email protected] eaddr:[email protected] sometext     eaddr:   [email protected] some4354% text' as entry
union
select 'eaddr:[email protected]' as entry
union
select 'Just a string' as entry
) tab
;

but it does not work. The correct result set is:

null
[email protected] [email protected] [email protected]
[email protected]
null

Upvotes: 0

Answers (4)

PIG

Reputation: 602

Using Javascript UDF

    create or replace function ext_mail(col VARCHAR)
                                      returns varchar 
                                      language javascript
                                      as
                                      $$
                                      var y = COL.match(/(?!eaddr):(\s+)?\w+@\w+/g);
                                      if (y) {
                                      ext_out =  y.join(' ');
                                      return ext_out.replace(/:|\s+/g,' ') 
                                      }
                                      else return 'NULL'
                                      $$
                                      ;



with t as (
                                  select 'String, [email protected]' as entry
                                  union
                                  select 'eaddr:[email protected] eaddr:[email protected] sometext     eaddr:   [email protected] some4354% text' as entry
                                  union
                                  select 'eaddr:[email protected]' as entry
                                  union
                                  select 'Just a string' as entry
                                  ) select ext_mail(ENTRY) from t;

Upvotes: 0

Gokhan Atil

Reputation: 10039

First of all, I suggest using a better regex to verify the email format. I am inspired by Gordon's SPLIT_TO_TABLE + LATERAL approach, and wrote some sample queries to fetch those emails from the entries.

If you want to get all the emails together, you can use this one:

with t as (
select 'String, [email protected]' as entry
union
select 'eaddr:[email protected] eaddr:[email protected] sometext     eaddr:   [email protected] some4354% text' as entry
union
select 'eaddr:[email protected]' as entry
union
select 'Just a string' as entry
) 
Select LISTAGG( regexp_substr( s.value, '[A-Z0-9a-z._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,64}' ) ,' ' ) emails from t,
lateral SPLIT_TO_TABLE(t.entry, 'eaddr:') s
where s.seq > 1;

+---------------------------------------------------------------------+
|                               EMAILS                                |
+---------------------------------------------------------------------+
| [email protected] [email protected] [email protected] [email protected] |
+---------------------------------------------------------------------+

To get the exact result in your question, you can use the following query:

with t as (
select 'String, [email protected]' as entry
union
select 'eaddr:[email protected] eaddr:[email protected] sometext     eaddr:   [email protected] some4354% text' as entry
union
select 'eaddr:[email protected]' as entry
union
select 'Just a string' as entry
) 
select emails from 
(
Select t.entry, s.*, 
LISTAGG( regexp_substr( IFF(s.seq = 1, '', s.value ), '[A-Z0-9a-z._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,64}' ) ,' ' ) 
OVER ( PARTITION BY s.seq ) emails
from t,
lateral SPLIT_TO_TABLE(t.entry, ' ') s ) 
where index = 1;

+----------------------------------------------------+
|                       EMAILS                       |
+----------------------------------------------------+
| NULL                                               |
| [email protected] [email protected] [email protected] |
| NULL                                               |
| [email protected]                                   |
+----------------------------------------------------+

Upvotes: 2

Gordon Linoff

Reputation: 1269463

You need to split the strings, extract the emails, and then reaggregate. I don't have Snowflake on hand, but this or something similar should do:

select t.*, s.emails
from t left join lateral
     (select list_agg(split(s.value, ' ')), ' ') as emails
      from table(string_split_to_table(t.entry, 'eaddr:')) as s
     ) s;

I'm not 100% sure that Snowflake supports multiple-character delimiters, for instance. If that is the case, you can use:

select t.*, s.emails
from t left join lateral
     (select list_agg(substr(s.value, 7), ' ') as emails
      from table(string_split_to_table(t.entry, ' ')) as s
      where value like 'eaddr:%'
     ) s;

Upvotes: 0

user9601310

Reputation: 1086

As far as I know, you can return only one match at a time from REGEXP_SUBSTR. The code below:

with tab(entry) as  (
select 'String, [email protected]' from dual
union
select 'eaddr:[email protected] eaddr:[email protected] sometext     eaddr:   [email protected] some4354% text'  from dual
union
select 'eaddr:[email protected]'  from dual
union
select 'Just a string'  from dual
) 
SELECT
      regexp_substr(entry, 'eaddr:\s*(\S*)\s*',1,1,'i',1)
      || coalesce(' ' || regexp_substr(entry, 'eaddr:\s*(\S*)\s*',1,2,'i', 1), '')
      || coalesce(' ' || regexp_substr(entry, 'eaddr:\s*(\S*)\s*',1,3,'i', 1), '') as match,
      regexp_count(entry, 'eaddr:\s*(\S*)\s*') as nmatches
      from tab

gives the result below (using Oracle). You can use REGEXP_COUNT as shown to get the number of matches. If there are more than 3 email addresses, you can add more || coalesce( lines as needed.

P.S. I'm not sure what the 'e' flag does in your example. I'm guessing that is a Snowflake-specific value.

Upvotes: 0

How to match a string after a token (using regex)?

Answers (4)

Related Questions