Reputation: 16981
I'm trying to extract mail addresses after a token 'eaddr:'. So it would match the all occurrences in line entries, first consecutive string without spaces after that token: I tried:
SELECT regexp_substr(tab.entry, 'eaddr:\(.*?\)',1,1,'e',1)
from (
select 'String, [email protected]' as entry
union
select 'eaddr:[email protected] eaddr:[email protected] sometext eaddr: [email protected] some4354% text' as entry
union
select 'eaddr:[email protected]' as entry
union
select 'Just a string' as entry
) tab
;
but it does not work. The correct result set is:
null
[email protected] [email protected] [email protected]
[email protected]
null
Upvotes: 0
Views: 403
Reputation: 602
Using Javascript UDF
create or replace function ext_mail(col VARCHAR)
returns varchar
language javascript
as
$$
var y = COL.match(/(?!eaddr):(\s+)?\w+@\w+/g);
if (y) {
ext_out = y.join(' ');
return ext_out.replace(/:|\s+/g,' ')
}
else return 'NULL'
$$
;
with t as (
select 'String, [email protected]' as entry
union
select 'eaddr:[email protected] eaddr:[email protected] sometext eaddr: [email protected] some4354% text' as entry
union
select 'eaddr:[email protected]' as entry
union
select 'Just a string' as entry
) select ext_mail(ENTRY) from t;
Upvotes: 0
Reputation: 10039
First of all, I suggest using a better regex to verify the email format. I am inspired by Gordon's SPLIT_TO_TABLE + LATERAL approach, and wrote some sample queries to fetch those emails from the entries.
If you want to get all the emails together, you can use this one:
with t as (
select 'String, [email protected]' as entry
union
select 'eaddr:[email protected] eaddr:[email protected] sometext eaddr: [email protected] some4354% text' as entry
union
select 'eaddr:[email protected]' as entry
union
select 'Just a string' as entry
)
Select LISTAGG( regexp_substr( s.value, '[A-Z0-9a-z._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,64}' ) ,' ' ) emails from t,
lateral SPLIT_TO_TABLE(t.entry, 'eaddr:') s
where s.seq > 1;
+---------------------------------------------------------------------+
| EMAILS |
+---------------------------------------------------------------------+
| [email protected] [email protected] [email protected] [email protected] |
+---------------------------------------------------------------------+
To get the exact result in your question, you can use the following query:
with t as (
select 'String, [email protected]' as entry
union
select 'eaddr:[email protected] eaddr:[email protected] sometext eaddr: [email protected] some4354% text' as entry
union
select 'eaddr:[email protected]' as entry
union
select 'Just a string' as entry
)
select emails from
(
Select t.entry, s.*,
LISTAGG( regexp_substr( IFF(s.seq = 1, '', s.value ), '[A-Z0-9a-z._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,64}' ) ,' ' )
OVER ( PARTITION BY s.seq ) emails
from t,
lateral SPLIT_TO_TABLE(t.entry, ' ') s )
where index = 1;
+----------------------------------------------------+
| EMAILS |
+----------------------------------------------------+
| NULL |
| [email protected] [email protected] [email protected] |
| NULL |
| [email protected] |
+----------------------------------------------------+
Upvotes: 2
Reputation: 1269463
You need to split the strings, extract the emails, and then reaggregate. I don't have Snowflake on hand, but this or something similar should do:
select t.*, s.emails
from t left join lateral
(select list_agg(split(s.value, ' ')), ' ') as emails
from table(string_split_to_table(t.entry, 'eaddr:')) as s
) s;
I'm not 100% sure that Snowflake supports multiple-character delimiters, for instance. If that is the case, you can use:
select t.*, s.emails
from t left join lateral
(select list_agg(substr(s.value, 7), ' ') as emails
from table(string_split_to_table(t.entry, ' ')) as s
where value like 'eaddr:%'
) s;
Upvotes: 0
Reputation: 1086
As far as I know, you can return only one match at a time from REGEXP_SUBSTR. The code below:
with tab(entry) as (
select 'String, [email protected]' from dual
union
select 'eaddr:[email protected] eaddr:[email protected] sometext eaddr: [email protected] some4354% text' from dual
union
select 'eaddr:[email protected]' from dual
union
select 'Just a string' from dual
)
SELECT
regexp_substr(entry, 'eaddr:\s*(\S*)\s*',1,1,'i',1)
|| coalesce(' ' || regexp_substr(entry, 'eaddr:\s*(\S*)\s*',1,2,'i', 1), '')
|| coalesce(' ' || regexp_substr(entry, 'eaddr:\s*(\S*)\s*',1,3,'i', 1), '') as match,
regexp_count(entry, 'eaddr:\s*(\S*)\s*') as nmatches
from tab
gives the result below (using Oracle). You can use REGEXP_COUNT as shown to get the number of matches. If there are more than 3 email addresses, you can add more || coalesce(
lines as needed.
P.S. I'm not sure what the 'e' flag does in your example. I'm guessing that is a Snowflake-specific value.
Upvotes: 0