Can SPLIT be used with multiple delimiters?

Question

I'd like to use SPLIT function to do word analysis of various text entries, in this example git commit comments. Typically a word is delimited by a space, but I'd also like to include commas, semicolons, colons, periods, question marks, exclamation marks, tabs, new lines in the delimiter list. Essentially use a REGEX pattern to specify delimiter characters and if any of them is found, treat it as a delimiter.

For example:

SELECT 
    split(commit_message, " ") as words,
FROM [project:dataset.table] 
LIMIT 1000

If the input data was something like:

"Commit message XYX:  Hello.  This is a test.  This is a fun test!  First, we'll run a test, then we'll check the results.  A test is currently running."

I would expect if we do a GROUP BY words that the word "test" would have a COUNT of of 4, but using the query above test only gets counted once. It would be great if the delimiter field accepted a REGEXP similar to below, but I don't think this is available, or the syntax isn't published.

SELECT 
    split(commit_message, "[\W]+") as words,
FROM [project:dataset.table] 
LIMIT 1000

In the example above if one or more non-word characters was detected, these would all be treated as delimiters. If this feature doesn't exist, could it be considered for a future improvement? At this point in time, I need to take the results in the "words" column and strip all non-word characters to get what I want. (see below)

SELECT
    LOWER(REGEXP_EXTRACT(words, r'(\w+)')) as words
FROM
    (
    SELECT 
        split(commit_message, " ") as words,
    FROM [project:dataset.table] 
    )
LIMIT 1000

I appreciate it if you have suggestions to avoid this additional step of extracting non-word characters.

sprocket · Accepted Answer

The SPLIT function only accepts constant strings as delimiters. There's no hidden syntax for regular expression delimiters.

An alternative that you could try is using REGEXP_REPLACE to replace all of your desired delimiters with spaces or any single delimiter, like this:

SPLIT(REGEXP_REPLACE(message, ",|;|:|\.|\?|!|	|
", " "), " ")

Can SPLIT be used with multiple delimiters?

Answers (2)

Related Questions