Steve Scherer
Steve Scherer

Reputation: 345

Can SPLIT be used with multiple delimiters?

I'd like to use SPLIT function to do word analysis of various text entries, in this example git commit comments. Typically a word is delimited by a space, but I'd also like to include commas, semicolons, colons, periods, question marks, exclamation marks, tabs, new lines in the delimiter list. Essentially use a REGEX pattern to specify delimiter characters and if any of them is found, treat it as a delimiter.

For example:

SELECT 
    split(commit_message, " ") as words,
FROM [project:dataset.table] 
LIMIT 1000

If the input data was something like:

"Commit message XYX:  Hello.  This is a test.  This is a fun test!  First, we'll run a test, then we'll check the results.  A test is currently running."

I would expect if we do a GROUP BY words that the word "test" would have a COUNT of of 4, but using the query above test only gets counted once. It would be great if the delimiter field accepted a REGEXP similar to below, but I don't think this is available, or the syntax isn't published.

SELECT 
    split(commit_message, "[\W]+") as words,
FROM [project:dataset.table] 
LIMIT 1000

In the example above if one or more non-word characters was detected, these would all be treated as delimiters. If this feature doesn't exist, could it be considered for a future improvement? At this point in time, I need to take the results in the "words" column and strip all non-word characters to get what I want. (see below)

SELECT
    LOWER(REGEXP_EXTRACT(words, r'(\w+)')) as words
FROM
    (
    SELECT 
        split(commit_message, " ") as words,
    FROM [project:dataset.table] 
    )
LIMIT 1000

I appreciate it if you have suggestions to avoid this additional step of extracting non-word characters.

Upvotes: 3

Views: 9880

Answers (2)

Felipe Hoffa
Felipe Hoffa

Reputation: 59185

UPDATE: See full article at http://www.reddit.com/r/bigquery/comments/2kqe4g/words_that_these_developers_say_that_others_dont/


What @sprocket said: Use REGEX_REPLACE first, then SPLIT().

See http://www.reddit.com/r/bigquery/comments/2ep8np/mining_the_top_news_words_for_each_day_with_gdelt for a similar analysis.

A working query, what Python developers says that JavaScript developers don't say:

SELECT word, c 
FROM (
  SELECT word, COUNT(*) c
  FROM (
    SELECT SPLIT(msg, ' ') word
    FROM (
      SELECT REGEXP_REPLACE(LOWER(payload_commit_msg), r'[^a-z]', ' ') msg
      FROM [githubarchive:github.timeline]
      WHERE
        repository_language == 'Python'
        AND payload_commit_msg != ''
      GROUP EACH BY msg
    )
  )
  GROUP BY word
  ORDER BY c DESC
  LIMIT 500
)
WHERE word NOT IN (
  SELECT x FROM (SELECT word x, COUNT(*) c
  FROM (
    SELECT SPLIT(msg, ' ') word
    FROM (
      SELECT REGEXP_REPLACE(LOWER(payload_commit_msg), r'[^a-z]', ' ') msg
      FROM [githubarchive:github.timeline]
      WHERE
        repository_language == 'JavaScript'
        AND payload_commit_msg != ''
      GROUP EACH BY msg
    )
  )
  GROUP BY x
  ORDER BY c DESC
  LIMIT 1000)
);

See full article at http://www.reddit.com/r/bigquery/comments/2kqe4g/words_that_these_developers_say_that_others_dont/

Upvotes: 2

sprocket
sprocket

Reputation: 1277

The SPLIT function only accepts constant strings as delimiters. There's no hidden syntax for regular expression delimiters.

An alternative that you could try is using REGEXP_REPLACE to replace all of your desired delimiters with spaces or any single delimiter, like this:

SPLIT(REGEXP_REPLACE(message, ",|;|:|\\.|\\?|!|\t|\n", " "), " ")

Upvotes: 6

Related Questions