Reputation: 345
I'd like to use SPLIT function to do word analysis of various text entries, in this example git commit comments. Typically a word is delimited by a space, but I'd also like to include commas, semicolons, colons, periods, question marks, exclamation marks, tabs, new lines in the delimiter list. Essentially use a REGEX pattern to specify delimiter characters and if any of them is found, treat it as a delimiter.
For example:
SELECT
split(commit_message, " ") as words,
FROM [project:dataset.table]
LIMIT 1000
If the input data was something like:
"Commit message XYX: Hello. This is a test. This is a fun test! First, we'll run a test, then we'll check the results. A test is currently running."
I would expect if we do a GROUP BY words that the word "test" would have a COUNT of of 4, but using the query above test only gets counted once. It would be great if the delimiter field accepted a REGEXP similar to below, but I don't think this is available, or the syntax isn't published.
SELECT
split(commit_message, "[\W]+") as words,
FROM [project:dataset.table]
LIMIT 1000
In the example above if one or more non-word characters was detected, these would all be treated as delimiters. If this feature doesn't exist, could it be considered for a future improvement? At this point in time, I need to take the results in the "words" column and strip all non-word characters to get what I want. (see below)
SELECT
LOWER(REGEXP_EXTRACT(words, r'(\w+)')) as words
FROM
(
SELECT
split(commit_message, " ") as words,
FROM [project:dataset.table]
)
LIMIT 1000
I appreciate it if you have suggestions to avoid this additional step of extracting non-word characters.
Upvotes: 3
Views: 9880
Reputation: 59185
UPDATE: See full article at http://www.reddit.com/r/bigquery/comments/2kqe4g/words_that_these_developers_say_that_others_dont/
What @sprocket said: Use REGEX_REPLACE first, then SPLIT().
See http://www.reddit.com/r/bigquery/comments/2ep8np/mining_the_top_news_words_for_each_day_with_gdelt for a similar analysis.
A working query, what Python developers says that JavaScript developers don't say:
SELECT word, c
FROM (
SELECT word, COUNT(*) c
FROM (
SELECT SPLIT(msg, ' ') word
FROM (
SELECT REGEXP_REPLACE(LOWER(payload_commit_msg), r'[^a-z]', ' ') msg
FROM [githubarchive:github.timeline]
WHERE
repository_language == 'Python'
AND payload_commit_msg != ''
GROUP EACH BY msg
)
)
GROUP BY word
ORDER BY c DESC
LIMIT 500
)
WHERE word NOT IN (
SELECT x FROM (SELECT word x, COUNT(*) c
FROM (
SELECT SPLIT(msg, ' ') word
FROM (
SELECT REGEXP_REPLACE(LOWER(payload_commit_msg), r'[^a-z]', ' ') msg
FROM [githubarchive:github.timeline]
WHERE
repository_language == 'JavaScript'
AND payload_commit_msg != ''
GROUP EACH BY msg
)
)
GROUP BY x
ORDER BY c DESC
LIMIT 1000)
);
See full article at http://www.reddit.com/r/bigquery/comments/2kqe4g/words_that_these_developers_say_that_others_dont/
Upvotes: 2
Reputation: 1277
The SPLIT function only accepts constant strings as delimiters. There's no hidden syntax for regular expression delimiters.
An alternative that you could try is using REGEXP_REPLACE to replace all of your desired delimiters with spaces or any single delimiter, like this:
SPLIT(REGEXP_REPLACE(message, ",|;|:|\\.|\\?|!|\t|\n", " "), " ")
Upvotes: 6