Reputation: 21
Is there any solution in bigquery to break a column of string length 1500 characters should be split into 264 characters in each columns without breaking/splitting the words
Upvotes: 1
Views: 770
Reputation: 12234
DECLARE LONG_SENTENCE DEFAULT "It was my fourth day walking the Island Walk, a new 700km route that circles Canada's smallest province. Starting on PEI's rural west end, I had walked past vinyl-clad farmhouses with ocean vistas, along a boardwalk beneath whirling wind turbines, and above red clay cliffs that plunged sharply into the sea. I had stopped for a midday country music hour at the Stompin' Tom Centre, honouring Canadian singer-songwriter Tom Connors. I'd tromped through the rain along a secluded, wooded trail where swarms of canny mosquitos tried to shelter under my umbrella. And after learning about PEI's major crop at the Canadian Potato Museum, I had fuelled my day's walk with an extra-large cheese-topped baked potato served with freshly made potato chips. You know that a place is serious about its spuds when your potato comes with a side of potatoes.";
CREATE TEMP FUNCTION cumsumbin(a ARRAY<INT64>) RETURNS INT64
LANGUAGE js AS """
bin = 0;
a.reduce((c, v) => {
if (c + Number(v) > 264) { bin += 1; return Number(v); }
else return c += Number(v);
}, 0);
return bin;
""";
WITH splits AS (
SELECT w, cumsumbin(ARRAY_AGG(LENGTH(w) + 1) OVER (ORDER BY o)) AS bin
FROM UNNEST(SPLIT(LONG_SENTENCE, ' ')) w WITH OFFSET o
)
SELECT * FROM (
SELECT bin, STRING_AGG(w, ' ') AS segment
FROM splits
GROUP BY 1
) PIVOT (ANY_VALUE(segment) AS segment FOR bin IN (0, 1, 2, 3))
;
segment_0 | segment_1 | segment_2 | segment_3 |
---|---|---|---|
It was my fourth day walking the Island Walk, a new 700km route that circles Canada's smallest province. Starting on PEI's rural west end, I had walked past vinyl-clad farmhouses with ocean vistas, along a boardwalk beneath whirling wind turbines, and above red | clay cliffs that plunged sharply into the sea. I had stopped for a midday country music hour at the Stompin' Tom Centre, honouring Canadian singer-songwriter Tom Connors. I'd tromped through the rain along a secluded, wooded trail where swarms of canny mosquitos | tried to shelter under my umbrella. And after learning about PEI's major crop at the Canadian Potato Museum, I had fuelled my day's walk with an extra-large cheese-topped baked potato served with freshly made potato chips. You know that a place is serious about | its spuds when your potato comes with a side of potatoes. |
segment_0 | segment_1 | segment_2 | segment_3 |
---|---|---|---|
261 | 262 | 261 | 57 |
[note] below expression (.{1,264}\b) is simple but word boundary doesn't include a period(.), thus result can have some error. You can see last period(.) in segment_3 is missing. But under centain circumtances this might be useful, I think.
SELECT * FROM (
SELECT *
FROM UNNEST(REGEXP_EXTRACT_ALL(LONG_SENTENCE, r'(.{1,264}\b)')) segment WITH OFFSET o
) PIVOT (ANY_VALUE(segment) segment FOR o IN (0, 1, 2, 3));
segment_0 | segment_1 | segment_2 | segment_3 |
---|---|---|---|
It was my fourth day walking the Island Walk, a new 700km route that circles Canada's smallest province. Starting on PEI's rural west end, I had walked past vinyl-clad farmhouses with ocean vistas, along a boardwalk beneath whirling wind turbines, and above red | clay cliffs that plunged sharply into the sea. I had stopped for a midday country music hour at the Stompin' Tom Centre, honouring Canadian singer-songwriter Tom Connors. I'd tromped through the rain along a secluded, wooded trail where swarms of canny mosquitos | tried to shelter under my umbrella. And after learning about PEI's major crop at the Canadian Potato Museum, I had fuelled my day's walk with an extra-large cheese-topped baked potato served with freshly made potato chips. You know that a place is serious about | its spuds when your potato comes with a side of potatoes |
segment_0 | segment_1 | segment_2 | segment_3 |
---|---|---|---|
261 | 262 | 261 | 56 |
Upvotes: 1
Reputation: 172974
Consider below approach
create temp function split_parts(parts array<string>, max_len int64) returns array<string>
language js as """
var arr = [];
var part = '';
for (i = 0; i < parts.length; i++) {
if (part.length + parts[i].length < max_len){part += parts[i]}
else {arr.push(part); part = parts[i];}
}
arr.push(part);
return arr;
""";
select * from (
select id, offset, part
from your_table, unnest(split_parts(regexp_extract_all(col, r'[^ ]+ ?'), 50)) part with offset
)
pivot (any_value(trim(part)) as part for offset in (0, 1, 2, 3))
if applied to dummy data as below with split size = 50
output is
Upvotes: 1
Reputation: 3528
Regular expression are a good way to accomplish this task. However, BigQuery is still quite limited in the usasge of regular expression. Therefore, I would suggest to solve this with a UDF and JavaScript. A solution for JavaScript can be found here: https://www.tutorialspoint.com/how-to-split-sentence-into-blocks-of-fixed-length-without-breaking-words-in-javascript
Adaption this solution to BigQuery
The function string_split
expects the character counts to be splitted and the text to be splitted. It returns an array with the chunks. The chunks can be two characters longer than the given size value due to the spaces.
CREATE TEMP FUNCTION string_split(size int64,str string)
RETURNS ARRAY<STRING>
LANGUAGE js AS r"""
const regraw='\\S.{3,' + size + '}\\S(?= |$)';
const regex = new RegExp(new RegExp(regraw, 'g'), 'g');
return str.match(regex);
""";
SELECT text, split_text,
#length(split_text)
FROM
(
SELECT
text,string_split(20,text) as split_text
FROM (
SELECT "Is there any solution in bigquery to break a column of string length 1500 characters should be split into 264 characters in each columns without breaking/splitting the words" AS text
UNION ALL SELECT "This is a short text. And can be splitted as well."
)
)
#, unnest(split_text) as split_text #
Please uncomment the two lines to split the text from the array into single rows.
For larger datasets it also works and took less than two minutes:
CREATE TEMP FUNCTION string_split(size int64,str string)
RETURNS ARRAY<STRING>
LANGUAGE js AS r"""
const regraw='\\S.{3,' + size + '}\\S(?= |$)';
const regex = new RegExp(new RegExp(regraw, 'g'), 'g');
return str.match(regex);
""";
SELECT text, split_text,
length(split_text)
FROM
(
SELECT
text,string_split(40,text) as split_text
FROM (
SELECT abstract as text from `bigquery-public-data.breathe.jama`
)
)
, unnest(split_text) as split_text #
order by 3 desc
Upvotes: 1