chadrika ramineni
chadrika ramineni

Reputation: 21

Using Google big query sql split the string in a column to multiple columns without breaking words

Is there any solution in bigquery to break a column of string length 1500 characters should be split into 264 characters in each columns without breaking/splitting the words

Upvotes: 1

Views: 770

Answers (3)

Jaytiger
Jaytiger

Reputation: 12234

Non-regexp Approach

DECLARE LONG_SENTENCE DEFAULT "It was my fourth day walking the Island Walk, a new 700km route that circles Canada's smallest province. Starting on PEI's rural west end, I had walked past vinyl-clad farmhouses with ocean vistas, along a boardwalk beneath whirling wind turbines, and above red clay cliffs that plunged sharply into the sea. I had stopped for a midday country music hour at the Stompin' Tom Centre, honouring Canadian singer-songwriter Tom Connors. I'd tromped through the rain along a secluded, wooded trail where swarms of canny mosquitos tried to shelter under my umbrella. And after learning about PEI's major crop at the Canadian Potato Museum, I had fuelled my day's walk with an extra-large cheese-topped baked potato served with freshly made potato chips. You know that a place is serious about its spuds when your potato comes with a side of potatoes.";

CREATE TEMP FUNCTION cumsumbin(a ARRAY<INT64>) RETURNS INT64
LANGUAGE js AS """
  bin = 0;
  a.reduce((c, v) => {
    if (c + Number(v) > 264) { bin += 1; return Number(v); }
    else return c += Number(v); 
  }, 0);

  return bin;
""";

WITH splits AS (
  SELECT w, cumsumbin(ARRAY_AGG(LENGTH(w) + 1) OVER (ORDER BY o)) AS bin 
    FROM UNNEST(SPLIT(LONG_SENTENCE, ' ')) w WITH OFFSET o
)
SELECT * FROM (
  SELECT bin, STRING_AGG(w, ' ') AS segment
    FROM splits
   GROUP BY 1
)  PIVOT (ANY_VALUE(segment) AS segment FOR bin IN (0, 1, 2, 3))
;
Query results:
segment_0 segment_1 segment_2 segment_3
It was my fourth day walking the Island Walk, a new 700km route that circles Canada's smallest province. Starting on PEI's rural west end, I had walked past vinyl-clad farmhouses with ocean vistas, along a boardwalk beneath whirling wind turbines, and above red clay cliffs that plunged sharply into the sea. I had stopped for a midday country music hour at the Stompin' Tom Centre, honouring Canadian singer-songwriter Tom Connors. I'd tromped through the rain along a secluded, wooded trail where swarms of canny mosquitos tried to shelter under my umbrella. And after learning about PEI's major crop at the Canadian Potato Museum, I had fuelled my day's walk with an extra-large cheese-topped baked potato served with freshly made potato chips. You know that a place is serious about its spuds when your potato comes with a side of potatoes.
Length of each segment
segment_0 segment_1 segment_2 segment_3
261 262 261 57

Regexp Approach

[note] below expression (.{1,264}\b) is simple but word boundary doesn't include a period(.), thus result can have some error. You can see last period(.) in segment_3 is missing. But under centain circumtances this might be useful, I think.

SELECT * FROM (
  SELECT * 
    FROM UNNEST(REGEXP_EXTRACT_ALL(LONG_SENTENCE, r'(.{1,264}\b)')) segment WITH OFFSET o
) PIVOT (ANY_VALUE(segment) segment FOR o IN (0, 1, 2, 3));
Query rseults:
segment_0 segment_1 segment_2 segment_3
It was my fourth day walking the Island Walk, a new 700km route that circles Canada's smallest province. Starting on PEI's rural west end, I had walked past vinyl-clad farmhouses with ocean vistas, along a boardwalk beneath whirling wind turbines, and above red clay cliffs that plunged sharply into the sea. I had stopped for a midday country music hour at the Stompin' Tom Centre, honouring Canadian singer-songwriter Tom Connors. I'd tromped through the rain along a secluded, wooded trail where swarms of canny mosquitos tried to shelter under my umbrella. And after learning about PEI's major crop at the Canadian Potato Museum, I had fuelled my day's walk with an extra-large cheese-topped baked potato served with freshly made potato chips. You know that a place is serious about its spuds when your potato comes with a side of potatoes

Length of each segment

segment_0 segment_1 segment_2 segment_3
261 262 261 56

Upvotes: 1

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172974

Consider below approach

create temp function split_parts(parts array<string>, max_len int64) returns array<string>
language js as """
  var arr = [];
  var part = '';  
  for (i = 0; i < parts.length; i++) { 
    if (part.length + parts[i].length < max_len){part += parts[i]}
    else {arr.push(part); part = parts[i];}
  }
  arr.push(part);
  return arr;
""";
select * from (
  select id, offset, part
  from your_table, unnest(split_parts(regexp_extract_all(col, r'[^ ]+ ?'), 50)) part with offset
)
pivot (any_value(trim(part)) as part for offset in (0, 1, 2, 3))

if applied to dummy data as below with split size = 50

enter image description here

output is

enter image description here

Upvotes: 1

Samuel
Samuel

Reputation: 3528

Regular expression are a good way to accomplish this task. However, BigQuery is still quite limited in the usasge of regular expression. Therefore, I would suggest to solve this with a UDF and JavaScript. A solution for JavaScript can be found here: https://www.tutorialspoint.com/how-to-split-sentence-into-blocks-of-fixed-length-without-breaking-words-in-javascript

Adaption this solution to BigQuery

The function string_split expects the character counts to be splitted and the text to be splitted. It returns an array with the chunks. The chunks can be two characters longer than the given size value due to the spaces.

CREATE TEMP FUNCTION string_split(size int64,str string)
  RETURNS ARRAY<STRING>
  LANGUAGE js AS r""" 
    const regraw='\\S.{3,' + size + '}\\S(?= |$)';
    const regex = new RegExp(new RegExp(regraw, 'g'), 'g');
    return str.match(regex);
  """;
SELECT text, split_text,
#length(split_text)
FROM
(
SELECT
  text,string_split(20,text) as split_text
FROM (
  SELECT  "Is there any solution in bigquery to break a column of string length 1500 characters should be split into 264 characters in each columns without breaking/splitting the words" AS text
  UNION ALL  SELECT  "This is a short text. And can be splitted as well." 
    )
)
#, unnest(split_text) as split_text # 

Please uncomment the two lines to split the text from the array into single rows.

For larger datasets it also works and took less than two minutes:

CREATE TEMP FUNCTION string_split(size int64,str string)
  RETURNS ARRAY<STRING>
  LANGUAGE js AS r""" 
    const regraw='\\S.{3,' + size + '}\\S(?= |$)';
    const regex = new RegExp(new RegExp(regraw, 'g'), 'g');
    return str.match(regex);
  """;
SELECT text, split_text,
length(split_text)
FROM
(
SELECT
  text,string_split(40,text) as split_text
FROM (
  SELECT abstract as text from `bigquery-public-data.breathe.jama`
    )
)
, unnest(split_text) as split_text # 
order by 3 desc

Upvotes: 1

Related Questions