Reputation: 3198

Extract many lines of text in a sheet cell using regex

I need to extract the lines of text without header and upto the line ,,,,,Total

INPUT

Sr.No ,Description of Goods ,HSN ,,City ,Unit ,,Unit Price ,Amount (in BP) 
1 ,29550762R3C0-Top- ST22E360 ,84439990 ,,1.00 ,Numbers ,,5000.00 ,5000.00 
2 ,29550762R3C0-Bot- ST22E361 ,84439990 ,,1.00 ,Numbers ,,5000.00 ,5000.00 
3 ,2945486609-G2 - Bot Stencil - ST22F199 ,84439990 ,,1.00 ,Numbers ,,5000.00 ,5000.00 
4 ,2945457600-S1 - Top Stencil - ST22F224 ,84439990 ,,1.00 ,Numbers ,,5000.00 ,5000.00 
5 ,2945457600-G2 - Bot Stencil - ST22F225 ,84439990 ,,1.00 ,Numbers ,,5000.00 ,5000.00 
,,,,,Total ,,,25000.00 IN 
,,,,,Discount ,,,0.00 3 
,,,,Total Amount ,Before Tax ,,,"₹ 25,000.00 "
,,,,,IGST 0% ,,,₹ 0.00 
,,,,Total Amount ,After Tax: ,,,"₹ 25,000.00 "

EXPECTED OUTPUT

1 ,29550762R3C0-Top- ST22E360 ,84439990 ,,1.00 ,Numbers ,,5000.00 ,5000.00 
2 ,29550762R3C0-Bot- ST22E361 ,84439990 ,,1.00 ,Numbers ,,5000.00 ,5000.00 
3 ,2945486609-G2 - Bot Stencil - ST22F199 ,84439990 ,,1.00 ,Numbers ,,5000.00 ,5000.00 
4 ,2945457600-S1 - Top Stencil - ST22F224 ,84439990 ,,1.00 ,Numbers ,,5000.00 ,5000.00 
5 ,2945457600-G2 - Bot Stencil - ST22F225 ,84439990 ,,1.00 ,Numbers ,,5000.00 ,5000.00

I have tried with: =REGEXEXTRACT(A10,"(\n.*)"), The number of lines are not known in advance and can vary from cell to cell. Some may have one and some have 5 to 6.

Upvotes: 0

Answers (3)

RavinderSingh13

Reputation: 133538

With your shown samples Only, please try following solutions. Here is the Online demo for shown regex.

Google sheet formula with regex: THANKS to @Jvdv for such a nice formula, credit to him for throwing light on this one.

=FLATTEN(SPLIT(REGEXEXTRACT(A1,"((?:\n[0-9]+\s+,[0-9a-zA-Z]+-.*?-.*?,[0-9]+(?:.*?,){2}[0-9]+(?:\.[0-9]+)?\s+,Numbers\s+(?:.*?,){2}[0-9]+(?:\.[0-9]+)?\s+,[0-9]+(?:\.[0-9]+)?\b\s?)+)"),CHAR(10),1,1))

Only regex solution: Where we can use regex inside REGEXEXTRACT function but since it generates only 1 capturing group so keeping whole regex into one capturing group and then we need to split it with new lines, which is why ABOVE solution is kept as an whole solution.

(^[0-9]+\s+,[0-9a-zA-Z]+-.*?-.*?,[0-9]+(?:.*?,){2}[0-9]+(?:\.[0-9]+)?\s+,Numbers\s+(?:.*?,){2}[0-9]+(?:\.[0-9]+)?\s+,[0-9]+(?:\.[0-9]+)?\b)

Explanation: Adding detailed explanation for above regex.

(                   ##starting main capturing group.
^[0-9]+\s+,         ##Matching digits from starting of value followed by spaces followed by comma.
[0-9a-zA-Z]+-       ##Matching alphanumeric 1 or more occurrences followed by -
.*?-.*?,            ##Using Lazy match till - followed by another lazy match till comma.
[0-9]+              ##Matching digits 1 or more occurrences here.
(?:.*?,){2}         ##In a non-capturing group using lazy match till comma with its 2 occurrences.
[0-9]+              ##Matching digits 1 or more occurrences here.
(?:\.[0-9]+)?       ##Followed by dot followed by 1 or more digits to handle float matches keeping it optional.
\s+,                ##Matching 1 or more occurrences of spaces followed by comma.
Numbers\s+          ##Matching string Numbers followed by spaces 1 or more occurrences.
(?:.*?,){2}         ##In a non-capturing group matching till comma and making its 2 occurrences.
[0-9]+              ##Matching digits 1 or more occurrences here.
(?:\.[0-9]+)?       ##Followed by dot followed by 1 or more digits to handle float matches keeping it optional.
\s+,                ##Matching 1 or more occurrences of spaces followed by comma.
[0-9]+(?:\.[0-9]+)? ##Matching 1 or more digits followed by dot followed by 1 or more digits to handle float matches keeping it optional.
\b                  ##Using word boundary to make sure previous match it exact match.
)                   ##closing opened capturing group in starting.

Upvotes: 3

JvdV

Reputation: 75870

My two cents, assuming:

Each line of interest starts with a digit;
You'd like to actually spill each line of interest down.

Formula in A3:

=QUERY(FLATTEN(SPLIT(A1,CHAR(10))),"where Col1 matches '^\d.+'")

Upvotes: 1

Roko C. Buljan

Reputation: 206151

I would simply match the lines which start with digits,space,comma till the end of line:

/^\d+ *,.*$/gm

Demo on Regex101.com

^    Start of line
\d+  One or more digits
 *   Zero or more spaces
,    Comma
.*   Any character
$    End of line

/gm  Flags: Global; Multi line

Upvotes: 0

Extract many lines of text in a sheet cell using regex

Answers (3)

Related Questions