PeterPiper
PeterPiper

Reputation: 19

Regular Expression to Search Quantity in Human Written Descriptions

Hello and thank you in advance,

I buy items that have a variety of human written listings on auction sites and forums. Often times, the quantity is clear to a person, but extracting it has been a real challenge. I'm using google sheets and REGEXEXTRACT().

I consider myself to be a intermediate regex user, but this has me stumped, so I need an expert.

Here's a few examples, my desired return, and what I'm getting.

Listing Desired Return Actual Return
Red 1996 Corvette 2x - Matchbox 2 2
3 x SmartCar, broken 2nd door 3 3
2nd edition Kindle (x3) 3 3
**1x** 2008 financial crash notice 1 1
Collectors Edition Beannie Baby, item 204/343 1 4
(6) Nissan window motors (1995-1998 ONLY) 6 N/A
White chevy F150, 1996 1 6
Green bowl, cracked (stored in room 2A5) 1 5

As I thought through this, I think I can put some reasonable limitations on this logic, but the code is harder.

Obviously most description don't have any quantity, so if there's no return or zero, that's fine.

I have something that feels close, and does a reasonable job:

[^a-wy-zA-WY-Z0-9]*([1-4]){1}([^a-wA-w0-9]|$)

It doesn't return anything with the returns marked of 1*, and that's fine. It breaks on the last two, and I've struggled for too long!

Thanks in advance!

Upvotes: 0

Views: 64

Answers (1)

Wiktor Stribiżew
Wiktor Stribiżew

Reputation: 627126

You can use

=IFNA(INT(REGEXEXTRACT(REGEXREPLACE(LOWER(A27), "\d{2,}|(x\d)|(\dx)|[^\W\d]+\d\w*|\d+[^\W\d]\w*", "$1$2"), "(\d)")), 1)

Here,

  • REGEXREPLACE(LOWER(A27), "\d{2,}|(x\d)|(\dx)|[^\W\d]+\d\w*|\d+[^\W\d]\w*", "$1$2") finds and removes chunks of two or more digits, or chunks with a digit and at least one letter, but keeps the sequences where a digit is preceded or followed with x
  • REGEXEXTRACT(..., "(\d)")) extracts the first digit left after the replacement
  • =IFNA(INT(...), 1) either casts the found digit to integer, or, if there was no match, inserts 1 into the column.

See the long regex demo.

  • \d{2,} - two or more digits
  • | - or
  • (x\d) - Group 1 ($1): x and a digit
  • | - or
  • (\dx) - Group 2 ($2): a digit and x
  • | - or
  • [^\W\d]+\d\w* - one or more word chars except digits, a digit and then zero or more word chars
  • | - or
  • \d+[^\W\d]\w* - one or more digits, a letter or underscore, and then zero or more word chars.

Demo:

enter image description here

Upvotes: 1

Related Questions