RegEx for removing company suffix and keep original, or positive lookahead?

Question

I am currently trying to do some natural language processing for company names.

The first regex I wrote is -\s+\w+('\w+|\s+\w), which removes all the text after a hyphen when the hyphen is followed by whitespace. Next, I remove all punctuation with [.,/#!$%\^&*;:{}=-_`''"<>|~()]. Third, I remove the company suffix with (Reg|Ltd|PLC|NV|LTD|LLC|INC|LLP|US). Lastly, some names have carriage returns at the start and at the end of the string; I resolve this with " * *.

I would like to put all of these regex pieces together as I am running this in Alteryx & Python.

Please note: there are company names with a hyphen that is not followed by whitespace. I need to keep these hyphens and make sure they are not removed by the punctuation removal.

How can I combine all of these pieces? And, am I going about this correctly? In the end, after the string clean-up I will be joining this data to another client list to pull back specific information.

This is why all front-ends should NEVER contain a free text field especially for companies.

How do I go about combining these into one pattern, or is it better practice to separate each pattern?

Before
MY COMPANY X,Y,Z,
TENNESSEE CORPORATION L.L.C.
MY COMPANY HOLDINGS, LP. (there is a carriage return after the LP.)
ABN FGDF - NEW YORK - UNITED STATES
COLLEGE-INRIA ABCDE - UNITED STATES
MANAGEMENT MANAGERS - UNITED STATES
INVESTMENT MANAGEMENT CORPORATION - CANADA
AUTO-CHLOR

After
MY COMPANY XYZ
TENNESSEE CORPORATION
MY COMPANY HOLDINGS
ABN FGDF
COLLEGE-INRIA ABCDE
MANAGEMENT MANAGERS
INVESTMENT MANAGEMENT CORPORATION
AUTO-CHLOR

Note that COLLEGE-INRIA stayed intact, as there was no whitespace between the hyphen and the next character.

Esoterik · Accepted Answer

I'm guessing you're well past any urgent need for a response, but I wanted to answer for posterity.

First, it's really a style question as to whether or not you keep each regex step separate or try to combine them into a single, impressively long, impossible to understand expression. (Your future self and/or others might thank you for keeping them separate.) There are some performance considerations to having fewer regex operations in some contexts, but by and large, I'd say it's better to be able to come back and make sense of what you were trying to do a year or more from now over saving a few cycles.

Second, regex certainly has its uses, but I always ask myself if there is any way to avoid using regex before I actually use it. Now you have two problems...

Finally, with that in mind, you can solve most of these parsing steps in Alteryx without the use of Regex and with similar performance.

The removal of the hyphen followed by a space can be accomplished with a Text To Columns tool using the pattern " -" (space + hyphen), and then working only with the first column that results from that for the rest of the workflow (or using a Select tool to remove the trash columns entirely).
You can remove all whitespace (including line breaks, tabs, etc.) as well as all special characters with a Data Cleansing tool with the appropriate boxes checked in the Remove Unwanted Characters section. You can make this speedier by doing this after removing the unwanted parts of the original string. BUT, this will remove the wanted hyphens not encapsulated in whitespace, so...
You can set up a simple Formula tool expression with the pattern you already have for matching all of the special characters you want to replace, using the REGEX_Replace() function. You could also use the Find and Replace tool, or a bunch of nested Replace() functions, but in this case the REGEX_Replace() function is probably the most concise and easy to understand, assuming anyone who will have to maintain the workflow is able to deal with regex.
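One caveat about reusing the punctuation pattern from the question as-is: inside a character class, =-_ is read as a range from = to _, and that range happens to include every uppercase letter. Below is a minimal Python sketch of the problem and one possible fix (leaving the hyphen out of the class altogether). The names RANGE_BUG, PUNCT_KEEP_HYPHEN and strip_punctuation are mine, purely for illustration, and the sample strings are made up.

```python
import re

# Only the "=-_" part of the question's class: a range from "=" (0x3D)
# to "_" (0x5F), which covers A-Z as well as a few symbols.
RANGE_BUG = re.compile(r"[=-_]")

# Same punctuation as in the question, but with no hyphen in the class,
# so hyphens inside names such as AUTO-CHLOR are left alone.
PUNCT_KEEP_HYPHEN = re.compile(r"[.,/#!$%^&*;:{}=_`'\"<>|~()]")


def strip_punctuation(name: str) -> str:
    """Remove punctuation but keep hyphens (hypothetical helper)."""
    return PUNCT_KEEP_HYPHEN.sub("", name)


# The accidental range eats the uppercase letters.
print(RANGE_BUG.sub("", "Auto-Chlor, Inc."))
# The corrected class only removes the comma and the period.
print(strip_punctuation("AUTO-CHLOR, INC."))
```

If the hyphen ever does need to be in the class, escaping it (\-) or putting it last avoids the accidental range.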

If you are trying to do these things in the context of the Python SDK, then I'd still suggest keeping the multiple steps separate for future you and/or others.
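As a rough sketch of what "separate steps" could look like in plain Python (this is not the Alteryx SDK, just the standard re module): each step gets its own named pattern, so any one of them can be changed or tested without touching the rest. The function name clean_company_name and the pattern names are hypothetical. The suffix list is the one from the question, with LP added as my own assumption so that the "MY COMPANY HOLDINGS, LP." sample comes out as expected.

```python
import re

# Step 1: cut everything from the first hyphen that is followed by
# whitespace. A hyphen followed directly by a letter does not match.
TRAILING_PART = re.compile(r"\s*-\s.*", re.DOTALL)

# Step 2: punctuation to drop. The hyphen is deliberately not listed.
PUNCTUATION = re.compile(r"[.,/#!$%^&*;:{}=_`'\"<>|~()]")

# Step 3: company suffix at the end of the name. The list comes from the
# question; LP is an assumption added to fit the sample data.
SUFFIX = re.compile(r"\s+(?:Reg|Ltd|PLC|NV|LLC|INC|LLP|LP|US)$", re.IGNORECASE)

# Step 4: collapse any run of whitespace into one space.
EXTRA_SPACE = re.compile(r"\s+")


def clean_company_name(raw: str) -> str:
    """Apply the cleaning steps one at a time, in a fixed order."""
    name = raw.strip()                    # leading/trailing CR, LF, spaces
    name = TRAILING_PART.sub("", name)    # " - NEW YORK - UNITED STATES"
    name = PUNCTUATION.sub("", name)      # "L.L.C." becomes "LLC"
    name = SUFFIX.sub("", name)           # trailing " LLC", " LP", ...
    return EXTRA_SPACE.sub(" ", name).strip()


for raw in ["ABN FGDF - NEW YORK - UNITED STATES",
            "COLLEGE-INRIA ABCDE - UNITED STATES",
            "MY COMPANY HOLDINGS, LP.\r\n"]:
    print(repr(clean_company_name(raw)))
```

The order matters here: punctuation is removed before the suffix check so that "L.L.C." has already become "LLC" by the time the suffix pattern runs.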

Like most things, there are other ways to approach these issues in Alteryx and outside of Alteryx, but these are how I would go about it based on your initial question.
