Reputation: 468
I am currently trying to do some natural language processing for company names.
The regex I wrote is -\s+\w+('\w+|\s+\w)
This is meant to remove all the text after a hyphen when the hyphen is followed by whitespace.
Next, I use [.,/#!$%\^&*;:{}=-_`''"<>|~()]
to remove all punctuation. Third, I use (Reg|Ltd|PLC|NV|LTD|LLC|INC|LLP|US)
to remove the company suffix. Lastly, some names have carriage returns at the start and at the end of the string; I resolve this with "\r*\n*
I would like to put all of these regex pieces together as I am running this in Alteryx & Python.
Please note: some company names contain a hyphen with no whitespace after it. I need to keep these hyphens and make sure the punctuation removal does not strip them.
How can I combine all of these pieces? And, am I going about this correctly? In the end, after the string clean-up I will be joining this data to another client list to pull back specific information.
This is why all front-ends should NEVER contain a free text field especially for companies.
How do I go about combining these into one pattern, or is it better practice to separate each pattern?
Before
MY COMPANY X,Y,Z, TENNESSEE CORPORATION L.L.C.
MY COMPANY HOLDINGS, LP. (there is a carriage return after the LP.)
ABN FGDF - NEW YORK - UNITED STATES
COLLEGE-INRIA
ABCDE - UNITED STATES
MANAGEMENT MANAGERS - UNITED STATES
INVESTMENT MANAGEMENT CORPORATION - CANADA
AUTO-CHLOR
After
MY COMPANY XYZ TENNESSEE CORPORATION
MY COMPANY HOLDINGS
ABN FGDF
COLLEGE-INRIA
ABCDE
MANAGEMENT MANAGERS
INVESTMENT MANAGEMENT CORPORATION
AUTO-CHLOR
Note that COLLEGE-INRIA is unchanged because there is no whitespace between the hyphen and the next character.
Upvotes: 1
Views: 1652
Reputation: 26
I'm guessing you're well past an urgent need for response, but wanted to answer for posterity.
First, it's really a style question as to whether or not you keep each regex step separate or try to combine them into a single, impressively long, impossible to understand expression. (Your future self and/or others might thank you for keeping them separate.) There are some performance considerations to having fewer regex operations in some contexts, but by and large, I'd say it's better to be able to come back and make sense of what you were trying to do a year or more from now over saving a few cycles.
Second, regex certainly has its uses, but I always ask myself whether there is any way to avoid regex before I actually use it. As the old joke goes, you reach for regex to solve a problem, and now you have two problems...
Finally, with that in mind, you can solve most of these parsing steps in Alteryx without the use of Regex and with similar performance.
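For anyone doing the same thing on the Python side, here is a rough sketch of the no-regex idea using only string methods. It is not an Alteryx workflow, just an illustration. The function name, the assumption that the whitespace after the hyphen is a plain space, and the extra LP suffix (needed for the "HOLDINGS, LP." example but not in the original suffix list) are all mine.

```python
import string

# Suffix list from the question, upper-cased. LP is an added assumption so that
# "MY COMPANY HOLDINGS, LP." ends up as "MY COMPANY HOLDINGS".
SUFFIXES = {"REG", "LTD", "PLC", "NV", "LLC", "INC", "LLP", "LP", "US"}

# Translation table that deletes every ASCII punctuation character except the
# hyphen, so names like COLLEGE-INRIA keep theirs.
DROP_PUNCT = str.maketrans("", "", string.punctuation.replace("-", ""))


def clean_without_regex(name: str) -> str:
    """Hypothetical helper: clean one company name with plain string methods."""
    # Leading/trailing carriage returns, newlines and spaces.
    name = name.strip()
    # Keep only the text before the first hyphen that is followed by a space.
    head, _sep, _tail = name.partition("- ")
    # Remove punctuation, then split on whitespace (this also collapses runs
    # of spaces when the tokens are joined again).
    tokens = head.translate(DROP_PUNCT).split()
    # Drop a trailing company suffix, but never the only remaining word.
    if len(tokens) > 1 and tokens[-1].upper() in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)
```

Each line maps to one of the clean-up steps from the question, so it stays easy to change one step without touching the others.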
If you are trying to do these things in the context of the Python SDK, then I'd still suggest keeping the multiple steps separate for future you and/or others.
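A minimal sketch of what "separate steps" could look like with Python's re module, assuming the Before/After examples from the question are the target. The function and constant names are made up for illustration, LP is added to the suffix list as an assumption (the "HOLDINGS, LP." example needs it), and the patterns are simplified versions of the ones in the question rather than exact copies. One thing worth knowing: inside a character class, =-_ is read as a range from = to _, which in Python's re also covers the uppercase letters, so the punctuation class below is built without a hyphen at all.

```python
import re
import string

# Step 1: everything from a hyphen that is followed by whitespace to the end.
# A hyphen with no whitespace after it (COLLEGE-INRIA) does not match.
AFTER_SPACED_HYPHEN = re.compile(r"\s*-\s+.*$", re.DOTALL)

# Step 2: every ASCII punctuation character except the hyphen. Leaving the
# hyphen out avoids both accidental ranges and stripping AUTO-CHLOR.
PUNCTUATION = re.compile("[" + re.escape(string.punctuation.replace("-", "")) + "]")

# Step 3: a company suffix at the very end of the name. The list comes from the
# question; LP is an assumption added for the "HOLDINGS, LP." example.
SUFFIX = re.compile(r"\s+(?:REG|LTD|PLC|NV|LLC|INC|LLP|LP|US)$", re.IGNORECASE)

# Step 4: runs of whitespace left behind by the earlier steps.
WHITESPACE = re.compile(r"\s+")


def clean_company_name(name: str) -> str:
    """Hypothetical helper: apply each clean-up step in turn."""
    name = name.strip()                        # carriage returns / newlines at the ends
    name = AFTER_SPACED_HYPHEN.sub("", name)   # " - UNITED STATES" and similar
    name = PUNCTUATION.sub("", name)           # "X,Y,Z," -> "XYZ", "L.L.C." -> "LLC"
    name = WHITESPACE.sub(" ", name).strip()   # tidy spacing before anchoring on $
    name = SUFFIX.sub("", name)                # trailing LLC, LP, ...
    return name
```

The order matters here: punctuation is removed before the suffix step so that "L.L.C." has already become "LLC" by the time the suffix pattern runs. Keeping US in the suffix list will also trim names that legitimately end in "US", so that entry is worth a second look against the real data.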
Like most things, there are other ways to approach these issues both in Alteryx and outside of it, but this is how I would go about it based on your initial question.
Upvotes: 1