Reputation: 993
I have a dataset with the following kind of data:
company_id, company_name, country
1, a Tech, germany
2, a Tech AG, germany
3, a Tech gmbh, germany
4, AF, germany
5, AF gmbh, germany
I have already assigned company IDs to these companies, based on a preliminary search that gave the same ID to exact matches. Now, I want to do the following:
1) Write a regular expression that finds if a company name is exactly the same as the company name below it, except that the second company name has the suffix "gmbh" at the end of it.
I have everything done except for the logic behind getting the regular expression right. For example:
for next_row in reader:
    first_name = first_row['company_name']
    next_name = next_row['company_name']
    if first_name == next_name:  ## FIX ME
        pass  # do stuff
    writer.writerow(first_row)
    first_row = next_row
The equality test shouldn't be first_name == next_name, but rather whether next_name equals first_name plus the suffix "gmbh".
Would greatly appreciate any clarification!
Upvotes: 0
Views: 157
Reputation: 9158
Search for ^(.*?)(\s+AG)?$ in the first_name string and replace it with \1. This gives you the company name without AG, first_name_without_AG. Then do this: next_name == first_name_without_AG + ' gmbh'
import re
# Anchored with ^ and $ so that only a trailing " AG" is dropped,
# not an " AG" that happens to appear inside a name.
first_name_without_AG = re.sub(r"^(.*?)(\s+AG)?$", r"\1", first_name)
next_name == first_name_without_AG + ' gmbh'
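To see how this behaves on the names from the question, here is a minimal sketch that wraps the idea in a helper. The function name is_gmbh_variant is made up for illustration, and the pattern is anchored with ^ and $ on the assumption that only a trailing " AG" should be stripped.

```python
import re

def is_gmbh_variant(first_name, next_name):
    """Hypothetical helper: True if next_name is first_name (minus a
    trailing ' AG', if any) followed by ' gmbh'."""
    base = re.sub(r"^(.*?)(\s+AG)?$", r"\1", first_name)
    return next_name == base + " gmbh"

# Company names taken from the sample data in the question.
names = ["a Tech", "a Tech AG", "a Tech gmbh", "AF", "AF gmbh"]
for first_name, next_name in zip(names, names[1:]):
    print(first_name, "|", next_name, "|", is_gmbh_variant(first_name, next_name))
```

Only the pairs ("a Tech AG", "a Tech gmbh") and ("AF", "AF gmbh") come out as True. The comparison is case-sensitive, so "GmbH" would not match as written.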
Upvotes: 1
Reputation: 4985
His example has both AG and gmbh?
Why not try something like this:
for next_row in reader:
    first_name = first_row['company_name']
    next_name = next_row['company_name']
    checkLength = len(first_name)
    if first_name == next_name[:checkLength]:  ## FIX ME
        pass  # do stuff
    writer.writerow(first_row)
    first_row = next_row
This only compares the first len(first_name) characters of next_name against first_name, so whatever suffix follows is ignored in the check.
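One thing to watch: a pure prefix check also accepts pairs like "AF" and "AFX gmbh". A possible tightening, sketched below, is to look at what is left over after the prefix and require it to be whitespace plus "gmbh". The helper name differs_only_by_gmbh and the case-insensitive comparison are assumptions, not something the question asked for.

```python
def differs_only_by_gmbh(first_name, next_name, suffix="gmbh"):
    """Hypothetical helper: True if next_name is first_name followed by
    whitespace and the given suffix (compared case-insensitively)."""
    if not next_name.startswith(first_name):
        return False
    remainder = next_name[len(first_name):]
    # The remainder must start with whitespace and contain only the suffix.
    return remainder[:1].isspace() and remainder.strip().lower() == suffix

print(differs_only_by_gmbh("AF", "AF gmbh"))   # True
print(differs_only_by_gmbh("AF", "AFX gmbh"))  # False
print(differs_only_by_gmbh("AF", "AF"))        # False
```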
Upvotes: 1
Reputation: 114038
I think what you want is something like
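import re
# Raw string, so that \1 reaches the regex engine as a backreference.
regx = r"([\w\s]+).*\1\s*gmbh"
# DOTALL lets .* run across line breaks; MULTILINE only changes ^ and $.
re.findall(regx, my_target_text, re.DOTALL)
Something like that, anyway. (\1 refers back to whatever the first parenthesized group matched; the pattern has to be a raw string, otherwise Python reads "\1" as an escape character.)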
Also, this sounds kind of like homework, since you are asking about regex but there is not much need for regex here.
[edit/note] This is in no way a complete implementation and may require significant tweaking of the regex (but it will be similar).
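If you do want a regex in the row-by-row loop from the question, one possible sketch is to build the pattern from the previous name with re.escape and test the next name with re.fullmatch. The generator name gmbh_pairs, the inline sample text, and the use of skipinitialspace and IGNORECASE are all assumptions made for illustration.

```python
import csv
import io
import re

# Sample rows from the question, inlined here in place of a real file.
SAMPLE = """company_id, company_name, country
1, a Tech, germany
2, a Tech AG, germany
3, a Tech gmbh, germany
4, AF, germany
5, AF gmbh, germany
"""

def gmbh_pairs(rows):
    """Hypothetical helper: yield (first_row, next_row) for consecutive rows
    where the second name is the first name plus whitespace and 'gmbh'."""
    rows = iter(rows)
    first_row = next(rows, None)
    for next_row in rows:
        # re.escape keeps characters like '.' or '+' in a name literal.
        pattern = re.escape(first_row["company_name"]) + r"\s+gmbh"
        if re.fullmatch(pattern, next_row["company_name"], flags=re.IGNORECASE):
            yield first_row, next_row
        first_row = next_row

# skipinitialspace drops the blank after each comma in the sample.
reader = csv.DictReader(io.StringIO(SAMPLE), skipinitialspace=True)
matches = [(a["company_id"], b["company_id"]) for a, b in gmbh_pairs(reader)]
print(matches)  # [('4', '5')]
```

Note that IGNORECASE applies to the whole name, not just the suffix, and that "a Tech AG" followed by "a Tech gmbh" is not reported here because the AG is not stripped first.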
Upvotes: 1