user1590499

Reputation: 993

Regular Expression - detecting duplicates

I have a dataset with the following kind of data:

company_id, company_name, country
1, a Tech, germany
2, a Tech AG, germany
3, a Tech gmbh, germany
4, AF, germany
5, AF gmbh, germany

I have already assigned company_ids to these companies based on a preliminary search that assigned IDs to exact matches. Now, I want to do the following:

1) Write a regular expression that checks whether a company name is exactly the same as the company name in the row below it, except that the second company name has the suffix "gmbh" at the end of it.

I have everything done except for the logic behind getting the regular expression right. For example:

    for next_row in reader:
        first_name = first_row['company_name']
        next_name = next_row['company_name']

        if first_name == next_name:##FIX ME
            #do stuff
        writer.writerow(first_row)
        first_row = next_row

The equality test shouldn't be first_name == next_name, but rather whether next_name equals first_name plus " gmbh".

Would greatly appreciate any clarification!

Upvotes: 0

Views: 157

Answers (3)

mmdemirbas

Reputation: 9158

Algorithm

  1. Search for the regex ^(.*?)(\s+AG)?$ in the first_name string and replace it with \1. This gives you the company name without a trailing AG.
  2. Assign the result to first_name_without_AG, then test: next_name == first_name_without_AG + ' gmbh'

Sample Implementation

    import re
    # Strip an optional trailing " AG"; group 1 holds the rest of the name.
    first_name_without_AG = re.sub(r"^(.*?)(\s+AG)?$", r"\1", first_name)
    # True when next_name is the stripped name followed by " gmbh".
    next_name == first_name_without_AG + ' gmbh'
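
A minimal runnable sketch of the two steps above, assuming the only suffix to strip is a literal " AG" at the end of the name. The helper name `is_gmbh_variant` is hypothetical, and the sketch uses the simpler anchored pattern `\s+AG$` instead of the capture-group form.

```python
import re

def is_gmbh_variant(first_name, next_name):
    """Return True if next_name is first_name (minus a trailing ' AG') plus ' gmbh'."""
    # Remove a trailing " AG" only; names without it are left unchanged.
    base = re.sub(r"\s+AG$", "", first_name)
    return next_name == base + " gmbh"

print(is_gmbh_variant("a Tech AG", "a Tech gmbh"))  # True
print(is_gmbh_variant("AF", "AF gmbh"))             # True
print(is_gmbh_variant("a Tech", "a Tech AG"))       # False
```

Because the comparison is an exact string equality, differences in case or extra whitespace (for example "GmbH") would not count as a match in this sketch.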

Upvotes: 1

corn3lius

Reputation: 4985

The example has both AG and gmbh?

Why not try something like this:

    for next_row in reader:
        first_name = first_row['company_name']
        next_name = next_row['company_name']
        checkLength = len(first_name)

        # Compare first_name with the same-length prefix of next_name.
        if first_name == next_name[:checkLength]:
            pass  # do stuff
        writer.writerow(first_row)
        first_row = next_row

This only compares as many characters of next_name as first_name has, so whatever suffix follows is ignored in the check.
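
Note that the slice comparison above is a prefix test, so it is also true for a row pair such as "a Tech" followed by "a Tech AG". If only the "gmbh" suffix should count, a stricter check could look like the sketch below. The function name `has_gmbh_suffix` is hypothetical, and requiring at least one whitespace character before "gmbh" is an assumption about the data.

```python
import re

def has_gmbh_suffix(first_name, next_name):
    # re.escape keeps characters such as "." or "+" in a company name literal.
    pattern = re.escape(first_name) + r"\s+gmbh"
    # fullmatch requires next_name to be exactly first_name + whitespace + "gmbh".
    return re.fullmatch(pattern, next_name) is not None

print(has_gmbh_suffix("AF", "AF gmbh"))        # True
print(has_gmbh_suffix("a Tech", "a Tech AG"))  # False
```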

Upvotes: 1

Joran Beasley

Reputation: 114038

I think what you want is something like

    import re
    # Raw string, so that \1 stays a backreference instead of the character \x01.
    regx = r"([\w\s]+).*\1\s*gmbh"
    re.findall(regx, my_target_text, re.MULTILINE)

Something like that anyway (\1 is a backreference to whatever the first parenthesized group matched).

Also, this sounds kind of like homework, since you are asking about using regex but there is not much need to use regex here.

[edit/note] This is in no way a complete implementation and may require significant tweaking of the regex ... (but it will be similar)
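
To illustrate the backreference idea, here is a small sketch. It assumes `my_target_text` holds one company name per line (not the full CSV rows from the question). The sample text and the line-anchored pattern are illustrative and differ from the regex above.

```python
import re

# Assumed input: one company name per line.
my_target_text = "a Tech\na Tech AG\na Tech gmbh\nAF\nAF gmbh\n"

# With re.MULTILINE, ^ and $ anchor at line boundaries;
# \1 must repeat exactly what group 1 captured on the previous line.
regx = r"^([\w ]+)\n\1 +gmbh$"
print(re.findall(regx, my_target_text, re.MULTILINE))  # ['AF']
```

Only "AF" is reported here, because "a Tech" and "a Tech gmbh" are not on adjacent lines in the sample text.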

Upvotes: 1
