Reputation: 11

finding duplicate words in a string and print them using re

I need some help printing duplicated last names from a text file (lowercase and uppercase should be treated as the same). The program should not print names that contain numbers (i.e. if a number appears in the last name or in the first name, the whole name is ignored).

for example: my text file is :

Assaf Spanier, Assaf Din, Yo9ssi Levi, Yoram bibe9rman, David levi, Bibi Netanyahu, Amnon Levi, Ehud sPanier, Barak Spa7nier, Sara Neta4nyahu

the output should be:

Assaf

Assaf

David

Bibi

Amnon

Ehud

========

Spanier

Levi

import re

def delete_numbers(line):
   words = re.sub(r'\w*\d\w*', '', line).strip()
   for t in re.split(r',', words):
      if len(t.split()) == 1:
         words = re.sub(t, '',words)
         words = re.sub(',,', '', words)
   return words


fname = input("Enter file name: ")
file = open(fname,"r")
for line in file.readlines():
   words = delete_numbers(line)
   first_name = re.findall(r"([a-zA-Z]+)\s",words)
   for i in first_name:
      print(i)
   print("***")

a = ""
for t in re.split(r',', words):
  a+= (", ".join(t.split()[1:])) + " "

Upvotes: 0

Answers (2)

zwer

Reputation: 25789

Fine, since you insist on doing it using regex, you should strive to do it in a single pass over the text so you don't pay for repeated calls into the regex engine. The best approach would be to write a pattern that captures every first/last name pair that doesn't include numbers, with pairs separated by commas, let the regex engine find them all, then iterate over the matches and, finally, store them in a dictionary so you end up with a last name => first names map:

import collections
import re

text = "Assaf Spanier, Assaf Din, Yo9ssi Levi, Yoram bibe9rman, David levi, " \
       "Bibi Netanyahu, Amnon Levi, Ehud sPanier, Barak Spa7nier, Sara Neta4nyahu"

full_name = re.compile(r"(?:^|\s|,)([^\d\s]+)\s+([^\d\s]+)(?=$|,)")  # compile the pattern

matches = collections.OrderedDict()  # store for the last=>first name map preserving order
for match in full_name.finditer(text):
    first_name = match.group(1)
    print(first_name)  # print the first name to match your desired output
    last_name = match.group(2).title()  # capitalize the last name for case-insensitivity
    if last_name in matches:  # repeated last name
        matches[last_name].append(first_name)  # add the first name to the map
    else:  # encountering this last name for the first time
        matches[last_name] = [first_name]  # initialize the map for this last name
print("========")  # print the separator...
# finally, print all the repeated last names to match your format
for k, v in matches.items():
    if len(v) > 1:  # print only those with more than one first name attached
        print(k)

And this will give you:

Assaf
Assaf
David
Bibi
Amnon
Ehud
========
Spanier
Levi

In addition, you have the full last name => first names map in matches.

When it comes to the pattern, let's break it down piece by piece:

(?:^|\s|,) - match the beginning of the string, whitespace or a comma (non-capturing)
  ([^\d\s]+) - followed by one or more characters that are not digits or whitespace
               (capturing)
    \s+  - followed by one or more whitespace characters (non-capturing)
      ([^\d\s]+) - followed by the same pattern as for the first name (capturing)
         (?=$|,) - followed by a comma or end of the string  (look-ahead, non-capturing)

The two captured groups (first and last name) are then referenced in the match object when we iterate over matches. Easy-peasy.
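As a rough illustration of the breakdown above, the same pattern can be written with re.VERBOSE so that each piece carries its own comment. This is only a sketch: the sample string is a shortened, made-up variant of the question's data, and the look-ahead is written as (?=$|,).

```python
import re

# The pattern from the breakdown, spread over several lines with re.VERBOSE.
# Whitespace inside the pattern is ignored and "#" starts a comment.
full_name = re.compile(r"""
    (?:^|\s|,)      # start of string, whitespace or a comma (not captured)
    ([^\d\s]+)      # first name: no digits, no whitespace
    \s+             # whitespace between the two names
    ([^\d\s]+)      # last name: same rule as the first name
    (?=$|,)         # must be followed by end of string or a comma
""", re.VERBOSE)

# Shortened sample (an assumption for this sketch, not the full input).
sample = "Assaf Spanier, Yo9ssi Levi, David levi, Ehud sPanier"
pairs = [(m.group(1), m.group(2)) for m in full_name.finditer(sample)]
print(pairs)
```

The name containing a digit is skipped, and the last pair is picked up through the end-of-string branch of the look-ahead.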

Upvotes: 0

juanpa.arrivillaga

Reputation: 95948

Ok, first let's start with an aside: opening files in an idiomatic way. Use the with statement, which guarantees your file will be closed. For small scripts, this isn't a big deal, but if you ever start writing longer-lived programs, resource leaks from files that were never closed (open file handles piling up) can come back to haunt you. Since your file has everything on a single line:

with open(fname) as f:
    data = f.read()

The file is now closed. This also encourages you to deal with your file immediately, and not leave it open, consuming resources unnecessarily. Another aside: let's suppose you did have multiple lines. Instead of using for line in f.readlines(), use the following construct:

with open(fname) as f:
    for line in f:
        do_stuff(line)

Since you don't actually need to keep the whole file, and only need to inspect each line, don't use readlines(). Only use readlines() if you want to keep a list of lines around, something like lines = f.readlines().
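Here is a small self-contained check of both points, sketched with a throwaway temporary file (the file contents and variable names are made up for the illustration):

```python
import os
import tempfile

# Write a small throwaway file so the example does not depend on a real path.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Assaf Spanier, Assaf Din\nDavid levi, Amnon Levi\n")
    path = tmp.name

with open(path) as f:
    # Iterating over the file object yields one line at a time,
    # without building a list of all lines first.
    lines_seen = sum(1 for _ in f)

closed_after_with = f.closed  # the with block has already closed the file
os.remove(path)               # clean up the temporary file

print(lines_seen, closed_after_with)
```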

OK, finally, data will look something like this:

>>> print(data)
Assaf Spanier, Assaf Din, Yo9ssi Levi, Yoram bibe9rman, David levi, Bibi Netanyahu, Amnon Levi, Ehud sPanier, Barak Spa7nier, Sara Neta4nyahu

Ok, so if you want to use regex here, I suggest the following approach:

>>> names_regex = re.compile(r"^(\D+)\s(\D+)$")

The pattern here, ^(\D+)\s(\D+)$, uses the non-digit character class, \D (the opposite of \d, the digit class), and the whitespace class, \s. It also uses the anchors ^ and $ to pin the pattern to the beginning and end of the text respectively, and the parentheses create capturing groups, which we will leverage. Try copy-pasting this into http://regexr.com/ and play around with it if you still don't understand. One important note: use raw strings, i.e. r"this is a raw string" versus normal strings, "this is a normal string" (notice the r). This is because Python string literals use the backslash for their own escape sequences, just as regex patterns do. This will help maintain your sanity. Ok, finally, I suggest using the grouping idiom, with a dict:

>>> grouper = {}

Now, our loop:

>>> for fullname in data.split(','):
...     match = names_regex.search(fullname.strip())
...     if match:
...         first, last = match.group(1), match.group(2)
...         grouper.setdefault(last.title(), []).append(first.title())
...

Note, I used the .title method to normalize all our names to "Titlecase". dict.setdefault takes a key as its first argument; if the key doesn't exist, it stores the second argument as the value, and in either case it returns the value now stored under that key. So, I am checking whether the last name, in title case, exists in the grouper dict, and if not, setting it to an empty list, [], then appending to whatever is there!
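As an aside, the same grouping can be sketched with collections.defaultdict from the standard library, which creates the empty list on first access. This is an alternative to the setdefault call above, not what the loop uses, and the pairs list is made-up sample input:

```python
from collections import defaultdict

# Made-up (first, last) pairs standing in for the regex matches.
pairs = [("Assaf", "Spanier"), ("David", "levi"),
         ("Amnon", "Levi"), ("Ehud", "sPanier")]

grouper = defaultdict(list)  # a missing key starts out as an empty list
for first, last in pairs:
    # .title() normalizes "levi" and "sPanier" to "Levi" and "Spanier"
    grouper[last.title()].append(first.title())

print(dict(grouper))
```

setdefault keeps everything in a plain dict, while defaultdict saves repeating the default on every call. Either works for this task.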

Now pretty-printing for clarity:

>>> from pprint import pprint
>>> pprint(grouper)
{'Din': ['Assaf'],
 'Levi': ['David', 'Amnon'],
 'Netanyahu': ['Bibi'],
 'Spanier': ['Assaf', 'Ehud']}

This is a very useful data-structure. We can, for example, get all last-names with more than a single first name:

>>> for last, firsts in grouper.items():
...     if len(firsts) > 1:
...         print(last)
...
Spanier
Levi

So, putting it all together:

>>> grouper = {}
>>> names_regex = re.compile(r"^(\D+)\s(\D+)$")
>>> for fullname in data.split(','):
...     match = names_regex.search(fullname.strip())
...     if match:
...         first, last = match.group(1), match.group(2)
...         first, last = first.title(), last.title()
...         print(first)
...         grouper.setdefault(last, []).append(first)
...
Assaf
Assaf
David
Bibi
Amnon
Ehud
>>> for last, firsts in grouper.items():
...     if len(firsts) > 1:
...         print(last)
...
Spanier
Levi

Note, I have assumed order doesn't matter, so I used a normal dict. My output happens to be in the correct order because on CPython 3.6 dicts preserve insertion order. In 3.6 that is an implementation detail and not a guarantee; from Python 3.7 onward it is part of the language. Use collections.OrderedDict if you need guaranteed order on older versions.
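For reuse outside the interactive session, the steps above could be wrapped in a function along these lines. It is a minimal sketch: split_names and NAME_RE are hypothetical names introduced here, and the order of the duplicated last names relies on dict insertion order (Python 3.7+).

```python
import re

NAME_RE = re.compile(r"^(\D+)\s(\D+)$")  # same pattern as names_regex above

def split_names(text):
    """Return (first_names, duplicated_last_names) for a comma-separated string."""
    grouper = {}
    first_names = []
    for fullname in text.split(","):
        match = NAME_RE.search(fullname.strip())
        if match:  # names containing digits do not match and are skipped
            first, last = match.group(1).title(), match.group(2).title()
            first_names.append(first)
            grouper.setdefault(last, []).append(first)
    # Keep only last names that collected more than one first name.
    duplicated = [last for last, firsts in grouper.items() if len(firsts) > 1]
    return first_names, duplicated

data = ("Assaf Spanier, Assaf Din, Yo9ssi Levi, Yoram bibe9rman, David levi, "
        "Bibi Netanyahu, Amnon Levi, Ehud sPanier, Barak Spa7nier, Sara Neta4nyahu")
firsts, dupes = split_names(data)
print("\n".join(firsts))
print("========")
print("\n".join(dupes))
```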

Upvotes: 2
