Reputation: 11
I need some help printing the last names that appear more than once in a text file (lower case and upper case should be treated as the same). The program should not print names that contain digits (i.e. if a digit appears in the last name or in the first name, the whole name is ignored).
For example, my text file is:
Assaf Spanier, Assaf Din, Yo9ssi Levi, Yoram bibe9rman, David levi, Bibi Netanyahu, Amnon Levi, Ehud sPanier, Barak Spa7nier, Sara Neta4nyahu
The output should be:
Assaf
Assaf
David
Bibi
Amnon
Ehud
========
Spanier
Levi
import re

def delete_numbers(line):
    words = re.sub(r'\w*\d\w*', '', line).strip()
    for t in re.split(r',', words):
        if len(t.split()) == 1:
            words = re.sub(t, '', words)
    words = re.sub(',,', '', words)
    return words

fname = input("Enter file name: ")
file = open(fname, "r")
for line in file.readlines():
    words = delete_numbers(line)
    first_name = re.findall(r"([a-zA-Z]+)\s", words)
    for i in first_name:
        print(i)
    print("***")
    a = ""
    for t in re.split(r',', words):
        a += (", ".join(t.split()[1:])) + " "
Upvotes: 0
Views: 1202
Reputation: 25789
Fine, since you insist on doing it using regex you should strive to do it in a single call so you don't suffer the penalty of context switches. The best approach would be to write a pattern to capture all first/last names that don't include numbers, separated by a comma, let the regex engine capture them all and then iterate over the matches and, finally, map them to a dictionary so you can split them as a last name => first name map:
import collections
import re

text = "Assaf Spanier, Assaf Din, Yo9ssi Levi, Yoram bibe9rman, David levi, " \
       "Bibi Netanyahu, Amnon Levi, Ehud sPanier, Barak Spa7nier, Sara Neta4nyahu"

full_name = re.compile(r"(?:^|\s|,)([^\d\s]+)\s+([^\d\s]+)(?=$|,)")  # compile the pattern

matches = collections.OrderedDict()  # store for the last=>first name map preserving order
for match in full_name.finditer(text):
    first_name = match.group(1)
    print(first_name)  # print the first name to match your desired output
    last_name = match.group(2).title()  # capitalize the last name for case-insensitivity
    if last_name in matches:  # repeated last name
        matches[last_name].append(first_name)  # add the first name to the map
    else:  # encountering this last name for the first time
        matches[last_name] = [first_name]  # initialize the map for this last name

print("========")  # print the separator...
# finally, print all the repeated last names to match your format
for k, v in matches.items():
    if len(v) > 1:  # print only those with more than one first name attached
        print(k)

And this will give you:
Assaf
Assaf
David
Bibi
Amnon
Ehud
========
Spanier
Levi
In addition, you have the full last name => first names map in matches.
When it comes to the pattern, let's break it down piece by piece:
(?:^|\s|,) - match the beginning of the string, whitespace or a comma (non-capturing)
([^\d\s]+) - followed by one or more characters that are not digits or whitespace (capturing)
\s+ - followed by one or more whitespace characters (non-capturing)
([^\d\s]+) - followed by the same pattern as for the first name (capturing)
(?=$|,) - followed by a comma or the end of the string (look-ahead, non-capturing)
The two captured groups (first and last name) are then referenced in the match object when we iterate over matches. Easy-peasy.
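If the one-line pattern is hard to read, the same breakdown can be written with re.VERBOSE so that each piece carries its own comment. This is only a sketch: the name FULL_NAME and the shortened sample string are made up for illustration, and the look-ahead is written as "end of string or comma", as the breakdown describes.

```python
import re

# The pattern from the breakdown, spelled out with re.VERBOSE.
# FULL_NAME is a hypothetical name chosen for this sketch.
FULL_NAME = re.compile(
    r"""
    (?:^|\s|,)     # start of the string, whitespace, or a comma (not captured)
    ([^\d\s]+)     # first name: one or more non-digit, non-whitespace characters
    \s+            # whitespace between the two names
    ([^\d\s]+)     # last name: same rule as the first name
    (?=$|,)        # must be followed by the end of the string or a comma
    """,
    re.VERBOSE,
)

# Shortened sample; the last entry has no trailing comma on purpose.
sample = "Assaf Spanier, Yo9ssi Levi, David levi, Ehud sPanier"
pairs = [(m.group(1), m.group(2)) for m in FULL_NAME.finditer(sample)]
print(pairs)
# [('Assaf', 'Spanier'), ('David', 'levi'), ('Ehud', 'sPanier')]
```

The entry containing a digit is skipped, and the final entry is picked up through the end-of-string branch of the look-ahead.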
Upvotes: 0
Reputation: 95948
OK, first let's start with an aside: opening files in an idiomatic way. Use the with statement, which guarantees your file will be closed. For small scripts this isn't a big deal, but if you ever start writing longer-lived programs, leaked file handles from files that were never closed can come back to haunt you. Since your file has everything on a single line:
with open(fname) as f:
    data = f.read()
Once the with block ends, the file is closed. This also encourages you to deal with your file immediately, and not leave it open consuming resources unnecessarily. Another aside: suppose you did have multiple lines. Instead of using for line in f.readlines(), use the following construct:
with open(fname) as f:
    for line in f:
        do_stuff(line)
Since you don't actually need to keep the whole file, and only need to inspect each line, don't use readlines(). Only use readlines() if you want to keep a list of lines around, something like lines = f.readlines().
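To make the multi-line case concrete, here is a small sketch that iterates over the file object directly and yields one name entry at a time. The helper name iter_entries and the temporary file contents are invented for the example; they are not part of the question.

```python
import os
import tempfile

def iter_entries(path):
    """Yield each comma-separated entry in the file, stripped of whitespace."""
    with open(path) as f:        # the file is closed when the block exits
        for line in f:           # lazy iteration; no list of lines is built
            for entry in line.split(","):
                entry = entry.strip()
                if entry:        # skip empty pieces such as a trailing comma
                    yield entry

# Throwaway two-line file so the sketch can run anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Assaf Spanier, Assaf Din\nDavid levi, Bibi Netanyahu\n")

entries = list(iter_entries(tmp.name))
os.remove(tmp.name)
print(entries)
# ['Assaf Spanier', 'Assaf Din', 'David levi', 'Bibi Netanyahu']
```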
OK, finally, data will look something like this:
>>> print(data)
Assaf Spanier, Assaf Din, Yo9ssi Levi, Yoram bibe9rman, David levi, Bibi Netanyahu, Amnon Levi, Ehud sPanier, Barak Spa7nier, Sara Neta4nyahu
Ok, so if you want to use regex here, I suggest the following approach:
>>> names_regex = re.compile(r"^(\D+)\s(\D+)$")
The pattern here, ^(\D+)\s(\D+)$, uses the non-digit character class, \D (the opposite of \d, the digit class), and the whitespace class, \s. It also uses the anchors ^ and $ to anchor the pattern to the beginning and end of the text respectively, and the parentheses create capturing groups, which we will leverage. Try copy-pasting this into http://regexr.com/ and play around with it if you still don't understand. One important note: use raw strings, i.e. r"this is a raw string", rather than normal strings, "this is a normal string" (notice the r). This is because Python strings use some of the same escape characters as regex patterns. This will help maintain your sanity. OK, finally, I suggest using the grouping idiom, with a dict:
>>> grouper = {}
Now, our loop:
>>> for fullname in data.split(','):
...     match = names_regex.search(fullname.strip())
...     if match:
...         first, last = match.group(1), match.group(2)
...         grouper.setdefault(last.title(), []).append(first.title())
...
Note, I used the .title method to normalize all our names to "Titlecase". dict.setdefault takes a key as its first argument; if the key doesn't exist, it sets the second argument as the value and returns it, and if the key already exists, it simply returns the existing value. So, I am checking if the last name, in title case, exists in the grouper dict, and if not, setting it to an empty list, [], then appending to whatever is there!
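For comparison, the same grouping idiom can be written with collections.defaultdict, which creates the empty list for you on first access. This is a sketch on a hand-picked list of (first, last) pairs, not a replacement for the loop above; the variable names by_last and by_last_dd are made up.

```python
from collections import defaultdict

# A few (first, last) pairs, as they would come out of the regex groups.
pairs = [("Assaf", "Spanier"), ("David", "levi"), ("Amnon", "Levi"), ("Ehud", "sPanier")]

# Variant 1: dict.setdefault, as in the loop above.
by_last = {}
for first, last in pairs:
    by_last.setdefault(last.title(), []).append(first)

# Variant 2: defaultdict(list) builds the empty list on first access.
by_last_dd = defaultdict(list)
for first, last in pairs:
    by_last_dd[last.title()].append(first)

print(by_last)
# {'Spanier': ['Assaf', 'Ehud'], 'Levi': ['David', 'Amnon']}
print(dict(by_last_dd) == by_last)
# True
```

Both variants build the same mapping; setdefault keeps a plain dict, while defaultdict saves repeating the default at every call site.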
Now pretty-printing for clarity:
>>> from pprint import pprint
>>> pprint(grouper)
{'Din': ['Assaf'],
 'Levi': ['David', 'Amnon'],
 'Netanyahu': ['Bibi'],
 'Spanier': ['Assaf', 'Ehud']}
This is a very useful data-structure. We can, for example, get all last-names with more than a single first name:
>>> for last, firsts in grouper.items():
...     if len(firsts) > 1:
...         print(last)
...
Spanier
Levi
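Before assembling the pieces, it may help to check what the names_regex pattern accepts and rejects on individual entries, and to see the raw-string point in action. The sample entries are picked from the question's data, plus a lone "Din" to show that a single word does not match.

```python
import re

names_regex = re.compile(r"^(\D+)\s(\D+)$")

for entry in ["Assaf Spanier", "Yo9ssi Levi", "Barak Spa7nier", "Din"]:
    match = names_regex.search(entry)
    print(entry, "->", match.groups() if match else None)
# Assaf Spanier -> ('Assaf', 'Spanier')
# Yo9ssi Levi -> None
# Barak Spa7nier -> None
# Din -> None

# Raw vs. normal string: "\b" is one backspace character,
# while r"\b" is the two characters that regex reads as a word boundary.
print(len("\b"), len(r"\b"))
# 1 2
raw_hit = bool(re.search(r"\bLevi\b", "Amnon Levi"))
plain_hit = bool(re.search("\bLevi\b", "Amnon Levi"))
print(raw_hit, plain_hit)
# True False
```

One caveat worth knowing: \D also matches whitespace, so an entry with three words still matches, with everything before the last space landing in the first group.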
So, putting it all together:
>>> grouper = {}
>>> names_regex = re.compile(r"^(\D+)\s(\D+)$")
>>> for fullname in data.split(','):
...     match = names_regex.search(fullname.strip())
...     if match:
...         first, last = match.group(1), match.group(2)
...         first, last = first.title(), last.title()
...         print(first)
...         grouper.setdefault(last, []).append(first)
...
Assaf
Assaf
David
Bibi
Amnon
Ehud
>>> for last, firsts in grouper.items():
...     if len(firsts) > 1:
...         print(last)
...
Spanier
Levi
Note, I have assumed order doesn't matter, so I used a normal dict. My output happens to be in the correct order because on Python 3.6, dicts are ordered! In 3.6 that is an implementation detail of CPython and not a guarantee, so don't rely on it there; from Python 3.7 on, insertion order is part of the language. Use collections.OrderedDict if you want to guarantee order on older versions.
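If you would rather not think about dict ordering at all, one option is to keep plain lists and only use a Counter for the counts. The sketch below wraps the whole task in a function; the names split_names and NAME are invented here, and the input string is the one from the question.

```python
import re
from collections import Counter

NAME = re.compile(r"^(\D+)\s(\D+)$")

def split_names(data):
    """Return (first_names, repeated_last_names) for a comma-separated string."""
    firsts, lasts = [], []
    for entry in data.split(","):
        match = NAME.search(entry.strip())
        if match:                       # entries containing digits do not match
            firsts.append(match.group(1).title())
            lasts.append(match.group(2).title())
    counts = Counter(lasts)
    repeated = []
    for name in lasts:                  # walk in first-seen order
        if counts[name] > 1 and name not in repeated:
            repeated.append(name)
    return firsts, repeated

data = ("Assaf Spanier, Assaf Din, Yo9ssi Levi, Yoram bibe9rman, David levi, "
        "Bibi Netanyahu, Amnon Levi, Ehud sPanier, Barak Spa7nier, Sara Neta4nyahu")
firsts, repeated = split_names(data)
print("\n".join(firsts))
print("========")
print("\n".join(repeated))
```

Because the order comes from the lists, the repeated last names are printed in the order they first appear in the file regardless of Python version.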
Upvotes: 2