Reputation: 11
I need to write a script that finds all of the capitalized words (not words in all caps, just the initial letter) in a text file and presents them in alphabetical order.
I tried to use a regex like this:
re.findall(r'\b[A-Z][a-z]*\b', line)
but my function returns this output:
Enter the file name: bzip2.txt
['A', 'All', 'Altered', 'C', 'If', 'Julian', 'July', 'R', 'Redistribution', 'Redistributions', 'Seward', 'The', 'This']
How can I remove all the single-letter words (e.g. A, C, and R)?
Upvotes: 1
Views: 121
Reputation: 11080
You can do this within the regex itself; there is no need to filter the list afterwards. Just use + instead of *:
re.findall(r'\b[A-Z][a-z]+\b', line)
In regex, * means to match zero or more times, while + means to match one or more times. Hence, your original pattern was allowed to match the lowercase letters zero times, so a lone capital letter counted as a whole word. With the +, at least one lowercase letter must follow the capital. You can learn more about this from this question and its answers.
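To see the difference side by side, here is a small sketch that runs both patterns on the same string. The sample sentence is made up for illustration and is not taken from bzip2.txt.

```python
import re

# Made-up sample text (an assumption, not the asker's file).
sample = "A copy of This file, written by Julian R Seward in C."

# '*' allows zero lowercase letters, so lone capitals match too.
star = sorted(set(re.findall(r'\b[A-Z][a-z]*\b', sample)))
# '+' requires at least one lowercase letter after the capital.
plus = sorted(set(re.findall(r'\b[A-Z][a-z]+\b', sample)))

print(star)  # ['A', 'C', 'Julian', 'R', 'Seward', 'This']
print(plus)  # ['Julian', 'Seward', 'This']
```

Wrapping the matches in set() before sorted() also removes duplicates, which is probably what you want for a word list.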
Also, credit where credit is due: blhsing also pointed this out in the comments of the original question while I was writing this answer.
Upvotes: 5
Reputation: 18816
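Instead of using a regex, you can split the text into words and check each one directly, then call sorted() to get a sorted list (here my_collection is assumed to hold the words from the file):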
>>> alphabet = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
>>> sorted(filter(lambda word: len(word) >= 2 and word[0] in alphabet, my_collection))
['All', 'Altered', 'If', 'Julian', 'July', 'Redistribution', 'Redistributions', 'Seward', 'The', 'This']
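A self-contained version of the same idea might look like the sketch below. The helper name capitalized_words and the sample text are hypothetical, and stripping punctuation with string.punctuation is an assumption about how the words should be cleaned up.

```python
import string

def capitalized_words(text):
    # Hypothetical helper: split on whitespace, strip surrounding
    # punctuation, and keep words of length >= 2 that start with an
    # ASCII capital letter. A set removes duplicates before sorting.
    words = (w.strip(string.punctuation) for w in text.split())
    return sorted({w for w in words
                   if len(w) >= 2 and w[0] in string.ascii_uppercase})

# Made-up sample text (an assumption, not the asker's file).
text = "This file is by Julian R Seward. All rights reserved, If any."
result = capitalized_words(text)
print(result)  # ['All', 'If', 'Julian', 'Seward', 'This']
```

Note that, unlike the regex, this check only looks at the first letter, so a word in all caps would also pass. Adding a condition such as w[1:].islower() is one way to exclude those.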
Upvotes: 0