Shiv Shah

Reputation: 11

Find the words in a file whose first letter is capitalized

I need to write a script that finds all of the capitalized words (not words in all caps, just the initial letter) in a text file and presents them in alphabetical order.

I tried to use a regex like this:

re.findall(r'\b[A-Z][a-z]*\b', line)

but my function returns this output:

Enter the file name: bzip2.txt
['A', 'All', 'Altered', 'C', 'If', 'Julian', 'July', 'R', 'Redistribution', 'Redistributions', 'Seward', 'The', 'This']

How can I remove all the single-letter words (e.g., A, C, and R)?

Upvotes: 1

Views: 121

Answers (2)

Michael M.

Reputation: 11080

You can do this within the regex itself; there is no need to filter the list afterwards. Just use + instead of *:

re.findall(r'\b[A-Z][a-z]+\b', line)

In a regex, * means match zero or more times, while + means match one or more times. Your original pattern therefore allowed the lowercase part to match zero times, so a single capital letter on its own counted as a word. With +, at least one lowercase letter must follow the capital. You can learn more about this from this question and its answers.
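To make the difference concrete, here is a small sketch that runs both patterns on the same string. The sample line is made up for illustration and is not taken from bzip2.txt.

```python
import re

# Hypothetical sample line, only for illustration
line = "A C compiler by Julian Seward, July"

# '*' lets the lowercase part match zero times, so 'A' and 'C' qualify
star = re.findall(r'\b[A-Z][a-z]*\b', line)

# '+' requires at least one lowercase letter after the capital
plus = re.findall(r'\b[A-Z][a-z]+\b', line)

print(star)  # ['A', 'C', 'Julian', 'Seward', 'July']
print(plus)  # ['Julian', 'Seward', 'July']
```

Wrapping the second call in sorted(set(...)) would give the alphabetical, duplicate-free list the question asks for.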

Also, credit where credit is due: blhsing also pointed this out in the comments of the original question while I was writing this answer.

Upvotes: 5

ti7

Reputation: 18816

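Instead of using a regex, split the text into words and check each word directly:

  • it has at least 2 characters
  • its first letter is an uppercase letter

Then call sorted() on the result to get a sorted list (here my_collection is the list of words taken from the file):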

>>> alphabet = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
>>> sorted(filter(lambda word: len(word) >= 2 and word[0] in alphabet, my_collection))
['All', 'Altered', 'If', 'Julian', 'July', 'Redistribution', 'Redistributions', 'Seward', 'The', 'This']
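For completeness, a rough end-to-end sketch of this approach might look like the following. The function name capitalized_words and the sample text are assumptions for illustration. It also goes slightly beyond the snippet above by stripping surrounding punctuation, removing duplicates with a set, and requiring the rest of the word to be lowercase so that all-caps words are skipped, as the question asks.

```python
import string

def capitalized_words(text):
    """Return sorted, unique words of 2+ characters that start with a
    capital letter and are lowercase afterwards (hypothetical helper)."""
    uppercase = set(string.ascii_uppercase)
    # Split on whitespace and strip punctuation stuck to either end of a word
    words = (w.strip(string.punctuation) for w in text.split())
    return sorted({
        w for w in words
        if len(w) >= 2 and w[0] in uppercase and w[1:].islower()
    })

# Made-up sample text, not the contents of bzip2.txt
sample = "This is A test. If Julian Seward, THE author, wrote BZIP in July. All"
print(capitalized_words(sample))
# ['All', 'If', 'Julian', 'July', 'Seward', 'This']
```

To use it on a file, read the file's contents into a string first and pass that string to the function.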

Upvotes: 0
