Reputation: 111
I need to find all the words in a file that start with an uppercase letter. I tried the code below, but it returns an empty result.
import os
import re
matches = []
filename = 'C://Users/Documents/romeo.txt'
with open(filename, 'r') as f:
    for line in f:
        regex = "^[A-Z]\w*$"
        matches.append(re.findall(regex, line))
print(matches)
File:
Hi, How are You?
Expected output:
[Hi,How,You]
Upvotes: 2
Views: 3538
Reputation: 626926
You can use
import os, re
matches = []
filename = r'C:\Users\Documents\romeo.txt'
with open(filename, 'r') as f:
    for line in f:
        matches.extend([x for x in re.findall(r'\w+', line) if x[0].isupper()])
print(matches)
The idea is to extract all words with a simple \w+ regex and add only those that start with an uppercase letter to the final matches list.
See the Python demo.
NOTE: To match only words made up of letters, use the r'\b[^\W\d_]+\b' regex.
This approach is Unicode friendly: any Unicode word whose first letter is uppercase will be found.
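As a rough illustration of both points, here is a minimal sketch that runs on an in-memory string instead of a file. The sample text and the variable names are made up for the example.

```python
import re

# Hypothetical sample text mixing ASCII, Cyrillic and a word with digits.
text = "Hi, Москва and Париж say R2D2"

# \w+ keeps digits and underscores, so "R2D2" counts as a word.
capitalized = [w for w in re.findall(r"\w+", text) if w[0].isupper()]

# [^\W\d_]+ keeps letters only, so "R2D2" is not matched at all.
letters_only = [w for w in re.findall(r"\b[^\W\d_]+\b", text) if w[0].isupper()]

print(capitalized)   # ['Hi', 'Москва', 'Париж', 'R2D2']
print(letters_only)  # ['Hi', 'Москва', 'Париж']
```

The uppercase check is done by str.isupper() rather than by an [A-Z] class, which is why the non-ASCII capitals are picked up.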
You also ask:
Is there a way to limit this to only words that start with an uppercase letter, and not all-uppercase words?
You can extend the previous code to
[x for x in re.findall(r'\w+', line) if x[0].isupper() and not x.isupper()]
See this Python demo: "Hi, How ARE You?" yields ['Hi', 'How', 'You'].
Or, to avoid getting CaMeL words in the output, use
matches.extend([x for x in re.findall(r'\w+', line) if x[0].isupper() and all(i.islower() for i in x[1:])])
See this Python demo, where all(i.islower() for i in x[1:]) makes sure every character after the first one is lowercase.
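To compare the three filters side by side, here is a small sketch on a made-up sentence (the sample text and variable names are assumptions for illustration). It also shows an edge case: a one-letter word such as "I" is dropped by the `not x.isupper()` variant but kept by the `all(...)` variant, because `all()` over an empty sequence is True.

```python
import re

# Hypothetical sample sentence.
text = "I said Hi to McDonald and NASA"
words = re.findall(r"\w+", text)

# 1) Any word whose first character is uppercase.
starts_upper = [x for x in words if x[0].isupper()]

# 2) Same, but skip words that are entirely uppercase.
not_all_caps = [x for x in words if x[0].isupper() and not x.isupper()]

# 3) First character uppercase, every later character lowercase.
title_only = [x for x in words
              if x[0].isupper() and all(i.islower() for i in x[1:])]

print(starts_upper)  # ['I', 'Hi', 'McDonald', 'NASA']
print(not_all_caps)  # ['Hi', 'McDonald']
print(title_only)    # ['I', 'Hi']
```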
Fully regex approach
You can use the PyPI regex module, which supports both Unicode property classes (\p{Lu} / \p{Ll}) and POSIX character classes ([:upper:] / [:lower:]). So, the solution will look like
import regex
text = "Hi, How ARE You?"
# Word starting with an uppercase letter:
print( regex.findall(r'\b\p{Lu}\p{L}*\b', text) )
## => ['Hi', 'How', 'ARE', 'You']
# Word starting with an uppercase letter but not ALLCAPS:
print( regex.findall(r'\b\p{Lu}\p{Ll}*\b', text) )
## => ['Hi', 'How', 'You']
See the Python demo online, where
\b - a word boundary
\p{Lu} - any uppercase letter
\p{L}* - any zero or more letters
\p{Ll}* - any zero or more lowercase letters
Upvotes: 4
Reputation: 163362
You can use a word boundary instead of the anchors ^ and $:
\b[A-Z]\w*
Note that matches.append adds a single item to the list, and since re.findall returns a list, you end up with a list of lists.
import re
matches = []
regex = r"\b[A-Z]\w*"
filename = r'C:\Users\Documents\romeo.txt'
with open(filename, 'r') as f:
    for line in f:
        matches += re.findall(regex, line)
print(matches)
Output
['Hi', 'How', 'You']
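To see both effects without a file, here is a minimal sketch on the sample line from the question (the variable names are only for illustration). It compares the anchored pattern with the word-boundary one, and append with extend.

```python
import re

line = "Hi, How are You?"

# ^ and $ require the whole line to be one capitalized word, so nothing matches.
anchored = re.findall(r"^[A-Z]\w*$", line)

# \b only requires a word boundary before the uppercase letter.
bounded = re.findall(r"\b[A-Z]\w*", line)

nested = []
nested.append(bounded)  # adds one item, which is itself a list

flat = []
flat.extend(bounded)    # same effect as flat += bounded

print(anchored)  # []
print(bounded)   # ['Hi', 'How', 'You']
print(nested)    # [['Hi', 'How', 'You']]
print(flat)      # ['Hi', 'How', 'You']
```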
If there should be a whitespace boundary to the left, you could also use
(?<!\S)[A-Z]\w*
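The difference between the two boundaries shows up when a capital letter follows punctuation. A small sketch, with a made-up sample string:

```python
import re

# Hypothetical sample: capitals after "-" and "." as well as after spaces.
text = "pre-Release Hello x.Foo Bar"

# \b matches between any non-word char and a word char.
word_boundary = re.findall(r"\b[A-Z]\w*", text)

# (?<!\S) only allows whitespace or the start of the string on the left.
whitespace_boundary = re.findall(r"(?<!\S)[A-Z]\w*", text)

print(word_boundary)        # ['Release', 'Hello', 'Foo', 'Bar']
print(whitespace_boundary)  # ['Hello', 'Bar']
```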
If you don't want to match words that consist only of uppercase chars, you could use, for example, a negative lookahead that asserts the word is not made up of only uppercase chars up to the next word boundary:
\b[A-Z](?![A-Z]*\b)\w*
\b - A word boundary to prevent a partial match
[A-Z] - Match an uppercase char A-Z
(?![A-Z]*\b) - Negative lookahead, assert not only uppercase chars followed by a word boundary
\w* - Match optional word chars
To match a word that starts with an uppercase char, and does not contain any more uppercase chars:
\b[A-Z][^\WA-Z]*\b
\b - A word boundary
[A-Z] - Match an uppercase char A-Z
[^\WA-Z]* - Optionally match a word char without chars A-Z
\b - A word boundary
Upvotes: 4
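As a quick comparison of the last two patterns in this answer, here is a sketch on a made-up string (sample text and names are assumptions). Note how they differ on a camel-case word and on a single uppercase letter.

```python
import re

# Hypothetical sample text.
text = "Hi HELLO McDonald A You"

# Starts with A-Z, but is not made up of only A-Z chars.
not_all_upper = re.findall(r"\b[A-Z](?![A-Z]*\b)\w*", text)

# Starts with A-Z and contains no further A-Z chars.
single_upper = re.findall(r"\b[A-Z][^\WA-Z]*\b", text)

print(not_all_upper)  # ['Hi', 'McDonald', 'You']
print(single_upper)   # ['Hi', 'A', 'You']
```

The lookahead version rejects the lone "A" because it counts as an all-uppercase word, while the second pattern accepts it and rejects "McDonald" instead.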