Reputation: 1089
I have about 15,000 files I need to parse which could contain one or more strings/numbers from a list I have. I need to separate the files with matching strings.
Given a string such as 3423423987, it could appear on its own as "3423423987", or with a suffix as "3423423987_1", "3423423987_1a" or "3423423987-1a", but it could also be embedded in a longer number such as "2133423423987". I only want to detect the sequence when it is not part of another number, that is, when it stands alone or is followed by a suffix of some sort.
So 3423423987_1 is acceptable, but 13423423987 is not.
I'm having trouble with regex, haven't used it much to be honest.
Simply speaking, if I simulate this with a list of possible positives and negatives, I should get 7 hits, for the given list. I would like to extract the text till the end of the word, so that I can record that later.
Here's my code:
import re

def check_text_for_string(text_to_parse, string_to_find):
    pattern = r"%s_?[^0-9,a-z,A-Z]\W" % string_to_find
    return re.findall(pattern, text_to_parse)

if __name__ == "__main__":
    word_to_match = "3423423987"
    possible_word_list = [
        "3423423987_1 the cake is a lie",  # Match
        "3423423987sdgg call me Ishmael",  # Not a match
        "3423423987 please sir, can I have some more?",  # Match
        "3423423987",  # Match
        "3423423987 ",  # Match
        "3423423987\t",  # Match
        "adsgsdzgxdzg adsgsdag\t3423423987\t",  # Match
        "1233423423987",  # Not a match
        "A3423423987",  # Not a match
        "3423423987-1a\t",  # Match
        "3423423987.0",  # Not a match
        "342342398743635645",  # Not a match
    ]
    print("%d words in sample list." % len(possible_word_list))
    print("Only 7 should match.")
    matches = check_text_for_string("\n".join(possible_word_list), word_to_match)
    print("%d matched." % len(matches))
    print(matches)
But clearly, this is wrong. Could someone help me out here?
Upvotes: 3
Views: 582
Reputation: 626845
It seems you just want to make sure the number is not matched as part of a, say, float number. You then need to use lookarounds, a lookbehind and a lookahead to disallow dots with digits before and after.
(?<!\d\.)(?:\b|_)3423423987(?:\b|_)(?!\.\d)
See the regex demo
To also match the "prefixes" (or, better call them "suffixes" here), you need to add something like \S* (zero or more non-whitespace characters) or (?:[_-]\w+)? (an optional sequence of a - or _ followed by 1+ word chars) at the end of the pattern.
Details:
(?<!\d\.) - fail the match if we have a digit and a dot before the current position
(?:\b|_) - either a word boundary or a _ (we need it as _ is a word char)
3423423987 - the search string
(?:\b|_) - same as above
(?!\.\d) - fail the match if a dot + digit is right after the current position.
So, use
pattern = r"(?<!\d\.)(?:\b|_)%s(?:\b|_)(?!\.\d)"%string_to_find
See the Python demo
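As a rough check, here is a minimal sketch that runs the pattern above against the sample list from the question. The helper name find_matches and the use of re.escape are my own additions for illustration, not part of the original answer. \S* is appended so that the suffix is returned along with the number.

```python
import re

def find_matches(text, number):
    # re.escape is an added precaution (an assumption, not in the answer)
    # in case the search string ever contains regex metacharacters.
    pattern = r"(?<!\d\.)(?:\b|_)%s(?:\b|_)(?!\.\d)\S*" % re.escape(number)
    return re.findall(pattern, text)

samples = [
    "3423423987_1 the cake is a lie",
    "3423423987sdgg call me Ishmael",
    "3423423987 please sir, can I have some more?",
    "3423423987",
    "3423423987 ",
    "3423423987\t",
    "adsgsdzgxdzg adsgsdag\t3423423987\t",
    "1233423423987",
    "A3423423987",
    "3423423987-1a\t",
    "3423423987.0",
    "342342398743635645",
]

matches = find_matches("\n".join(samples), "3423423987")
print("%d matched." % len(matches))
print(matches)
```

\S* is used here rather than (?:[_-]\w+)? because the trailing (?:\b|_) has already consumed the underscore in "3423423987_1", so the optional group would find no - or _ left to start from and the match would end at "3423423987_". Note that \S* also picks up trailing punctuation such as a comma, which may need trimming depending on the files.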
If there can be floats like "Text with .3423423987 float value", you will need to also add another lookbehind, (?<!\.), after the first one: (?<!\d\.)(?<!\.)(?:\b|_)3423423987(?:\b|_)(?!\.\d)
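To illustrate the difference the extra lookbehind makes, here is a small sketch comparing the two patterns on that example sentence. The variable names base and strict are hypothetical, chosen only for this illustration.

```python
import re

number = "3423423987"
# Pattern without the extra lookbehind.
base = r"(?<!\d\.)(?:\b|_)%s(?:\b|_)(?!\.\d)" % number
# Pattern with (?<!\.) added after the first lookbehind.
strict = r"(?<!\d\.)(?<!\.)(?:\b|_)%s(?:\b|_)(?!\.\d)" % number

text = "Text with .3423423987 float value"
print(re.findall(base, text))    # the number after a bare dot is still found
print(re.findall(strict, text))  # the extra lookbehind rejects it
```

The first lookbehind only rejects a dot that is itself preceded by a digit, which is why a leading ".3423423987" slips through without the second one.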
Upvotes: 3
Reputation: 43166
You can use this pattern:
(?:\b|^)3423423987(?!\.)(?=\b|_|$)
(?:\b|^) asserts that there is no word character (letter, digit or underscore) directly to the left, so the number cannot be the tail of a longer number
(?!\.) asserts the number isn't followed by a dot
(?=\b|_|$) asserts the number is followed by a non-word character, an underscore or nothing
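A minimal sketch of how this pattern might be used from Python follows. The function name find_ids, the re.escape call and the trailing \S* (to return the suffix as the question asks) are assumptions added for illustration and are not part of the pattern as posted.

```python
import re

def find_ids(text, number):
    # The lookahead (?=\b|_|$) does not consume anything, so \S* (an addition
    # to the posted pattern) can pick up a suffix such as "_1" or "-1a".
    pattern = r"(?:\b|^)%s(?!\.)(?=\b|_|$)\S*" % re.escape(number)
    return re.findall(pattern, text)

print(find_ids("3423423987_1 the cake is a lie", "3423423987"))
print(find_ids("3423423987-1a\t", "3423423987"))
print(find_ids("1233423423987", "3423423987"))
print(find_ids("3423423987.0", "3423423987"))
```

Because (?!\.) rejects any following dot, a number that ends a sentence ("... 3423423987.") would also be skipped. Whether that matters depends on the files being parsed.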
Upvotes: 1