stonecharioteer

Reputation: 1089

Matching a number in a file with Python

I have about 15,000 files I need to parse which could contain one or more strings/numbers from a list I have. I need to separate the files with matching strings.

Given a string such as 3423423987, it could appear on its own as "3423423987", or with a suffix as "3423423987_1", "3423423987_1a" or "3423423987-1a", but it could also be buried in a longer number such as "2133423423987". I only want to detect the sequence when it is not part of another number, that is, when it stands alone or is followed by a suffix of some sort.

So 3423423987_1 is acceptable, but 13423423987 is not.

I'm having trouble with regex, haven't used it much to be honest.

Put simply, if I simulate this with the list of possible positives and negatives below, I should get 7 hits. I would also like to extract the text up to the end of the word, so that I can record it later.

Here's my code:

def check_text_for_string(text_to_parse, string_to_find):
    import re
    matches = []
    pattern = r"%s_?[^0-9,a-z,A-Z]\W"%string_to_find
    return re.findall(pattern, text_to_parse)

if __name__ =="__main__":
    import re
    word_to_match = "3423423987"
    possible_word_list = [
                    "3423423987_1 the cake is a lie", #Match
                    "3423423987sdgg call me Ishmael",  #Not a match
                    "3423423987 please sir, can I have some more?", #Match
                    "3423423987", #Match
                    "3423423987 ", #Match
                    "3423423987\t", #Match
                    "adsgsdzgxdzg adsgsdag\t3423423987\t", #Match
                    "1233423423987", #Not a match
                    "A3423423987", #Not a match
                    "3423423987-1a\t", #Match
                    "3423423987.0", #Not a match
                    "342342398743635645" #Not a match
                    ]

    print("%d words in sample list."%len(possible_word_list))
    print("Only 7 should match.")
    matches = check_text_for_string("\n".join(possible_word_list), word_to_match)
    print("%d matched."%len(matches))
    print(matches)

But clearly, this is wrong. Could someone help me out here?

Upvotes: 3

Views: 582

Answers (2)

Wiktor Stribiżew

Reputation: 626845

It seems you just want to make sure the number is not matched as part of, say, a float. For that you need lookarounds: a lookbehind that disallows a digit plus a dot before the number, and a lookahead that disallows a dot plus a digit after it.

(?<!\d\.)(?:\b|_)3423423987(?:\b|_)(?!\.\d)

See the regex demo

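To also match the "suffixes", you need to add something like \S* (zero or more non-whitespace characters) or (?:[_-]\w+)? (an optional sequence of a - or _ followed by one or more word characters) at the end of the pattern. Note that the trailing (?:\b|_) has already consumed an underscore by that point, so \S* is the simpler choice if you want 3423423987_1 captured in full.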

Details:

  • (?<!\d\.) - fail the match if we have a digit and a dot before the current position
  • (?:\b|_) - either a word boundary or a _ (we need it as _ is a word char)
  • 3423423987 - the search string
  • (?:\b|_) - same as above: either a word boundary or a _
  • (?!\.\d) - fail the match if a dot + digit is right after the current position.

So, use

pattern = r"(?<!\d\.)(?:\b|_)%s(?:\b|_)(?!\.\d)"%string_to_find

See the Python demo
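As a rough sketch of how this could be wired into the question's test, the snippet below appends \S* to the pattern so the suffix is returned along with the number. The function name find_number_tokens and the use of re.escape are my own additions for illustration, not part of the original code.

```python
import re


def find_number_tokens(text, number):
    """Return each standalone occurrence of `number`, plus any suffix up to the next whitespace."""
    # re.escape is a precaution (an assumption on my part) in case the
    # search string ever contains regex metacharacters.
    pattern = r"(?<!\d\.)(?:\b|_)%s(?:\b|_)(?!\.\d)\S*" % re.escape(number)
    return re.findall(pattern, text)


# Sample list copied from the question.
samples = [
    "3423423987_1 the cake is a lie",                # match
    "3423423987sdgg call me Ishmael",                # not a match
    "3423423987 please sir, can I have some more?",  # match
    "3423423987",                                    # match
    "3423423987 ",                                   # match
    "3423423987\t",                                  # match
    "adsgsdzgxdzg adsgsdag\t3423423987\t",           # match
    "1233423423987",                                 # not a match
    "A3423423987",                                   # not a match
    "3423423987-1a\t",                               # match
    "3423423987.0",                                  # not a match
    "342342398743635645",                            # not a match
]

matches = find_number_tokens("\n".join(samples), "3423423987")
print("%d matched." % len(matches))
print(matches)
```

The final lookahead sits before \S* on purpose: if \S* came first, it could swallow ".0" and the float check would never see it.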

If the text can contain floats without a leading digit, as in "Text with .3423423987 float value", you will also need to add another lookbehind, (?<!\.), after the first one: (?<!\d\.)(?<!\.)(?:\b|_)3423423987(?:\b|_)(?!\.\d)
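To illustrate the difference the extra lookbehind makes, here is a small comparison of the two patterns on a made-up sentence. The names base, strict and text are only for this example.

```python
import re

NUMBER = "3423423987"

# Pattern without the extra lookbehind: only rejects "<digit>." before the number.
base = re.compile(r"(?<!\d\.)(?:\b|_)%s(?:\b|_)(?!\.\d)" % NUMBER)

# Pattern with the extra (?<!\.): also rejects a bare "." before the number.
strict = re.compile(r"(?<!\d\.)(?<!\.)(?:\b|_)%s(?:\b|_)(?!\.\d)" % NUMBER)

# Hypothetical input: one fractional occurrence, one standalone occurrence.
text = "Text with .3423423987 float value, then 3423423987 alone"

base_hits = base.findall(text)
strict_hits = strict.findall(text)

print(len(base_hits), len(strict_hits))
```

The first pattern accepts both occurrences, while the stricter one keeps only the standalone number.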

Upvotes: 3

Aran-Fey

Reputation: 43166

You can use this pattern:

(?:\b|^)3423423987(?!\.)(?=\b|_|$)

(?:\b|^) asserts that there is no digit (or any other word character) directly to the left

(?!\.) asserts the number isn't followed by a dot

(?=\b|_|$) asserts the number is followed by a non-word character, an underscore or nothing
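Since the goal in the question is to separate the texts that contain the number, a minimal sketch of using this pattern as a filter might look like the following. The names pattern, samples and matching are illustrative only, and the sample list is the one from the question.

```python
import re

# The pattern from this answer, compiled once and reused for every text.
pattern = re.compile(r"(?:\b|^)3423423987(?!\.)(?=\b|_|$)")

# Sample list copied from the question.
samples = [
    "3423423987_1 the cake is a lie",
    "3423423987sdgg call me Ishmael",
    "3423423987 please sir, can I have some more?",
    "3423423987",
    "3423423987 ",
    "3423423987\t",
    "adsgsdzgxdzg adsgsdag\t3423423987\t",
    "1233423423987",
    "A3423423987",
    "3423423987-1a\t",
    "3423423987.0",
    "342342398743635645",
]

# Keep only the texts in which the number appears on its own or with a suffix.
matching = [s for s in samples if pattern.search(s)]

print("%d of %d texts matched." % (len(matching), len(samples)))
```

Because the trailing check is a lookahead, the match itself contains only the number. Capturing the suffix as well would need something like \S* appended to the pattern.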

Upvotes: 1
