Andrius

Reputation: 31

Regex in Python. NOT matches

I'll get straight to it: I have a string like this (but with thousands of lines):

Ach-emos_2
Ach. emos_54
Achėmos_18
Ąžuolas_4
Somtehing else_2

and I need to remove the lines that do not match a-z and ąčęėįšųūž, plus _, plus any integer (the 3rd and 4th lines match this). This should be case insensitive. I think the regex should be

[a-ząčęėįšųūž]+_\d+ #don't know where to put case insensitive modifier

But what should a regex look like that matches lines that are NOT letters (including Lithuanian letters) plus underscore plus integer? I tried

re.sub(r'[^a-ząčęėįšųūž]+_\d+\n', '', words)

but it didn't work.

Thanks in advance, and sorry if my English is not very good.

Upvotes: 2

Views: 16397

Answers (3)

user557597

Reputation:

Not sure how Python does modifiers, but to edit in place, use something like this (case insensitive):

Edit: note that some of these characters are non-ASCII. To use them literally, your editor and language must both support the encoding; otherwise use the \u.. escape codes in the character class (recommended).
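For the \u.. route, here is a minimal sketch assuming Python 3, where the re module understands \uXXXX escapes inside a pattern. The names LT_ESCAPED and line_re are made up for illustration, and the code points are my reading of ą č ę ė į š ų ū ž, so double-check them against your own data:

```python
import re

# Lithuanian letters written as \u escapes (assumed code points for
# ą č ę ė į š ų ū ž), so the pattern itself contains only ASCII.
LT_ESCAPED = r'\u0105\u010d\u0119\u0117\u012f\u0161\u0173\u016b\u017e'

# Letters, underscore, integer; IGNORECASE covers the uppercase forms too.
line_re = re.compile(r'^[a-z' + LT_ESCAPED + r']+_\d+$', re.IGNORECASE)

samples = ("Achėmos_18", "Ąžuolas_4", "Ach-emos_2")
matches = [bool(line_re.match(s)) for s in samples]
print(matches)
```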

s/(?i)^(?![a-ząčęėįšųūž]+_\d+(?:\n|$)).*(?:\n|$)//mg;

where the regex is r'(?i)^(?![a-ząčęėįšųūž]+_\d+(?:\n|$)).*(?:\n|$)',
the replacement is '',
and the modifiers are multiline and global.

Breakdown:

(?i)                              // case insensitive flag
^                                 // start of line
(?![a-ząčęėįšųūž]+_\d+(?:\n|$))   // look ahead, not this form of a line ?
.*                                // ok then select all except newline or eos
(?:\n|$)                          // select newline or end of string
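In Python, the same substitution could probably be written with re.sub, which already replaces every occurrence (so "global" is the default), with the other modifiers passed as flags. This is a minimal sketch assuming Python 3 and the sample lines from the question; the names NOT_WORD_LINE and remove_nonmatching are made up for illustration:

```python
import re

# At each line start, the negative lookahead rejects lines of the form
# letters + underscore + digits; everything else is matched together
# with its trailing newline (or the end of the string).
NOT_WORD_LINE = re.compile(
    r'^(?![a-ząčęėįšųūž]+_\d+(?:\n|$)).*(?:\n|$)',
    re.IGNORECASE | re.MULTILINE,
)

def remove_nonmatching(text):
    """Delete every line that does not have the letters_integer form."""
    return NOT_WORD_LINE.sub('', text)

words = "Ach-emos_2\nAch. emos_54\nAchėmos_18\nĄžuolas_4\nSomtehing else_2"
cleaned = remove_nonmatching(words)
print(cleaned)
```

Note that a kept line retains its own newline, so the result may end with a trailing "\n" when the last input line is removed.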

Upvotes: 0

kojiro

Reputation: 77089

First of all, given your example inputs, every line ends with underscore + integers, so all you really need to do is invert the original match. If the example wasn't really representative, then inverting the match could land you results like this:

abcdefg_nodigitshere

But you can subfilter that this way:

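import re
mydigre = re.compile(r'_\d+$')
myreg = re.compile(r'^[a-ząčęėįšųūž]+_\d+$', re.I)

for line in inputs.splitlines():
    if myreg.match(line):
        pass  # do x: letters + underscore + integer
    elif mydigre.search(line):  # search, not match: the pattern is only anchored at the end
        pass  # do y: ends with _\d+ but has other characters before it
    else:
        pass  # line doesn't end with _\d+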

Another option would be to use Python sets. This approach only makes sense if all your lines are unique (or if you don't mind eliminating duplicate lines) and you don't care about order. It probably has a high memory cost, too, but is likely to be fast.

all_lines = set(inputs.splitlines())
alpha_lines = set(line for line in all_lines if myreg.match(line))
nonalpha_lines = all_lines - alpha_lines
# search, not match: mydigre is only anchored at the end of the line
nonalpha_digi_lines = set(line for line in nonalpha_lines if mydigre.search(line))
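If order or duplicates do matter, the same three-way split can be kept in plain lists instead of sets. This is only a sketch under the same assumptions as above (Python 3, the two patterns from this answer); the function name partition and the sample input are made up for illustration:

```python
import re

myreg = re.compile(r'^[a-ząčęėįšųūž]+_\d+$', re.I)
mydigre = re.compile(r'_\d+$')

def partition(text):
    """Split lines into (alpha, nonalpha_digi, other), keeping order and duplicates."""
    alpha, nonalpha_digi, other = [], [], []
    for line in text.splitlines():
        if myreg.match(line):
            alpha.append(line)          # letters + underscore + integer
        elif mydigre.search(line):
            nonalpha_digi.append(line)  # ends with _\d+, other characters before it
        else:
            other.append(line)          # does not end with _\d+
    return alpha, nonalpha_digi, other

inputs = "Ąžuolas_4\nAch-emos_2\nĄžuolas_4\nabcdefg_nodigitshere"
alpha, nonalpha_digi, other = partition(inputs)
print(alpha, nonalpha_digi, other)
```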

Upvotes: 0

Sven Marnach

Reputation: 601529

As to making the matching case insensitive, you can use the I or IGNORECASE flags from the re module, for example when compiling your regex:

regex = re.compile(r"^[a-ząčęėįšųūž]+_\d+$", re.I)

As to removing the lines not matching this regex, you can simply construct a new string consisting of the lines that do match:

new_s = "\n".join(line for line in s.split("\n") if re.match(regex, line))
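Put together with the sample lines from the question, this could look roughly like the sketch below. It assumes Python 3, where str patterns are Unicode and re.I also folds case for letters such as Ą; on Python 2 you would likely need unicode strings and the re.U flag as well.

```python
import re

# Letters (including Lithuanian ones), underscore, integer; case insensitive.
regex = re.compile(r"^[a-ząčęėįšųūž]+_\d+$", re.I)

s = "Ach-emos_2\nAch. emos_54\nAchėmos_18\nĄžuolas_4\nSomtehing else_2"

# Keep only the lines that match, then glue them back together.
new_s = "\n".join(line for line in s.split("\n") if regex.match(line))
print(new_s)
```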

Upvotes: 5
