JakeIC

Reputation: 327

Python - Regex - Match anything except

I'm trying to get my regular expression to work but can't figure out what I'm doing wrong. I am trying to find any file that is NOT in a specific format. For example, all the files are named after dates in the format MM-DD-YY.pdf (e.g. 05-13-17.pdf). I want to find any files whose names are not written in that format.

I can write a regex that finds the correctly named files:

(\d\d-\d\d-\d\d\.pdf)

I tried using a negative lookahead, so it looked like this:

(?!\d\d-\d\d-\d\d\.pdf)

That stops matching those files, but it doesn't match the files that are named differently either.

I also tried adding .* after the group, but then it matches the whole list:

(?!\d\d-\d\d-\d\d\.pdf).*

I'm searching through a small list right now for testing:

05-17-17.pdf Test.pdf 05-48-2017.pdf 03-14-17.pdf

Is there a way to accomplish what I'm looking for?

Thanks!

Upvotes: 1

Views: 555

Answers (2)

HoofarLotusX

Reputation: 572

First find all the names that match, then remove them from your list in a separate step. The firstFindtheMatching function finds the matching names using the re library:

def firstFindtheMatching(listoffiles):
    """
    :listoffiles: space-separated string of file names to check against the format
    :final_string: the names that match the format 01-01-17.pdf (MM-DD-YY.pdf), joined into one str. listoffiles is also returned so the whole output can be seen in one place, but you won't really need that.
    """
    import re
    # raw string so the backslashes reach the regex engine unchanged
    matchednames = re.findall(r"\d{1,2}-\d{1,2}-\d{1,2}\.pdf", listoffiles)
    # join all matches into one string for simpler handling with sets
    final_string = ' '.join(matchednames)
    return (final_string, listoffiles)

Here is the output:

('05-08-17.pdf 04-08-17.pdf 08-09-16.pdf', '05-08-17.pdf Test.pdf 04-08-17.pdf 08-09-16.pdf 08-09-2016.pdf some-all-letters.pdf')
set(['08-09-2016.pdf', 'some-all-letters.pdf', 'Test.pdf'])

I've used the main below if you'd like to reproduce the results. The good thing about doing it this way is that you can add more regexes to firstFindtheMatching(), which helps keep things separate.

def main():
    filenames = "05-08-17.pdf Test.pdf 04-08-17.pdf 08-09-16.pdf 08-09-2016.pdf some-all-letters.pdf"
    matchednames, alllist = firstFindtheMatching(filenames)
    print(matchednames, alllist)
    # names present in the input but not among the matches
    notcommon = set(filenames.split()) - set(matchednames.split())
    print(notcommon)


if __name__ == '__main__':
    main()
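
The same "match, then exclude" idea can also be written more compactly. This is only a minimal sketch: the helper name non_matching and the strict two-digit pattern are my own assumptions, taken from the MM-DD-YY.pdf format described in the question. It splits the string into names and keeps the ones that re.fullmatch rejects, so the original order is preserved.

```python
import re

# Assumed strict pattern for MM-DD-YY.pdf (exactly two digits per field).
DATE_PDF = re.compile(r"\d\d-\d\d-\d\d\.pdf")


def non_matching(filenames):
    """Return the whitespace-separated names that are not exactly MM-DD-YY.pdf."""
    # fullmatch only succeeds when the whole name fits the pattern
    return [name for name in filenames.split() if not DATE_PDF.fullmatch(name)]


names = "05-08-17.pdf Test.pdf 04-08-17.pdf 08-09-16.pdf 08-09-2016.pdf some-all-letters.pdf"
print(non_matching(names))
# ['Test.pdf', '08-09-2016.pdf', 'some-all-letters.pdf']
```

Because each name is checked as a whole, a four-digit year such as 08-09-2016.pdf is reported as not matching.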

Upvotes: 0

Ajax1234

Reputation: 71471

You can try this, which matches names made up only of letters, or dates written with a four-digit year:

import re
s = "Test.docx 04-05-2017.docx 04-04-17.pdf secondtest.pdf"

# raw string so the backslashes reach the regex engine unchanged
new_data = re.findall(r"[a-zA-Z]+\.[a-zA-Z]+|\d{1,}-\d{1,}-\d{4}\.[a-zA-Z]+", s)
print(new_data)

Output:

['Test.docx', '04-05-2017.docx', 'secondtest.pdf']
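
If a single pattern that excludes the date format is preferred, the negative lookahead from the question can be made to work by anchoring it. The unanchored attempt (?!\d\d-\d\d-\d\d\.pdf).* fails at position 0, but the engine simply retries one character later, where the lookahead passes and .* swallows the rest of the line. The sketch below is one possible fix, not the only one; the name NOT_DATE_PDF and the assumption that names are separated by whitespace are mine.

```python
import re

# Assumed rule: keep a name unless it is exactly MM-DD-YY.pdf.
# (?<!\S)  only start a match at the beginning of a name
# (?!...)  reject the name if the date format follows and the name ends there
# \S+      otherwise consume the whole name
NOT_DATE_PDF = re.compile(r"(?<!\S)(?!\d\d-\d\d-\d\d\.pdf(?!\S))\S+")

s = "05-17-17.pdf Test.pdf 05-48-2017.pdf 03-14-17.pdf"
print(NOT_DATE_PDF.findall(s))
# ['Test.pdf', '05-48-2017.pdf']
```

Unlike the whitelist pattern above, this does not need to list every shape a "wrong" name might take; anything that is not exactly the date format is returned.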

Upvotes: 1
