nonamelive

Reputation: 6510

RegExp in Python

This is an example that searches for PDF files in the current directory.

import os, os.path
import re

def print_pdf(arg, dir, files):
    for file in files:
        path = os.path.join(dir, file)
        path = os.path.normcase(path)
        if re.search(r".*\.pdf", path):
            print path

os.path.walk('.', print_pdf, 0)

Could anyone explain what r".*\.pdf" means here?

Why ".*\"?

Thanks!

Upvotes: 0

Views: 428

Answers (6)

ghostdog74

Reputation: 342443

Use os.walk() instead; there's no need for a regex here.

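import os

path = "."  # directory to search; assumed to be the current directory, as in the question
for root, dirs, files in os.walk(path):
    for name in files:
        # Lower-case the name so "A.PDF" is found as well
        if name.lower().endswith(".pdf"):
            print("found pdf: " + os.path.join(root, name))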

Upvotes: 0

SilentGhost

Reputation: 319621

It means any character zero or more times, followed by a literal dot and the letters pdf. Because the asterisk is greedy, the match extends to the last '.pdf' in the subject string; with re.search and no trailing $, though, the string itself is not required to end there.
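A minimal sketch for testing how the greedy match behaves with re.search. The file names are invented for illustration, and the anchored variant with $ is an addition that does not appear in the question's code.

```python
import re

pattern = re.compile(r".*\.pdf")

# Greedy .* first grabs everything, then backtracks to the last ".pdf".
matched = pattern.search("a.pdf.b.pdf.txt").group(0)

# Without an anchor, a name that merely contains ".pdf" still matches.
contains_only = bool(pattern.search("report.pdf.bak"))

# Adding $ requires ".pdf" to be the final characters of the string.
anchored = re.compile(r".*\.pdf$")
ends_with = bool(anchored.search("report.pdf"))
rejected = anchored.search("report.pdf.bak") is None

print(matched, contains_only, ends_with, rejected)
```

If only names that really end in .pdf are wanted, the anchored form (or str.endswith) is probably the safer choice.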

The glob module is the right tool for this:

>>> glob.glob(os.path.join(dirname, '*.pdf'))
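For a runnable version of the glob call, here is a rough sketch that builds a throwaway directory first; the file names and the `sub` folder are made up. Note that a plain `*.pdf` pattern only looks in one directory, whereas the walk in the question descends into subdirectories; the recursive `**` form shown second assumes Python 3.5 or later.

```python
import glob
import os
import tempfile

# Build a temporary directory tree so the example is self-contained.
with tempfile.TemporaryDirectory() as dirname:
    for name in ("a.pdf", "b.txt"):
        open(os.path.join(dirname, name), "w").close()
    os.mkdir(os.path.join(dirname, "sub"))
    open(os.path.join(dirname, "sub", "c.pdf"), "w").close()

    # Non-recursive: only PDFs directly inside dirname.
    top_level = sorted(
        os.path.basename(p) for p in glob.glob(os.path.join(dirname, "*.pdf"))
    )

    # Recursive: "**" with recursive=True also descends into subdirectories.
    everywhere = sorted(
        os.path.relpath(p, dirname)
        for p in glob.glob(os.path.join(dirname, "**", "*.pdf"), recursive=True)
    )

print(top_level)
print(everywhere)
```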

Upvotes: 8

S.Hoekstra

Reputation: 326

The period (.)
matches any character except a newline.

The following asterisk (*)
means zero or more repetitions of the preceding period.

The backslash (\)
escapes the period in .pdf, so it matches a real period: only ".pdf", not "any character" followed by "pdf".

So in the end it matches any run of text that ends in .pdf.
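The token-by-token reading above can be written into the pattern itself with re.VERBOSE, which lets each piece carry a comment. This is only an illustrative sketch; the sample names are invented.

```python
import re

# The same pattern as in the question, spelled out one token per line.
pdf_pattern = re.compile(r"""
    .*      # any character except a newline, zero or more times
    \.      # a literal period
    pdf     # the letters 'pdf'
""", re.VERBOSE)

names = ["notes.pdf", "notes.txt", "./docs/a.pdf", "pdf"]
hits = [n for n in names if pdf_pattern.search(n)]
print(hits)
```

The bare string "pdf" is rejected because the escaped period has nothing to match.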

Upvotes: 0

Konrad Rudolph

Reputation: 545618

Why ".*\"?

Wrong question, you missed a crucial character of the expression. ;-)

In fact, .* matches any character (. in regex) as many times as possible (* in regex; it applies to the preceding token, which is . in this case).

\., on the other hand, matches exactly one dot (.). The \ escapes the following character (.) so that it no longer has its special meaning (in this case, “match any character”) and is treated literally instead.
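To see what the escape changes, compare the pattern with and without the backslash. This is a small sketch and the test strings are made up.

```python
import re

# Unescaped: "." stands for any character, so "xpdf" is accepted.
unescaped = bool(re.search(r".pdf", "my_xpdf_notes"))

# Escaped: "\." must be a real dot, so the same string is rejected.
escaped = bool(re.search(r"\.pdf", "my_xpdf_notes"))

# A string with a real ".pdf" satisfies the escaped pattern.
real_dot = bool(re.search(r"\.pdf", "paper.pdf"))

print(unescaped, escaped, real_dot)
```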

Upvotes: 3

Aaron

Reputation: 1072

This searches for a string containing zero or more chars followed by ".pdf". The .* is a common idiom in regexps and means "match any char 0 or more times". The \. is there because in regexps the . has a special meaning, and the \ escapes that.

Upvotes: 1

Ignacio Vazquez-Abrams

Reputation: 798744

The . means match any character but "\n". The * means "repeat the previous character 0 or more times". The \. matches an actual ".".

BTW, this is all in the docs.
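The point that . matches anything but "\n" can be checked with a two-line string. The sketch below uses re.DOTALL, a flag not mentioned in the thread, purely for contrast; the text is invented.

```python
import re

text = "first line\nsecond.pdf"

# By default "." stops at "\n", so the match stays on the second line.
default = re.search(r".*\.pdf", text).group(0)

# With re.DOTALL, "." also matches "\n" and the match spans both lines.
dotall = re.search(r".*\.pdf", text, re.DOTALL).group(0)

print(repr(default))
print(repr(dotall))
```

For file paths this rarely matters, since a path normally contains no newline.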

Upvotes: 2
