Mel

Reputation: 175

Reading Regular Expressions from a text file

I'm currently trying to write a function that takes two inputs:

1 - The URL for a web page
2 - The name of a text file containing some regular expressions

My function should read the text file line by line (each line being a different regex) and then run each regex against the web page source code. However, I've run into trouble doing this:

Example: Suppose I want the address shown on a Yelp page with URL = http://www.yelp.com/biz/liberty-grill-cork, where the regex is \<address\>\s*([^<]*)\\b\s*<. In Python, I then run:

address = re.search('\<address\>\s*([^<]*)\\b\s*<', web_page_source_code)

The above works. However, if I write the regex in a text file as is and then read it from that file, it doesn't work. So reading the regex from a text file is what causes the problem. How can I rectify this?

EDIT: This is how I'm reading the regexes from the text file:

with open("test_file.txt","r") as file:
    for regex in file:
        address = re.search(regex, web_page_source_code)

Just to add, the reason I want to read regexes from a text file is so that my function code can stay the same and I can alter my list of regexes easily. If anyone can suggest any other alternatives that would be great.

Upvotes: 2

Views: 1968

Answers (2)

Mel

Reputation: 175

OK, I managed to get it working. For anyone who wants to read regular expressions from text files, you need to do the following:

  1. Ensure that the regex in the text file is entered in the right format (thanks to MightyPork for pointing that out)
  2. You also need to remove the newline '\n' character at the end

So overall, your code should look something like:

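import re

# page_source_code is assumed to already hold the page's HTML
with open("test_file.txt", "r") as a:
    line = a.readline()
line = line.rstrip('\n')  # drop the trailing newline
result = re.search(line, page_source_code)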
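
To apply every regex in the file rather than just the first line, something along these lines might work. This is only a sketch: load_patterns and search_all are hypothetical helper names, the list file_lines stands in for the lines of test_file.txt, and the sample HTML is made up for illustration.

```python
import re

def load_patterns(lines):
    """Compile one regex per non-empty line, dropping the trailing newline."""
    patterns = []
    for raw in lines:
        text = raw.rstrip("\r\n")
        if text:  # skip blank lines
            patterns.append(re.compile(text))
    return patterns

def search_all(patterns, source):
    """Return the first match (or None) for each pattern, in order."""
    return [p.search(source) for p in patterns]

# Stand-in for the lines of test_file.txt, exactly as the file would hold them
# (hypothetical data; each line still carries its newline).
file_lines = [r"\<address\>\s*([^<]*)\b\s*<" + "\n", "\n"]

# Made-up page source for illustration.
page_source_code = "<address>\n  14 Washington St, Cork\n</address>"

matches = search_all(load_patterns(file_lines), page_source_code)
found = matches[0].group(1) if matches[0] else None
print(found)
```

Because a file object yields its lines when iterated, load_patterns should accept an open file directly, for example inside a `with open("test_file.txt") as f:` block.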

Upvotes: 1

MightyPork

Reputation: 18861

Your string has some backslashes and other characters escaped to avoid their special meaning in a Python string literal, not only for the regex itself.

You can easily verify what happens by printing the string you load from the file. If your backslashes come out doubled, the file is escaped incorrectly.

The text you want in the file is:

File

\<address\>\s*([^<]*)\b\s*<

Here's how you can check it:

In [1]: a = open('testfile.txt')

In [2]: line = a.readline()

-- this is the line as you'd see it in python code when properly escaped

In [3]: line
Out[3]: '\\<address\\>\\s*([^<]*)\\b\\s*<\n'

-- this is what it actually means (what re will use)

In [4]: print(line)
\<address\>\s*([^<]*)\b\s*<
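
The same check can be run as a plain script. The sketch below compares the text as it should appear in the file with the pattern pasted verbatim from the Python source (which still has the doubled backslash before b), and also shows the effect of the trailing newline. The variable names and the sample HTML are assumptions made for illustration.

```python
import re

# What readline() returns when the file holds single backslashes:
# the pattern text followed by the newline at the end of the line.
from_file = "\\<address\\>\\s*([^<]*)\\b\\s*<\n"

# The same pattern written as a raw string literal in source code.
in_source = r"\<address\>\s*([^<]*)\b\s*<"

# The pattern copied character for character out of the non-raw Python
# literal, so the file would contain two backslashes before the b.
copied_verbatim = r"\<address\>\s*([^<]*)\\b\s*<"

print(repr(from_file))  # repr doubles each backslash and shows \n
print(from_file)        # print shows the characters re actually receives

html = "<address>1 Main St</address>"  # made-up sample input

same = from_file.rstrip("\n") == in_source
with_newline = re.search(from_file, html)                   # newline is part of the pattern
without_newline = re.search(from_file.rstrip("\n"), html)   # newline removed
doubled = re.search(copied_verbatim, html)                  # \\b means a literal backslash then b
```

With this sample input, only the stripped, single-backslash version finds the address; the other two searches come back as None.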

Upvotes: 1
