user-44651

Reputation: 4124

Regex Not Matching Pattern from BeautifulSoup results

I'm attempting to parse some HTML and extract a value with a regular expression. When I validate the regex with online tools, it works and finds the value. However, when I run the same pattern against the BeautifulSoup output, it fails to match.

I am looking to grab this data: /some/path/to/file?accountTransactionID=f2448439-ec25-4a61-a6f4-4c6fa0767f19&accountNumber=123456&searchValue=ABC123&isActiveHistory=True

From this line:

 var url = '/some/path/to/file?accountTransactionID=f2448439-ec25-4a61-a6f4-4c6fa0767f19&accountNumber=123456&searchValue=ABC123&isActiveHistory=True'

In the below demo html.

Here is the Python script I'm working with. I have used several SO questions, including this one, but have not had any success.

If I use soup = BeautifulSoup(fp, 'html.parser').find_all(string=PATTERN) then the full text of the script has been stored in an array. I've tried looping through the array to find the text again, but it always comes up empty.

What have I done wrong?


Python:

import os
import re

from bs4 import BeautifulSoup

FILE_PATH = os.getcwd() + '/demo.html'
PATTERN = re.compile('var url = \'(.*?)\'')

with open(FILE_PATH) as fp:
    soup = BeautifulSoup(fp, 'html.parser')  # .find_all(string=PATTERN)
    data = PATTERN.match(str(soup))
    print(f'Data: {data}')
    # for script in soup:
    #     print(script)
    #     data = PATTERN.match(str(script))
    #     if data is not None:
    #         print(f'Data: {data}')
    #     else:
    #         print('NO DATA FOUND')

Outputs: Data: None


HTML:

<!DOCTYPE html>
<html lang="en">
<head>
    <script src="/some/path/1"></script>
    <script src="/some/path/2"></script>
    <script src="/some/path/31"></script>
</head>
<body>
<script type="text/javascript">
        function downloadFile() {
            var readyToDownload = 'f2448439-ec25-4a61-a6f4-4c6fa0767f19';
            if (readyToDownload !== '')
            {
                var url = '/some/path/to/file?accountTransactionID=f2448439-ec25-4a61-a6f4-4c6fa0767f19&amp;accountNumber=123456&amp;searchValue=ABC123&amp;isActiveHistory=True'
                url = url.replace(/&amp;/g, "&")
                window.open(url, '_blank');
            }
        }
    </script>
</body>
</html>

Upvotes: 0

Views: 166

Answers (1)

CodeMonkey

Reputation: 23738

BeautifulSoup isn't much help for this. With string=PATTERN, find() or find_all() returns the whole text node that contains a match, which in this case is the entire contents of the <script> element, not just the part you want.

Just read the whole text of the file and call search() instead of match() on the pattern. match() only matches at the beginning of the string, and the string here starts with <!DOCTYPE html>, so it never finds anything. search() scans the whole string and returns the first match.
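To make the difference concrete, here is a minimal sketch on a shortened stand-in string. The sample text is made up for illustration and is not the real page; the pattern is the same as in the question, written as a raw string.

```python
import re

PATTERN = re.compile(r"var url = '(.*?)'")

# Shortened, hypothetical stand-in for the page source.
text = "<script>\n    var url = '/some/path/to/file?accountNumber=123456'\n</script>"

# match() only succeeds if the pattern matches starting at index 0.
print(PATTERN.match(text))  # None, because the text starts with "<script>"

# search() scans forward and returns the first match anywhere in the string.
m = PATTERN.search(text)
print(m.group(1))  # /some/path/to/file?accountNumber=123456
```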

Try this:

with open(FILE_PATH) as fp:
    html = fp.read()
m = PATTERN.search(html)
if m:
    print(m.group(1))

Output:

/some/path/to/file?accountTransactionID=f2448439-ec25-4a61-a6f4-4c6fa0767f19&amp;accountNumber=123456&amp;searchValue=ABC123&amp;isActiveHistory=True
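Note that the captured value still contains &amp;. If you would rather keep BeautifulSoup in the pipeline and also decode those entities, something along these lines might work. It is only a sketch: SAMPLE is a shortened, hypothetical stand-in for demo.html, extract_url is a helper name invented here, and html.unescape from the standard library is used to mimic the page's own url.replace(/&amp;/g, "&").

```python
import html
import re

from bs4 import BeautifulSoup

PATTERN = re.compile(r"var url = '(.*?)'")

# Shortened, hypothetical stand-in for demo.html.
SAMPLE = """<html><body>
<script type="text/javascript">
    var url = '/some/path/to/file?accountNumber=123456&amp;searchValue=ABC123'
</script>
</body></html>"""


def extract_url(markup):
    """Return the decoded URL from the first matching text node, or None."""
    soup = BeautifulSoup(markup, "html.parser")
    # string=PATTERN yields each whole text node that contains a match,
    # so the pattern is applied again to pull out the captured group.
    for node in soup.find_all(string=PATTERN):
        m = PATTERN.search(str(node))
        if m:
            # Turn &amp; back into &, as the page's JavaScript does.
            return html.unescape(m.group(1))
    return None


print(extract_url(SAMPLE))
```

For a single known pattern, searching the raw file text as shown above is simpler; going through BeautifulSoup mainly helps when you need to restrict the search to particular elements.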

Upvotes: 1
