Taewoo.Lim

Reputation: 223

beautifulsoup4 python working with parsed data

import requests
from bs4 import BeautifulSoup as bs

with requests.Session() as s:
    auth_return = s.get('https://urproject.com/?page=com_auth_return')
    soup = bs(auth_return.text, 'html.parser')

What I got back looks like this:

<script type="text/javascript">
document.location = 'https://urproject.com/admin/php/user_id_check.php?EncData=abcdefg1234&EncKey=hijk9876';
</script>

From this, I want to extract EncData and EncKey, something like:

EncData = soup.find_all("EncData")
EncKey = soup.find_all("EncKey")

encdatanenckey = {'EncData':EncData,
             'EncKey':EncKey}

print(encdatanenckey)

The result I want is:

{'EncData': 'abcdefg1234', 'EncKey': 'hijk9876'}

How can I get this? Do I have to use a regex? I'm new to regular expressions, so could you kindly give me an example?

Upvotes: 2

Views: 65

Answers (3)

BernardL

Reputation: 5434

I assume you need privileges to access the provided URL, because my request to it was unsuccessful. Anyway, below is a working example based on the HTML you posted.

First, extract the URL from the HTML text. If all of your returned HTML has the same shape, a simple string split does the job without a messy regex pattern:

from bs4 import BeautifulSoup

t = '''<script type="text/javascript">document.location = 'https://urproject.com/admin/php/user_id_check.php?EncData=abcdefg1234&EncKey=hijk9876';</script>'''

soup = BeautifulSoup(t, 'html.parser')
# Read the <script> tag's own string: newer Beautiful Soup releases
# leave script contents out of soup.text. The URL sits between the
# first pair of single quotes.
url = soup.find('script').string.split("'")[1]
print(url)
# https://urproject.com/admin/php/user_id_check.php?EncData=abcdefg1234&EncKey=hijk9876

On Python 3, you can use the urllib.parse module from the standard library, which makes this very easy. If you are not on Python 3, you should really consider upgrading.

from urllib import parse
parse_url = parse.parse_qs(parse.urlparse(url).query)
EncData = parse_url['EncData'][0]
EncKey = parse_url['EncKey'][0]

encdatanenckey = {'EncData':EncData,
             'EncKey':EncKey}

print(encdatanenckey)
# {'EncData': 'abcdefg1234', 'EncKey': 'hijk9876'}
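
The [0] above is needed because parse_qs maps every key to a list of values (a key may repeat in a query string). As a rough sketch, and assuming each parameter appears only once, parse_qsl from the same module can build a flat dict directly. The url value is just the example from the question:

```python
from urllib import parse

url = 'https://urproject.com/admin/php/user_id_check.php?EncData=abcdefg1234&EncKey=hijk9876'
query = parse.urlparse(url).query

# parse_qs: every value is wrapped in a list
print(parse.parse_qs(query))
# {'EncData': ['abcdefg1234'], 'EncKey': ['hijk9876']}

# parse_qsl returns (key, value) pairs; dict() keeps the last value per key
params = dict(parse.parse_qsl(query))
print(params)
# {'EncData': 'abcdefg1234', 'EncKey': 'hijk9876'}
```

If a key could legitimately repeat, the list form from parse_qs is the safer choice.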

If urllib.parse is not available to you (for example on Python 2), you can split the string manually, which yields the same result:

# split("=", 1) keeps any "=" inside the value (e.g. trailing padding) intact
EncData = [i.split("=", 1)[-1] for i in url.split("?", 1)[-1].split("&") if i.startswith('EncData' + "=")][0]
EncKey = [i.split("=", 1)[-1] for i in url.split("?", 1)[-1].split("&") if i.startswith('EncKey' + "=")][0]
encdatanenckey = {'EncData':EncData,
             'EncKey':EncKey}
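
For reference, the two steps (find the URL, then parse its query string) could be folded into one helper. This is only a sketch using the standard library: the function name extract_enc_params is made up for illustration, a small regex stands in for the string split, and it assumes the page always assigns a quoted URL to document.location.

```python
import re
from urllib import parse

def extract_enc_params(html):
    """Return EncData/EncKey from a document.location redirect in html."""
    # Find the quoted URL assigned to document.location (single or double quotes)
    match = re.search(r"document\.location\s*=\s*(['\"])(.*?)\1", html)
    if match is None:
        raise ValueError("no document.location redirect found")
    query = parse.parse_qs(parse.urlparse(match.group(2)).query)
    # parse_qs maps each key to a list of values; keep the first one
    return {key: query[key][0] for key in ("EncData", "EncKey")}

html = '''<script type="text/javascript">
document.location = 'https://urproject.com/admin/php/user_id_check.php?EncData=abcdefg1234&EncKey=hijk9876';
</script>'''
print(extract_enc_params(html))
# {'EncData': 'abcdefg1234', 'EncKey': 'hijk9876'}
```

With the session from the question, the call would presumably be extract_enc_params(auth_return.text).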

Upvotes: 2

HugoHonda

Reputation: 43

If you can already isolate the URL from the script content, a regex can be used this way:

import re
# re is a module that provides regular expression matching

url = 'https://urproject.com/admin/php/user_id_check.php?EncData=abcdefg1234&EncKey=hijk9876'  # this is your example URL

pattern = re.compile(r'https:\/\/urproject\.com\/admin\/php\/user_id_check\.php\?EncData=(.*?)\&EncKey=(.*)')
# this pattern is used to match any URL that has this same structure
result = pattern.match(url)

encdatanenckey = {
    'EncData': result.group(1),
    'EncKey': result.group(2)
}

print(encdatanenckey)

result.group(0), or equivalently result.group(), is the whole match. Parentheses pick out submatches, called capture groups: the first pair of parentheses yields result.group(1), the second yields result.group(2), and so on. Put a '\' before special characters such as '.' and '?' to escape them (unescaped, they have a different meaning inside a regex).
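
To make the group numbering concrete, here is a small sketch on the same example URL. It is a variation rather than the pattern above: it matches only the query part, uses named groups, and the group names data and key are arbitrary choices for illustration.

```python
import re

url = 'https://urproject.com/admin/php/user_id_check.php?EncData=abcdefg1234&EncKey=hijk9876'

# (?P<name>...) is a named capture group; [^&]+ stops at the next parameter
pattern = re.compile(r'EncData=(?P<data>[^&]+)&EncKey=(?P<key>[^&]+)')
# search() scans the whole string; match() only tries at the start
result = pattern.search(url)

print(result.group(0))      # EncData=abcdefg1234&EncKey=hijk9876
print(result.group(1))      # abcdefg1234
print(result.group('key'))  # hijk9876
print(result.groupdict())   # {'data': 'abcdefg1234', 'key': 'hijk9876'}
```

Named groups can still be read by number, so group(1) and group('data') return the same string.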

Upvotes: 1

KC.

Reputation: 3107

First use bs4 to extract the script content, then pull out the specific values with a regex:

from bs4 import BeautifulSoup
import re

html = """
<script type="text/javascript" ...></script>
<script type="text/javascript">
document.location = 'https://urproject.com/admin/php/user_id_check.php?EncData=abcdefg1234&EncKey=hijk9876';
</script>
"""
soup = BeautifulSoup(html, 'lxml')
# keep only <script> tags that actually have content
js_ = soup.find_all("script", string=True)
# the lookbehind anchors on "<name>=", the lookahead stops at &, ' or "
regex = r"(?<={}\=).*?(?=&|\'|\")"
EncData = [re.search(regex.format("EncData"), tag.text).group(0) for tag in js_]
EncKey = [re.search(regex.format("EncKey"), tag.text).group(0) for tag in js_]

encdatanenckey = {'EncData':EncData,
             'EncKey':EncKey}

print(encdatanenckey)
# {'EncData': ['abcdefg1234'], 'EncKey': ['hijk9876']}
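
The list comprehensions above give one-element lists, and re.search(...).group(0) would raise AttributeError for any script body that lacks the parameter. A possible variation that returns plain strings and skips non-matching scripts is sketched below. The helper name first_param is hypothetical, and script_texts is a stand-in for something like [tag.text for tag in js_].

```python
import re

def first_param(name, texts):
    """Return the first value of query parameter `name` found in texts, else None."""
    # Lookbehind anchors on "name=", lookahead stops at &, ' or "
    regex = re.compile(r"(?<={}=).*?(?=&|'|\")".format(re.escape(name)))
    for text in texts:
        found = regex.search(text)
        if found:
            return found.group(0)
    return None

# Stand-in for the text of each <script> tag on the page
script_texts = [
    "var unrelated = 1;",
    "document.location = 'https://urproject.com/admin/php/user_id_check.php?EncData=abcdefg1234&EncKey=hijk9876';",
]
encdatanenckey = {name: first_param(name, script_texts) for name in ("EncData", "EncKey")}
print(encdatanenckey)
# {'EncData': 'abcdefg1234', 'EncKey': 'hijk9876'}
```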

Upvotes: 2
