Reputation: 223
with requests.Session() as s:
    auth_return = s.get('https://urproject.com/?page=com_auth_return')
    soup = bs(auth_return.text,'html.parser')
What I got back looks like this:
<script type="text/javascript">
document.location = 'https://urproject.com/admin/php/user_id_check.php?EncData=abcdefg1234&EncKey=hijk9876';
</script>
From this, I want to extract EncData and EncKey. This is what I tried:
EncData = soup.find_all("EncData")
EncKey = soup.find_all("EncKey")
encdatanenckey = {'EncData':EncData,
'EncKey':EncKey}
print(encdatanenckey)
The result I want is:
{'EncData': 'abcdefg1234', 'EncKey': 'hijk9876'}
How would I get this? Do I have to use a regex? I'm pretty new to regex, so could you kindly give me an example?
Upvotes: 2
Views: 65
Reputation: 5434
I assume the URL you provided requires privileges, because my request to it was unsuccessful. Either way, below is a working example.
First, extract the URL from the HTML text. If every response you get back has the same shape, a simple string split does the job without a messy regex pattern:
from bs4 import BeautifulSoup
t = '''<script type="text/javascript">document.location = 'https://urproject.com/admin/php/user_id_check.php?EncData=abcdefg1234&EncKey=hijk9876';</script>'''
soup = BeautifulSoup(t,'html.parser')
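# read the script's contents directly; soup.text can skip <script> text in newer Beautiful Soup releases
url = soup.find('script').string.split("'")[1]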
url
>>'https://urproject.com/admin/php/user_id_check.php?EncData=abcdefg1234&EncKey=hijk9876'
On Python 3, you can use the parse module from urllib (urllib.parse), which makes this very easy. If you are still on an older version, you should really consider upgrading.
from urllib import parse
parse_url = parse.parse_qs(parse.urlparse(url).query)
EncData = parse_url['EncData'][0]
EncKey = parse_url['EncKey'][0]
encdatanenckey = {'EncData':EncData,
'EncKey':EncKey}
print(encdatanenckey)
>>{'EncData': 'abcdefg1234', 'EncKey': 'hijk9876'}
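One thing worth knowing is that parse_qs maps every key to a list of values, which is why the [0] is needed above. As a rough sketch (the helper name extract_params and the None default for missing keys are my own choices, not anything from your code), the lookup could be wrapped like this so a missing parameter does not raise a KeyError:

```python
from urllib.parse import urlparse, parse_qs

def extract_params(url, names=("EncData", "EncKey")):
    """Return {name: first value} for the requested query parameters."""
    query = parse_qs(urlparse(url).query)
    # parse_qs gives a list per key; take the first value, or None if the key is absent
    return {name: query.get(name, [None])[0] for name in names}

url = "https://urproject.com/admin/php/user_id_check.php?EncData=abcdefg1234&EncKey=hijk9876"
params = extract_params(url)
print(params)
# {'EncData': 'abcdefg1234', 'EncKey': 'hijk9876'}
```

If the redirect page ever comes back without those parameters, you get None values that you can check for instead of an exception.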
If urllib.parse is not available to you, you will have to split the string manually to get the parameters, which yields the same result:
EncData = [i.split("=")[-1] for i in url.split("?", 1)[-1].split("&") if i.startswith('EncData' + "=")][0]
EncKey = [i.split("=")[-1] for i in url.split("?", 1)[-1].split("&") if i.startswith('EncKey' + "=")][0]
encdatanenckey = {'EncData':EncData,
'EncKey':EncKey}
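If the two list comprehensions feel hard to read, the same manual split can probably be written once by building a dict of all key=value pairs first. This is only a sketch and assumes the values are plain text that does not need URL-decoding:

```python
url = "https://urproject.com/admin/php/user_id_check.php?EncData=abcdefg1234&EncKey=hijk9876"

# everything after the first "?" is the query string
query = url.split("?", 1)[-1]

# split each "key=value" pair once on "=", skipping fragments that have no "="
params = dict(pair.split("=", 1) for pair in query.split("&") if "=" in pair)

encdatanenckey = {"EncData": params["EncData"], "EncKey": params["EncKey"]}
print(encdatanenckey)
# {'EncData': 'abcdefg1234', 'EncKey': 'hijk9876'}
```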
Upvotes: 2
Reputation: 43
If you can already isolate the URL from the script content, a regex can be used like this:
import re
# re is the standard-library module that provides regular expression matching
url = 'https://urproject.com/admin/php/user_id_check.php?EncData=abcdefg1234&EncKey=hijk9876'  # this is your example URL
# this pattern matches any URL that has this same structure
pattern = re.compile(r'https://urproject\.com/admin/php/user_id_check\.php\?EncData=(.*?)&EncKey=(.*)')
result = pattern.match(url)
encdatanenckey = {
    'EncData': result.group(1),
    'EncKey': result.group(2)
}
print(encdatanenckey)
result.group(0), or equivalently result.group(), is the whole match. Parentheses pick out submatches, called capture groups: the first pair of parentheses yields result.group(1), the second result.group(2), and so on. Put a '\' before special characters such as '.' and '?' to escape them, since they otherwise have a special meaning inside a regex.
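If you have not isolated the URL yet, a variation that might work is to search the raw script text directly and use named groups, so the match converts straight into a dict. This is a sketch that assumes EncData always comes right before EncKey, as in your example; script_text here is just a stand-in for the script contents you scraped:

```python
import re

# stand-in for the text inside the <script> tag
script_text = """
document.location = 'https://urproject.com/admin/php/user_id_check.php?EncData=abcdefg1234&EncKey=hijk9876';
"""

# each value runs until the next '&' or a closing quote
pattern = re.compile(r"[?&]EncData=(?P<EncData>[^&'\"]+)&EncKey=(?P<EncKey>[^&'\"]+)")

# search() scans the whole text, unlike match(), which only tries the start
match = pattern.search(script_text)
encdatanenckey = match.groupdict() if match else {}
print(encdatanenckey)
# {'EncData': 'abcdefg1234', 'EncKey': 'hijk9876'}
```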
Upvotes: 1
Reputation: 3107
First, use bs4 to extract the script contents, then match the specific values with a regex:
from bs4 import BeautifulSoup
import re
html = """
<script type="text/javascript" ...></script>
<script type="text/javascript">
document.location = 'https://urproject.com/admin/php/user_id_check.php?EncData=abcdefg1234&EncKey=hijk9876';
</script>
"""
soup = BeautifulSoup(html,'lxml')
js_ = soup.find_all("script",text=True)
regex = r"(?<={}\=).*?(?=&|\'|\")"
EncData = [ re.search(regex.format("EncData"),url.text).group(0) for url in js_]
EncKey = [ re.search(regex.format("EncKey"),url.text).group(0) for url in js_]
encdatanenckey = {'EncData':EncData,
'EncKey':EncKey}
print(encdatanenckey)
# {'EncData': ['abcdefg1234'], 'EncKey': ['hijk9876']}
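One caveat: if the page has other inline scripts that do not contain these parameters, re.search returns None for them and .group(0) raises an AttributeError. A guarded version could look roughly like the following. The find_param helper and the script_texts list are made up for illustration (the list stands in for [tag.text for tag in js_]), so adapt them to your page:

```python
import re

# hypothetical inline script bodies, standing in for [tag.text for tag in js_]
script_texts = [
    "console.log('unrelated script');",
    "document.location = 'https://urproject.com/admin/php/user_id_check.php?EncData=abcdefg1234&EncKey=hijk9876';",
]

# same idea as the regex above: text after "name=" up to '&' or a quote
regex = r"(?<={}=).*?(?=&|'|\")"

def find_param(name, texts):
    """Return the first value found for `name`, or None if no script contains it."""
    pattern = re.compile(regex.format(re.escape(name)))
    for text in texts:
        match = pattern.search(text)
        if match:  # skip scripts where the parameter is missing
            return match.group(0)
    return None

encdatanenckey = {name: find_param(name, script_texts) for name in ("EncData", "EncKey")}
print(encdatanenckey)
# {'EncData': 'abcdefg1234', 'EncKey': 'hijk9876'}
```

This also returns plain strings rather than one-element lists, which matches the output asked for in the question.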
Upvotes: 2