sas

Reputation: 13

Extracting a value from an HTML string (Python)

I'm having some trouble extracting some data from this string:

<input class="mail-address-address" id="mailAddress" readonly="readonly" type="text" value="THE_EMAIL_ADDRESS_HERE"/>).

How can I store the contents of value= in my own variable? I've thought about splitting, but I don't think you can split on a whole word. Could you split at a certain character using the .count() method? Any help would be appreciated. Thanks.

EDIT:

I'm trying to get the element by its id from the page's HTML, since splinter did not seem to get the content for that id (it was just blank):

site = "https://10minutemail.com/10MinuteMail/index.html?dswid=9902"
req = urllib2.Request(site, headers=hdr)
page = urllib2.urlopen(req)
content = page.read()
soup = BeautifulSoup(content)
find = soup.find("class", {"id": "mailAddress"})
findId = soup.find(id="mailAddress")

The variable findId prints this:

<input class="mail-address-address" id="mailAddress" readonly="readonly" type="text" value="[email protected]"/>)

@Sidney

html_line= '''<input class="mail-address-address" id="mailAddress" readonly="readonly" type="text" value="[email protected]"/>)'''
input_value=html_line.split('value="',1)[1].rsplit('"',1)[0]
print(input_value) 

This works fine, except that the domain name changes. Using ''' means I can't use my own variable (findId). Is there a workaround for this?

Upvotes: 1

Views: 9669

Answers (3)

Corey Goldberg

Reputation: 60604

You should really use an HTML parser to parse HTML (not a regex or string manipulation). For example, you can use BeautifulSoup.

First, install the package:

pip install beautifulsoup4

Then use it to grab the value from your input tag:

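from bs4 import BeautifulSoup

# The HTML from the question, stored as a string.
html = '<input class="mail-address-address" id="mailAddress" readonly="readonly" type="text" value="THE_EMAIL_ADDRESS_HERE"/>'

soup = BeautifulSoup(html, 'html.parser')
val = soup.input['value']  # val now contains the string 'THE_EMAIL_ADDRESS_HERE'
print(val)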
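
If installing a third-party package is not an option, the same "use a real parser" idea can be sketched with the standard library's html.parser module instead of BeautifulSoup. This is only a minimal sketch, assuming Python 3; the class name InputValueParser and the helper input_value are made up for this example, and the id "mailAddress" is taken from the question.

```python
from html.parser import HTMLParser


class InputValueParser(HTMLParser):
    """Remembers the value attribute of the <input> tag with a given id.

    Hypothetical helper for illustration, not part of any library.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.value = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        # Self-closing tags like <input .../> also end up here, because the
        # default handle_startendtag calls handle_starttag.
        attributes = dict(attrs)
        if tag == "input" and attributes.get("id") == self.target_id:
            self.value = attributes.get("value")


def input_value(html, target_id="mailAddress"):
    """Return the value attribute of the matching input, or None."""
    parser = InputValueParser(target_id)
    parser.feed(html)
    parser.close()
    return parser.value
```

Calling input_value() on the input tag from the question should give back the text inside value="...", and None when no input with that id is present.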

Upvotes: 3

user3672754

Reputation:

As @Daniel Roseman says, it would be nice to have some more context. Normally when parsing HTML you can use libraries like BeautifulSoup. A good example for your case is Python beautifulsoup - getting input value.

If you want to code your own parser, or if you need something simple, you can even use split():

html_line='''<input class="mail-address-address" id="mailAddress" readonly="readonly" type="text" value="THE_EMAIL_ADDRESS_HERE"/>)'''
input_value=html_line.split('value="',1)[1].rsplit('"',1)[0]

I'd still advise you to use BeautifulSoup (and if you want a simple parser, @sidney's answer is the better choice).
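
Regarding the follow-up in the question: the triple quotes are only a way of writing a string literal, so the same splitting works on any string variable. Below is a minimal sketch, assuming Python 3; the function name extract_value is made up for this example, and it uses str.partition rather than the split/rsplit pair above, so it stops at the first closing quote after value=" instead of the last quote in the line.

```python
def extract_value(html_line):
    """Return the text between value=" and the next double quote, or None.

    Hypothetical helper for illustration; html_line can be any string.
    """
    # partition returns (before, separator, after); the separator is empty
    # when 'value="' does not occur in the string at all.
    _, marker, rest = html_line.partition('value="')
    if not marker:
        return None
    # Everything up to the next double quote is the attribute value.
    return rest.partition('"')[0]
```

If the input is a BeautifulSoup tag such as findId rather than a plain string, passing str(findId) should work, since str() of a tag gives its HTML markup.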

Upvotes: 2

sidney

Reputation: 827

This would be pretty messy to handle using .split(), so I would suggest using regular expressions (if you choose not to use HTML parsing libraries). To use regex, you need to import the re module, and use the following regular expression, " +value=\"(.*?)\"", like so:

import re
yourString = '<input class="mail-address-address" id="mailAddress" readonly="readonly" type="text" value="THE_EMAIL_ADDRESS_HERE"/>'

# m is the match object, containing data about the regex search.
m = re.search(" +value=\"(.*?)\"", yourString)

# To retrieve the content captured inside the parentheses inside the regex, look for saved matches.
value = m.group(1)

The regex searches for:

  • one or more spaces, followed directly by,
  • the literal string value=", followed directly by,
  • zero or more of any characters, searched non-greedily (this is so that the regex match doesn't keep searching after it encounters the end of the value). This bit is what we're looking to store.
  • a closing "
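
One thing to watch for: re.search returns None when nothing matches, so calling m.group(1) directly raises an AttributeError on input that has no value attribute. A minimal sketch of the same regex with that case guarded, assuming Python 3 (the names VALUE_PATTERN and find_value are made up for this example):

```python
import re

# Same pattern as above, written as a raw string inside single quotes so the
# double quotes need no backslash escaping.
VALUE_PATTERN = re.compile(r' +value="(.*?)"')


def find_value(html_line):
    """Return the captured value attribute, or None if the pattern is absent."""
    match = VALUE_PATTERN.search(html_line)
    return match.group(1) if match else None
```

Since the function takes the string as a parameter, it can be called with a variable (for example str(findId)) rather than a hard-coded literal.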

Upvotes: 2
