Scraping values from a webpage table

Question

I want to create a Python dictionary that maps color names to background colors, using the table in this color dictionary.

What is the best way to access the color name strings and the background color hex values? I want to create a mapping for color name --> hex values, where 1 color name maps to 1 or more hex values.

The following is my code:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://people.csail.mit.edu/jaffer/Color/M.htm')
soup = BeautifulSoup(page.text)

I'm not sure how to specify what to scrape from the table. I've tried the following to get a format that's useful:

soup.td
abbey


soup.get_text()
"(M)
 td { padding: 0 10px; }  

(M) Dictionary of Color  Maerz and Paul, Dictionary of Color, 1st ed.   

abbey207
absinthe [green]  120
absinthe yellow105
acacia101102
academy blue173
acajou43
acanthe95
acier109
ackermann's green137
aconite violet223....
.............
yolk yellow84
yosemite76
yucatan5474
yucca150
yu chi146
yvette violet228

zaffre blue  179182
zanzibar47
zedoary wash71
zenith [blue]  199203
zephyr78
zinc233265
zinc green136
zinc orange5053
zinc yellow84
zinnia15
zulu47
zuni brown58

"


soup.select('tr td')
[...
burnt russet,
 16,
 43
 ,
 burnt sienna,
 38
 ,
 ...]

EDIT: I want to scrape the text of the td elements that hold the color name, e.g. "burnt russet", and the hex value from the "style" attribute (the background color) of the td elements that follow each name.

I want the dictionary to look as follows:

color_map = {'burnt russet': ['#722F37', '#79443B'], 'burnt sienna': ['#9E4732']}

Padraic Cunningham · Accepted Answer

Look for the td elements with the nowrap attribute, extract their text, and read the style attribute of each following sibling td:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://people.csail.mit.edu/jaffer/Color/M.htm')
soup = BeautifulSoup(page.content, "lxml")  # lxml (or html5lib) copes with the page's broken HTML
for td in soup.select("td[nowrap]"):
    print(td.text, [sib["style"] for sib in td.find_next_siblings("td")])

A snippet of the output:

(u'abbey', ['background-color:#604E97; color:#FFF'])
(u'absinthe [green]  ', ['background-color:#8A9A5B'])
(u'absinthe yellow', ['background-color:#B9B57D'])
(u'acacia', ['background-color:#EAE679', 'background-color:#B9B459'])
(u'academy blue', ['background-color:#367588'])
(u'acajou', ['background-color:#79443B; color:#FFF'])
(u'acanthe', ['background-color:#6C541E; color:#FFF'])
(u'acier', ['background-color:#8C8767'])
(u"ackermann's green", ['background-color:#355E3B; color:#FFF'])
(u'aconite violet', ['background-color:#86608E'])
(u'acorn', ['background-color:#7E6D5A; color:#FFF'])
(u'adamia', ['background-color:#563C5C; color:#FFF'])
(u'adelaide', ['background-color:#32174D; color:#FFF'])
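
Each style string above holds a background-color declaration, sometimes followed by a text color such as color:#FFF. As a rough sketch, a regular expression from the standard library can pull out just the background hex value regardless of where it sits in the string. The names BG_RE and background_hex are illustrative only and are not part of BeautifulSoup:

```python
import re

# Matches the hex value of a background-color declaration, e.g. "#604E97".
BG_RE = re.compile(r"background-color\s*:\s*(#[0-9A-Fa-f]{3,6})")

def background_hex(style):
    """Return the background-color hex from an inline style string, or None."""
    match = BG_RE.search(style)
    return match.group(1) if match else None

styles = [
    "background-color:#604E97; color:#FFF",  # background plus text color
    "background-color:#8A9A5B",              # background only
    "color:#FFF",                            # no background at all
]
hexes = [background_hex(s) for s in styles]
print(hexes)
```

Inside the loop, this could be applied to each sib["style"] in place of manual string splitting.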

If you just want the hex values, take the first declaration in each style string (the part before ";") and split it on ":":

page = requests.get('http://people.csail.mit.edu/jaffer/Color/M.htm')
soup = BeautifulSoup(page.content, "lxml")
d = {}
for td in soup.select("td[nowrap]"):
    cols = td.find_next_siblings("td")
    # Keep only the background-color value; skip any trailing "color:#FFF" text color.
    d[td.text] = [sib["style"].split(";")[0].split(":", 1)[-1] for sib in cols]

print(d)

That will give you a dict like:

{u'moonlight ': ['#FAD6A5', '#BFB8A5'], u'honey bird': ['#239EBA'], u'monte carlo ': ['#007A74', '#317873'],...............
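
Some keys in that output, such as 'moonlight ', keep the trailing whitespace from the page. A minimal sketch for tidying the names afterwards, assuming you want names that differ only in whitespace merged into one entry (normalize_color_map is a hypothetical helper name, and the sample dict just reuses the values shown above):

```python
from collections import defaultdict

def normalize_color_map(raw):
    """Trim and collapse whitespace in names; merge entries that end up equal."""
    cleaned = defaultdict(list)
    for name, hex_values in raw.items():
        key = " ".join(name.split())  # "moonlight " -> "moonlight"
        for value in hex_values:
            if value not in cleaned[key]:  # avoid duplicate hex values
                cleaned[key].append(value)
    return dict(cleaned)

sample = {
    "moonlight ": ["#FAD6A5", "#BFB8A5"],
    "honey bird": ["#239EBA"],
    "monte carlo ": ["#007A74", "#317873"],
}
color_map = normalize_color_map(sample)
print(color_map["moonlight"])
```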

You will need to use either lxml or html5lib as the parser to handle the broken HTML. I presume you are already using one of them; otherwise you would not get the output you show.
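
If you are unsure which of those parsers is installed, a quick check with the standard library looks roughly like this (available_parsers is an illustrative name; the parser string you then pass to BeautifulSoup would be "lxml" or "html5lib"):

```python
import importlib.util

def available_parsers(candidates=("lxml", "html5lib")):
    """Return the candidate parser packages that can be imported here."""
    return [name for name in candidates
            if importlib.util.find_spec(name) is not None]

found = available_parsers()
if found:
    print("Installed parsers:", ", ".join(found))
else:
    print("Neither lxml nor html5lib is installed")
```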
