Reputation: 404
I want to create a Python dictionary that maps color names to background colors from this color dictionary.
What is the best way to access the color name strings and the background color hex values? I want to create a mapping for color name --> hex values, where 1 color name maps to 1 or more hex values.
The following is my code:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://people.csail.mit.edu/jaffer/Color/M.htm')
soup = BeautifulSoup(page.text)
I'm not sure how to specify what to scrape from the table. I've tried the following to get a format that's useful:
soup.td
<td nowrap="" width="175*">abbey</td>
soup.get_text()
"(M)\n td { padding: 0 10px; } \n\n(M) Dictionary of Color Maerz and Paul, Dictionary of Color, 1st ed. \n\nabbey207\nabsinthe [green] 120\nabsinthe yellow105\nacacia101102\nacademy blue173\nacajou43\nacanthe95\nacier109\nackermann's green137\naconite violet223....
.............\nyolk yellow84\nyosemite76\nyucatan5474\nyucca150\nyu chi146\nyvette violet228\n\nzaffre blue 179182\nzanzibar47\nzedoary wash71\nzenith [blue] 199203\nzephyr78\nzinc233265\nzinc green136\nzinc orange5053\nzinc yellow84\nzinnia15\nzulu47\nzuni brown58\n\n"
soup.select('tr td')
[...
<td nowrap="" width="175*">burnt russet</td>,
<td style="background-color:#722F37; color:#FFF" title="16">16</td>,
<td style="background-color:#79443B; color:#FFF" title="43">43
</td>,
<td nowrap="" width="175*">burnt sienna</td>,
<td style="background-color:#9E4732; color:#FFF" title="38">38
</td>,
...]
EDIT: I want to scrape the strings in the td elements, e.g. "burnt russet", as the color name, and the hex value given as the background color in the "style" attribute of the td elements that follow it.
I want the dictionary to look as follows:
color_map = {'burnt russet': ['#722F37', '#79443B'], 'burnt sienna': ['#9E4732']}
Upvotes: 1
Views: 331
Reputation: 180391
Just look for the tds with nowrap, extract the text, and get the style attribute of each following sibling td:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://people.csail.mit.edu/jaffer/Color/M.htm')
soup = BeautifulSoup(page.content)
for td in soup.select("td[nowrap]"):
    print(td.text, [sib["style"] for sib in td.find_next_siblings("td")])
A snippet of the output:
(u'abbey', ['background-color:#604E97; color:#FFF'])
(u'absinthe [green] ', ['background-color:#8A9A5B'])
(u'absinthe yellow', ['background-color:#B9B57D'])
(u'acacia', ['background-color:#EAE679', 'background-color:#B9B459'])
(u'academy blue', ['background-color:#367588'])
(u'acajou', ['background-color:#79443B; color:#FFF'])
(u'acanthe', ['background-color:#6C541E; color:#FFF'])
(u'acier', ['background-color:#8C8767'])
(u"ackermann's green", ['background-color:#355E3B; color:#FFF'])
(u'aconite violet', ['background-color:#86608E'])
(u'acorn', ['background-color:#7E6D5A; color:#FFF'])
(u'adamia', ['background-color:#563C5C; color:#FFF'])
(u'adelaide', ['background-color:#32174D; color:#FFF'])
If you just want the hex values, you can split the style text on "; " and then split the substrings on ":":
page = requests.get('http://people.csail.mit.edu/jaffer/Color/M.htm')
soup = BeautifulSoup(page.content)
d = {}
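for td in soup.select("td[nowrap]"):
    cols = td.find_next_siblings("td")
    # keep only the background-color part; the style may also contain "color:#FFF"
    d[td.text] = [st.split(":", 1)[-1] for sib in cols for st in sib["style"].split("; ")
                  if st.startswith("background-color")]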
print(d)
That will give you a dict like:
{u'moonlight ': ['#FAD6A5', '#BFB8A5'], u'honey bird': ['#239EBA'], u'monte carlo ': ['#007A74', '#317873'],...............
You will need to use either lxml or html5lib as the parser to handle the broken HTML. I presume you are already using one of them; otherwise you would not get the output you show.
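As a variation on the string splitting above, a regular expression can pick out only the background color and skip the foreground color:#FFF part of the style. This is a minimal sketch rather than part of the original answer: background_hex and BG_RE are hypothetical names, and the style strings are copied from the output shown earlier.

```python
import re

# Matches only the background-color declaration, so a trailing
# "color:#FFF" in the same style attribute is ignored.
BG_RE = re.compile(r"background-color:\s*(#[0-9A-Fa-f]{3,6})")


def background_hex(style):
    """Return the background hex color from a style string, or None."""
    match = BG_RE.search(style)
    return match.group(1) if match else None


# Style strings taken from the sample output above.
styles = ['background-color:#722F37; color:#FFF', 'background-color:#8A9A5B']
hexes = [background_hex(s) for s in styles]
print(hexes)  # ['#722F37', '#8A9A5B']
```

Inside the loop, this could replace the nested comprehension with something like [background_hex(sib["style"]) for sib in cols].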
Upvotes: 3
Reputation: 1079
The webpage that you are trying to scrape is badly formed HTML. Viewing the page source shows that most rows start with a <tr> followed by one or more <td> elements, all without their closing tags. With BeautifulSoup one should specify an HTML parser, and in this case we have to hope the parser can recover the table structure.
I present a solution that relies on the structured format of the webpage itself. Instead of parsing the webpage as HTML, I use the fact that each color has its own line and each line has a common format.
import re
import requests
page = requests.get('http://people.csail.mit.edu/jaffer/Color/M.htm')
lines = page.text.splitlines()
opening = '<tr><td width="175*" nowrap>'
ending = '<td title="'
bg_re = r'style="background-color:(#.{6})'
color_map = dict()
for line in lines:
    if line.startswith(opening):
        color_name = line[len(opening):line.find(ending)].strip()
        color_hex = [match.group(1) for match in re.finditer(bg_re, line)]
        if color_name in color_map:
            color_map[color_name].extend(color_hex)  # Note: some colors are repeated
        else:
            color_map[color_name] = color_hex
color_map['burnt russet']
## ['#722F37', '#79443B']
Quick and dirty, but it works.
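The same line-based idea can be tried offline with collections.defaultdict, which removes the if/else for repeated color names. This is only a sketch: the sample lines are hypothetical, written to match the opening and ending strings used above, and the real page may differ in attribute order or spacing. build_color_map is a name made up for this example.

```python
import re
from collections import defaultdict

# Hypothetical sample lines (assumed format, not copied from the real page).
sample_lines = [
    '<tr><td width="175*" nowrap>burnt russet'
    '<td title="16" style="background-color:#722F37; color:#FFF">16'
    '<td title="43" style="background-color:#79443B; color:#FFF">43',
    '<tr><td width="175*" nowrap>burnt sienna'
    '<td title="38" style="background-color:#9E4732; color:#FFF">38',
    '<p>not a color row</p>',
]

opening = '<tr><td width="175*" nowrap>'
ending = '<td title="'
bg_re = re.compile(r'style="background-color:(#[0-9A-Fa-f]{6})')


def build_color_map(lines):
    """Map each color name to the list of background hex values on its rows."""
    color_map = defaultdict(list)
    for line in lines:
        if not line.startswith(opening):
            continue  # skip anything that is not a color row
        name = line[len(opening):line.find(ending)].strip()
        # findall returns the captured hex strings; repeated names accumulate
        color_map[name].extend(bg_re.findall(line))
    return dict(color_map)


color_map = build_color_map(sample_lines)
print(color_map['burnt russet'])  # ['#722F37', '#79443B']
```

To run it against the live page, pass page.text.splitlines() instead of sample_lines.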
Upvotes: 1