Fruitspunchsamurai

Reputation: 404

Scraping values from a webpage table

I want to build a Python dictionary that maps color names to background colors, using the data in this color dictionary.

What is the best way to access the color name strings and the background color hex values? I want a mapping of color name --> hex values, where each color name maps to one or more hex values.

The following is my code:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://people.csail.mit.edu/jaffer/Color/M.htm')
soup = BeautifulSoup(page.text)

I'm not sure how to specify what to scrape from the table. I've tried the following to get a format that's useful:

soup.td
<td nowrap="" width="175*">abbey</td>


soup.get_text()
"(M)\n td { padding: 0 10px; }  \n\n(M) Dictionary of Color  Maerz and Paul, Dictionary of Color, 1st ed.   \n\nabbey207\nabsinthe [green]  120\nabsinthe yellow105\nacacia101102\nacademy blue173\nacajou43\nacanthe95\nacier109\nackermann's green137\naconite violet223....
.............\nyolk yellow84\nyosemite76\nyucatan5474\nyucca150\nyu chi146\nyvette violet228\n\nzaffre blue  179182\nzanzibar47\nzedoary wash71\nzenith [blue]  199203\nzephyr78\nzinc233265\nzinc green136\nzinc orange5053\nzinc yellow84\nzinnia15\nzulu47\nzuni brown58\n\n"


soup.select('tr td')
[...
<td nowrap="" width="175*">burnt russet</td>,
 <td style="background-color:#722F37; color:#FFF" title="16">16</td>,
 <td style="background-color:#79443B; color:#FFF" title="43">43
 </td>,
 <td nowrap="" width="175*">burnt sienna</td>,
 <td style="background-color:#9E4732; color:#FFF" title="38">38
 </td>,
 ...]

EDIT: I want to scrape the text of the name td elements (e.g. "burnt russet") as the color, together with the hex value given as the background color in the "style" attribute of the td elements that follow it.

I want the dictionary to look as follows:

color_map = {'burnt russet': ['#722F37', '#79443B'], 'burnt sienna': ['#9E4732']}

Upvotes: 1

Views: 331

Answers (2)

Padraic Cunningham

Reputation: 180391

Just look for the tds with the nowrap attribute, extract their text, and read the style attribute of each following sibling td:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://people.csail.mit.edu/jaffer/Color/M.htm')
soup = BeautifulSoup(page.content, "lxml")  # lxml (or html5lib) is needed for the broken HTML
for td in soup.select("td[nowrap]"):
    print(td.text, [sib["style"] for sib in td.find_next_siblings("td")])

A snippet of the output:

(u'abbey', ['background-color:#604E97; color:#FFF'])
(u'absinthe [green]  ', ['background-color:#8A9A5B'])
(u'absinthe yellow', ['background-color:#B9B57D'])
(u'acacia', ['background-color:#EAE679', 'background-color:#B9B459'])
(u'academy blue', ['background-color:#367588'])
(u'acajou', ['background-color:#79443B; color:#FFF'])
(u'acanthe', ['background-color:#6C541E; color:#FFF'])
(u'acier', ['background-color:#8C8767'])
(u"ackermann's green", ['background-color:#355E3B; color:#FFF'])
(u'aconite violet', ['background-color:#86608E'])
(u'acorn', ['background-color:#7E6D5A; color:#FFF'])
(u'adamia', ['background-color:#563C5C; color:#FFF'])
(u'adelaide', ['background-color:#32174D; color:#FFF'])

If you just want the hex values, split the style text on "; ", keep the first declaration (the background color), and split that on ":":

page = requests.get('http://people.csail.mit.edu/jaffer/Color/M.htm')
soup = BeautifulSoup(page.content, "lxml")  # lxml (or html5lib) is needed for the broken HTML
d = {}
for td in soup.select("td[nowrap]"):
    cols = td.find_next_siblings("td")
    # Keep only the background-color value; some styles also carry "color:#FFF".
    d[td.text] = [sib["style"].split("; ")[0].split(":", 1)[-1] for sib in cols]

print(d)

That will give you a dict like:

{u'moonlight ': ['#FAD6A5', '#BFB8A5'], u'honey bird': ['#239EBA'], u'monte carlo ': ['#007A74', '#317873'],...............
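Positional splitting depends on the background color always being the first declaration in the style text. A slightly more defensive option, sketched here under the assumption that the style strings look like the ones printed above, is to look the property up by name. `background_hex` is a hypothetical helper name, not part of BeautifulSoup.

```python
def background_hex(style):
    """Return the background-color value from an inline style string, or None."""
    for declaration in style.split(";"):
        # partition() splits on the first ":" only, e.g. "background-color:#604E97"
        prop, sep, value = declaration.partition(":")
        if sep and prop.strip().lower() == "background-color":
            return value.strip()
    return None


print(background_hex("background-color:#604E97; color:#FFF"))
print(background_hex("background-color:#8A9A5B"))
```

Inside the loop above, this could be called once per sibling td, skipping any result that is None.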

You need either lxml or html5lib as the parser to handle the broken HTML. BeautifulSoup picks the best parser it finds installed when none is named, so I presume you have one of them; otherwise you would not get the output you show.
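If neither parser can be installed, one possible fallback is the standard library's event-based `html.parser.HTMLParser`, which reports start tags as it meets them and so does not care whether a td is ever closed. This is a different technique from the BeautifulSoup code above, shown only as a rough sketch. The `SAMPLE` markup is an assumption about what the raw rows look like (unclosed tds, title before style), and `ColorTableParser` is a hypothetical name.

```python
import re
from html.parser import HTMLParser

# Assumed shape of the raw rows: unclosed <td> tags, name cell marked with nowrap.
SAMPLE = (
    '<table>\n'
    '<tr><td width="175*" nowrap>burnt russet'
    '<td title="16" style="background-color:#722F37; color:#FFF">16'
    '<td title="43" style="background-color:#79443B; color:#FFF">43\n'
    '<tr><td width="175*" nowrap>burnt sienna'
    '<td title="38" style="background-color:#9E4732; color:#FFF">38\n'
    '</table>\n'
)

BG_RE = re.compile(r"background-color:\s*(#[0-9A-Fa-f]{6})")


class ColorTableParser(HTMLParser):
    """Collect {color name: [hex, ...]} from name cells and the cells after them."""

    def __init__(self):
        super().__init__()
        self.color_map = {}
        self._name_parts = None  # a list while we are inside a name cell
        self._current = None     # name that following color cells belong to

    def handle_starttag(self, tag, attrs):
        if tag != "td":
            return
        attrs = dict(attrs)
        self._finish_name()  # a new <td> implicitly ends the previous cell
        if "nowrap" in attrs:
            self._name_parts = []
        elif self._current is not None:
            match = BG_RE.search(attrs.get("style") or "")
            if match:
                self.color_map[self._current].append(match.group(1))

    def handle_data(self, data):
        if self._name_parts is not None:
            self._name_parts.append(data)

    def _finish_name(self):
        if self._name_parts is not None:
            self._current = "".join(self._name_parts).strip()
            self.color_map.setdefault(self._current, [])
            self._name_parts = None


parser = ColorTableParser()
parser.feed(SAMPLE)
parser.close()
print(parser.color_map)
```

To run it against the real page, `parser.feed(page.text)` would replace the sample, assuming the live markup matches the shape assumed here.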

Upvotes: 3

James Pringle

Reputation: 1079

The webpage you are trying to scrape is badly malformed HTML. Viewing the page source shows that most rows start with a <tr> followed by one or more <td> elements, all without closing tags. With BeautifulSoup you should specify an HTML parser explicitly, and in this case the result depends on whether that parser can recover the table structure.

I present a solution that relies on the structured format of the webpage itself. Instead of parsing the webpage as HTML, I use the fact that each color has its own line and each line has a common format.

import re
import requests

page = requests.get('http://people.csail.mit.edu/jaffer/Color/M.htm')
lines = page.text.splitlines()

opening = '<tr><td width="175*" nowrap>'
ending = '<td title="'
bg_re = r'style="background-color:(#.{6})'
color_map = dict()
for line in lines:
    if line.startswith(opening):
        color_name = line[len(opening):line.find(ending)].strip()
        color_hex = [match.group(1) for match in re.finditer(bg_re, line)]
        if color_name in color_map: 
            color_map[color_name].extend(color_hex)  # Note: some colors are repeated
        else:
            color_map[color_name] = color_hex

color_map['burnt russet']
## ['#722F37', '#79443B']

Quick and dirty, but it works.
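To try the line-based idea without touching the network, here is a small sketch that runs the same logic on two hard-coded lines. The exact layout of `SAMPLE_LINES` is an assumption inferred from the `opening` and `ending` markers above, and `build_color_map` is a hypothetical name. Two substitutions are mine: the pattern only accepts hex digits instead of any six characters, and `setdefault` replaces the if/else for repeated names.

```python
import re

# Assumed raw lines: one color per line, unclosed <td> tags, title before style.
SAMPLE_LINES = [
    '<tr><td width="175*" nowrap>burnt russet'
    '<td title="16" style="background-color:#722F37; color:#FFF">16'
    '<td title="43" style="background-color:#79443B; color:#FFF">43',
    '<tr><td width="175*" nowrap>burnt sienna'
    '<td title="38" style="background-color:#9E4732; color:#FFF">38',
    '</table>',  # lines that are not color rows are skipped
]

OPENING = '<tr><td width="175*" nowrap>'
ENDING = '<td title="'
BG_RE = re.compile(r'style="background-color:(#[0-9A-Fa-f]{6})')


def build_color_map(lines):
    """Map each color name to the list of background hex values on its line(s)."""
    color_map = {}
    for line in lines:
        if not line.startswith(OPENING):
            continue
        # The name sits between the opening marker and the first color cell.
        name = line[len(OPENING):line.find(ENDING)].strip()
        # setdefault + extend merges names that appear on more than one line.
        color_map.setdefault(name, []).extend(BG_RE.findall(line))
    return color_map


color_map = build_color_map(SAMPLE_LINES)
print(color_map["burnt russet"])
```

Passing `page.text.splitlines()` instead of `SAMPLE_LINES` would apply it to the downloaded page.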

Upvotes: 1
