user3560844

Reputation: 15

Extracting a string from html tags in python

Hopefully this isn't a duplicate question that I've overlooked; I've been scouring this forum for a question similar to the one below...

Basically, I've created a Python script that scrapes the callsign of each ship from the URL shown below and appends it to a list. It works, but whenever I iterate through the list and display each element, there is a '[' and ']' around each callsign. The output of my script is shown below:

Output

***********************     Contents of 'listOfCallSigns' List     ***********************

0 ['311062900']
1 ['235056239']
2 ['305500000']
3 ['311063300']
4 ['236111791']
5 ['245639000']
6 ['235077805']
7 ['235011590']

As you can see, it shows the square brackets for each callsign. I have a feeling that this might be down to an encoding problem within the BeautifulSoup library.

Ideally, I want the output to be without any of the square brackets and just the callsign as a string.

***********************     Contents of 'listOfCallSigns' List     ***********************

0 311062900
1 235056239
2 305500000
3 311063300
4 236111791
5 245639000
6 235077805
7 235011590

The script I'm currently using is shown below:

My script

# Importing the modules needed to run the script 
from bs4 import BeautifulSoup
import urllib2
import re
import requests
import pprint


# Declaring the url for the port of hull
url = "http://www.fleetmon.com/en/ports/Port_of_Hull_5898"


# Opening and reading the contents of the URL using the module 'urllib2'
# Scanning the entire webpage, finding a <table> tag with the id 'vessels_in_port_table' and finding all <tr> tags
portOfHull = urllib2.urlopen(url).read()
soup = BeautifulSoup(portOfHull)
table = soup.find("table", {'id': 'vessels_in_port_table'}).find_all("tr")


# Declaring a list to hold the call signs of each ship in the table
listOfCallSigns = []


# For each row in the table, using a regular expression to extract the first 9 numbers from each ship call-sign
# Adding each extracted call-sign to the 'listOfCallSigns' list
for i, row in enumerate(table):
    if i:
        listOfCallSigns.append(re.findall(r"\d{9}", str(row.find_all('td')[4])))


print "\n\n***********************     Contents of 'listOfCallSigns' List     ***********************\n"

# Printing each element of the 'listOfCallSigns' list
for i, row in enumerate(listOfCallSigns):
    print i, row  

Does anyone know how to remove the square brackets surrounding each callsign and just display the string?

Thanks in advance! :)

Upvotes: 0

Views: 130

Answers (2)

ncocacola

Reputation: 485

Change the last lines to:

# Printing each element of the 'listOfCallSigns' list
for i, row in enumerate(listOfCallSigns):
    print i, row[0]  # <-- added a [0] here

Alternatively, you can also add the [0] here:

for i, row in enumerate(table):
    if i:
        listOfCallSigns.append(re.findall(r"\d{9}", str(row.find_all('td')[4]))[0])  # <-- added a [0] here

The explanation here is that re.findall(...) returns a list (in your case, with a single element in it). So, listOfCallSigns ends up being a "list of sublists each containing a single string":

>>> listOfCallSigns
[['311062900'], ['235056239'], ['311063300'], ['236111791'],
 ['245639000'], ['305500000'], ['235077805'], ['235011590']]

When you enumerate listOfCallSigns, each row is the list that re.findall(...) returned and that you appended earlier in the code (that's why you can add the [0] after either of them).

So row and re.findall(...) are both of type "list of string(s)" and look like this:

>>> row
['311062900']

And to get the string inside the list, you need to access its first element, i.e.:

>>> row[0]
'311062900'
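
One caveat with [0]: if a cell contains no 9-digit number, re.findall(...) returns an empty list and [0] raises an IndexError. Here is a minimal sketch of two ways around that, assuming the cells look roughly like the made-up sample_cells strings below (they stand in for str(row.find_all('td')[4]); the real page markup may differ). The helper first_callsign is a hypothetical name, and the snippet uses print() so that it also runs on Python 3:

```python
import re

# Hypothetical stand-ins for str(row.find_all('td')[4]); the real markup may differ.
sample_cells = [
    "<td>311062900</td>",
    "<td>235056239</td>",
    "<td>n/a</td>",  # a cell with no 9-digit number
]


def first_callsign(cell_html):
    """Return the first run of 9 digits in cell_html, or None if there is none."""
    match = re.search(r"\d{9}", cell_html)
    return match.group(0) if match else None


# re.findall always returns a list, even when there is exactly one match.
nested = [re.findall(r"\d{9}", cell) for cell in sample_cells]

# Option 1: flatten an already-built nested list, skipping empty results.
flat = [found[0] for found in nested if found]

# Option 2: avoid the nesting in the first place by using re.search.
callsigns = [c for c in map(first_callsign, sample_cells) if c is not None]

for i, callsign in enumerate(callsigns):
    print(i, callsign)
```

Both options give plain strings; which one fits better depends on whether you still need the nested list elsewhere in the script.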

Hope this helps!

Upvotes: 3

Takide

Reputation: 335

This can also be done by stripping the unwanted characters from a string (for a list such as row, convert it first with str(row)), like so:

a = "string with bad characters []'] in here" 
a = a.translate(None, "[]'")
print a 
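
Note that the two-argument translate(None, ...) call above only works on Python 2 byte strings. On Python 3, a rough equivalent is to build a deletion table with str.maketrans, whose third argument lists the characters to remove. The variable names remove_chars and cleaned are only illustrative:

```python
# The third argument of str.maketrans lists the characters to delete.
remove_chars = str.maketrans("", "", "[]'")

a = "string with bad characters []'] in here"
cleaned = a.translate(remove_chars)
print(cleaned)

# The same table applied to the string form of a one-element list.
row = ['311062900']
print(str(row).translate(remove_chars))
```

This works on the printed representation rather than on the data itself, so indexing with row[0] is usually the more direct fix.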

Upvotes: 0
