Parsing a long HTML page with BeautifulSoup fails with half-parsed output

Question

I used the following script to parse the fund price of a particular fund:

import pandas as pd
from bs4 import BeautifulSoup
from ghost import Ghost
ghost = Ghost()
page,resources = ghost.open('http://bank.hangseng.com/1/PA_1_1_P1/ComSvlet_MiniSite_eng_gif?app=eINVCFundDetailsOV&pri_fund_code=U44217')
page,resources = ghost.evaluate("agree()", expect_loading=True)
page,resources = ghost.evaluate("MM_changeview('eINVCFundPriceDividend')", expect_loading=True)
# ghost.capture_to("hangseng.png")
soup = BeautifulSoup(page.content)
soup

The first half of the output soup is fine, but after that the tags all turn into uppercase, spaced-out text that BeautifulSoup does not parse, like the following:

22-07-201410.9500011.3900010.95000

 T R   V A L I G N = " t o p "   a l i g n = " c e n t e r " > 
 T D   C L A S S = " L i g h t G r e y "   V A L I G N = " T O P " > F O N T   C L A S S = " C O N T E N T " > 2 1 - 0 7 - 2 0 1 4 / F O N T > / T D > T D   C L A S S = " L i g h t G r e y "   V A L I G N = " T O P " > F O N T   C L A S S = " C O N T E N T " > 1 0 . 9 6 0 0 0 / F O N T > / T D > T D   C L A S S = " L i g h t G r e y "   V A L I G N = " T O P " > F O N T   C L A S S = " C O N T E N T " > 1 1 . 4 0 0 0 0 / F O N T > / T D > T D   C L A S S = " L i g h t G r e y "   V A L I G N = " T O P " > F O N T   C L A S S = " C O N T E N T " > 1 0 . 9 6 0 0 0 / F O N T > / T D > 
 / T R >

You can see that the output becomes garbage after the row dated 22-07-2014.

What happened?

lokheart · Accepted Answer

I found a solution in the related question "spaced output beautifulsoup": pass the parser name explicitly instead of letting BeautifulSoup pick one.

page.content
# Name Python's built-in parser explicitly instead of relying on the default choice
soup = BeautifulSoup(page.content, 'html.parser')

Now it works perfectly.
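
For anyone who wants to try the fix without Ghost, here is a minimal sketch. When no parser is named, BeautifulSoup uses the best one installed (lxml, then html5lib, then html.parser), so the same script can behave differently from one machine to another. The helper name extract_rows is hypothetical. The sample markup is an assumption: it is rebuilt from the garbled row in the question, on the guess that the "<" characters were lost. It is not the bank's actual page.

```python
from bs4 import BeautifulSoup


def extract_rows(markup):
    """Return the cell text of every table row as a list of lists."""
    # Naming the parser keeps the behaviour independent of what else is installed.
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)
    return rows


# Assumed reconstruction of one uppercase row from the question's output.
sample = (
    '<TR VALIGN="top" align="center">'
    '<TD CLASS="LightGrey" VALIGN="TOP"><FONT CLASS="CONTENT">21-07-2014</FONT></TD>'
    '<TD CLASS="LightGrey" VALIGN="TOP"><FONT CLASS="CONTENT">10.96000</FONT></TD>'
    '<TD CLASS="LightGrey" VALIGN="TOP"><FONT CLASS="CONTENT">11.40000</FONT></TD>'
    '<TD CLASS="LightGrey" VALIGN="TOP"><FONT CLASS="CONTENT">10.96000</FONT></TD>'
    '</TR>'
)

print(extract_rows(sample))
```

html.parser lowercases tag names, so the uppercase TR and TD tags are found by find_all("tr") and find_all("td"). If the output is still garbled on the real page, the encoding of page.content would be the next thing to check.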
