Reputation: 151

python conflicts in two external packages

I am writing code to combine functions from the python rawdog RSS reader library and the BeautifulSoup webscraping library. There is a conflict somewhere in the innards that I am trying to overcome.

I can replicate the problem with this simplified code:

    import sys, gzip
    def scrape(filename):
        contents = gzip.open(filename,'rb').read()
        contents = contents.decode('utf-8','replace')
        import BeautifulSoup as BS
        print 'before rawdog: ', len(BS.BeautifulSoup(contents)) # prints 4, correct answer
        from rawdoglib import rawdog as rd
        print 'after rawdog: ', len(BS.BeautifulSoup(contents)) # prints 3, incorrect answer

It does not matter in what order or where I do the imports: importing rawdog always causes BS.BeautifulSoup() to return the wrong result. I don't actually need rawdog anymore by the time I get to needing BeautifulSoup, so I've tried removing the package at that point, but BS is still broken. Fixes I have tried that have not worked:

- I noticed that the rawdog code does its own import of BeautifulSoup, so I tried removing import BeautifulSoup from the rawdog code and re-installing rawdog.
- Removing the rawdog modules before importing BeautifulSoup: for x in filter(lambda y: y.startswith('rawdog'), sys.modules.keys()): del sys.modules[x]
- Importing more specific classes/methods from rawdog, e.g. from rawdoglib.rawdog import FeedState
- Giving the problem method a new name, before and after importing rawdog: from BeautifulSoup import BeautifulSoup as BS
- from __future__ import absolute_import

No luck, I always get len(BeautifulSoup(contents)) == 3 if rawdog was ever imported into the namespace. Both packages are complex enough that I haven't been able to figure out exactly what the problem overlap is, and I'm not sure what tools to use to try to figure that out, other than searching through dir(BeautifulSoup) and dir(rawdog), where I haven't found good clues.

Updates, responding to answers: I omitted that the problem does not occur with every input file, which is crucial, sorry. The offending files are quite large so I don't think I can post them here. I will try to figure out the crucial difference between the good and bad files and post it. Thanks for the debugging help so far.

Further debugging! I have identified this block in the input text as problematic:

    function SwitchMenu(obj){
      if(document.getElementById){
      var el = document.getElementById(obj);
      var ar = document.getElementById("masterdiv").getElementsByTagName("span"); //DynamicDrive.com change
         if(el.style.display != "block"){ //DynamicDrive.com change
         for (var i=0; i<ar.length; i++){
            if (ar[i].className=="submenu") //DynamicDrive.com change
            ar[i].style.display = "none";
      }
      el.style.display = "block";
      }else{
        el.style.display = "none";
    }
}

}

If I comment out this block, then I get the correct parse through BeautifulSoup with or without the rawdog import. With the block, rawdog + BeautifulSoup is faulty. So should I just search my input for a block like this, or is there a better workaround?

Upvotes: 9

Answers (3)

alexis

Reputation: 50200

If rawdog can trigger the bug without importing BeautifulSoup (I take it you've checked that it's not imported indirectly?), they must have a shared dependency that is somehow loaded inconsistently. But the problem need not be monkey-patching: If they load different versions of the same library, you can get inconsistent behavior. E.g., if one of them uses a special import path, provides its own version of a top-level module, or has code like this:

    try:
        import ElementPath
    except ImportError:
        ElementPath = _SimpleElementPath()

To see if this is the problem, try the following: Load BeautifulSoup by itself, nothing else, and dump the list of modules and their location:

    import BeautifulSoup
    import sys
    sys.stdout = open("soup-modules.txt", "w")
    for k,v in sorted(sys.modules.items()):
        if v:
            print k, v.__dict__.get('__file__')

Then do the same with rawdog and diff the outputs. If you see a module with the same name but a different origin, that's probably your culprit.
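The diff step can also be scripted. The following is a minimal sketch, assuming each dump line has the form "name path" as written by the loop above. The helper names parse_dump and conflicting_modules and the sample paths are hypothetical, made up for illustration only.

```python
def parse_dump(text):
    """Parse lines of the form 'module_name path' into a dict."""
    modules = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            modules[parts[0]] = parts[1].strip()
    return modules


def conflicting_modules(dump_a, dump_b):
    """Return {name: (path_a, path_b)} for modules that appear in both
    dumps but were loaded from different files."""
    a, b = parse_dump(dump_a), parse_dump(dump_b)
    return {name: (a[name], b[name])
            for name in sorted(set(a) & set(b))
            if a[name] != b[name]}


# Hypothetical sample data, not real output from either package.
soup_dump = """BeautifulSoup /site-packages/BeautifulSoup.py
sgmllib /usr/lib/python2.7/sgmllib.py"""
rawdog_dump = """BeautifulSoup /site-packages/rawdoglib/BeautifulSoup.py
sgmllib /usr/lib/python2.7/sgmllib.py"""

conflicts = conflicting_modules(soup_dump, rawdog_dump)
for name, (path_a, path_b) in conflicts.items():
    print("%s: %s vs %s" % (name, path_a, path_b))
```

Keep in mind that this only catches modules with differing origins. It would not reveal a shared module whose attributes were replaced in place after loading, and a .py versus .pyc path for the same module would show up as a false positive.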

Upvotes: 0

lbolla

Reputation: 5411

It's a bug in rawdoglib's feedparser.py. rawdog is monkey-patching sgmllib: on line 198 it reads:

    if sgmllib.endbracket.search(' <').start(0):
        class EndBracketMatch:
            endbracket = re.compile('''([^'"<>]|"[^"]*"(?=>|/|\s|\w+=)|'[^']*'(?=>|/|\s|\w+=))*(?=[<>])|.*?(?=[<>])''')
            def search(self,string,index=0):
                self.match = self.endbracket.match(string,index)
                if self.match: return self
            def start(self,n):
                return self.match.end(n)
        sgmllib.endbracket = EndBracketMatch()

This is a script to reproduce the error:

    contents = '''<a><ar "none";
    </a> '''
    import BeautifulSoup as BS
    print 'before rawdog: ', len(BS.BeautifulSoup(contents)) # prints 4, correct answer
    from rawdoglib import rawdog as rd
    print 'after rawdog: ', len(BS.BeautifulSoup(contents)) # prints 3, incorrect

It breaks on the "<" inside the "a" tag. In the OP's snippet, it is triggered by the line: for (var i=0; i<ar.length; i++){ (note the "<" char).
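The difference between the two patterns can be seen with the re module alone, without installing either package. This is a sketch under one assumption: that the stock sgmllib.endbracket in Python 2 is re.compile('[<>]'). The replacement pattern is copied from the feedparser snippet above.

```python
import re

# Assumed stock pattern: Python 2's sgmllib is believed to define
# endbracket as '[<>]'.
stock = re.compile(r'[<>]')

# Replacement pattern copied from the feedparser snippet.
patched = re.compile(r'''([^'"<>]|"[^"]*"(?=>|/|\s|\w+=)|'[^']*'(?=>|/|\s|\w+=))*(?=[<>])|.*?(?=[<>])''')

contents = '<a><ar "none";\n</a> '
start = contents.index('<ar') + 1  # scan from just after the opening '<'

stock_match = stock.search(contents, start)
# EndBracketMatch.search() calls match(), which is anchored at 'start'.
patched_match = patched.match(contents, start)

print('stock:  ', stock_match.start(0) if stock_match else None)
print('patched:', patched_match.end(0) if patched_match else None)
```

The stock pattern finds the "<" of the closing tag on the next line. The replacement pattern finds nothing: the quoted string is followed by ";", so the first alternative fails, and the fallback alternative cannot cross the newline because "." does not match "\n" by default. As far as I can tell, the parser then treats the start tag as incomplete, which would explain the different parse.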

Issue submitted on rawdog's mailing list: http://lists.us-lot.org/pipermail/rawdog-users/2012-August/000327.html
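Until that is fixed upstream, one possible workaround is to save the original attribute before rawdog is first imported and put it back afterwards. The sketch below is untested against rawdog itself. preserve_attr is a hypothetical helper, and fake_sgmllib is a stand-in module so the example runs anywhere.

```python
import contextlib
import types


@contextlib.contextmanager
def preserve_attr(module, name):
    """Restore module.<name> to its current value when the block exits."""
    saved = getattr(module, name)
    try:
        yield
    finally:
        setattr(module, name, saved)


# Stand-in for sgmllib; hypothetical, for illustration only.
fake_sgmllib = types.ModuleType('fake_sgmllib')
fake_sgmllib.endbracket = 'stock pattern'

with preserve_attr(fake_sgmllib, 'endbracket'):
    # Simulates the assignment that importing rawdoglib performs.
    fake_sgmllib.endbracket = 'patched pattern'

print(fake_sgmllib.endbracket)  # back to the saved value
```

In the real program the idea would be to import sgmllib first and then wrap the rawdog import, e.g. with preserve_attr(sgmllib, 'endbracket'): from rawdoglib import rawdog as rd. Two caveats: the block has to wrap the first import of rawdog in the process, since later imports are cached and the patch is already applied, and rawdog's feedparser presumably expects its own patched version, so this only makes sense if rawdog's parsing is finished (or not needed) once the original is restored.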

Upvotes: 5

rsegal

Reputation: 401

I think the issue you're having is a chain of imports: the two different places where the BS package gets imported are conflicting.

This thread might be what you need, then.

(Also, BS package is a wonderful thing to be able to say in a serious context.)

Upvotes: 0
