user1328021

Reputation: 9830

Writing to a file and getting weird indentation

I have the following snippet of code, which takes a URL, opens it, parses out just the text, and then searches for widgets. It detects a widget by looking for the word widget1, which marks the start, and then endwidget, which marks the end.

Basically the code writes all lines of text to a file as soon as it finds the word widget1 and ends when it reads endwidget. However, my code is indenting all lines after the first widget1 line.

This is my output:

widget1 this is a really cool widget
       it does x, y and z 
       and also a, b and c
       endwidget

What I want is:

widget1 this is a really cool widget
it does x, y and z 
and also a, b and c
endwidget

Why am I getting this indentation? This is my code...

 for url in urls:
        page = mech.open(url)
        html = page.read()
        soup = BeautifulSoup(html)
        text= soup.prettify()
        texts = soup.findAll(text=True) 

        def visible(element):
            if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
                # If the parent of the element is any of these, ignore it
                return False
            elif re.match('<!--.*-->', str(element)):
                # If the element is an HTML comment, ignore it
                return False
            else:
                # Otherwise, return True as these are the elements we need
                return True

        visible_texts = filter(visible, texts)

        inwidget=0
        # open a file for write
        for line in visible_texts:
        # if line doesn't contain .widget1 then ignore it
            if ".widget1" in line and inwidget==0:
                match = re.search(r'\.widget1 (\w+)', line)
                line = line.split (".widget1")[1]   
                # make the next word after .widget1 the name of the file
                filename = "%s" % match.group(1) + ".txt"
                textfile = open (filename, 'w+b')
                textfile.write("source:" + url + "\n\n")
                textfile.write(".widget1" + line)
                inwidget = 1
            elif inwidget == 1 and ".endwidget" not in line:
                print line
                textfile.write(line)
            elif ".endwidget" in line and inwidget == 1:
                textfile.write(line)
                inwidget= 0
            else:
                pass

Upvotes: 0

Views: 339

Answers (2)

LtWorf

Reputation: 7600

To go from your output to your wanted output, do this:

# a is your output
a = '\n'.join(x.strip() for x in a.split('\n'))

Upvotes: 0

user1767344

Reputation: 166

Every line except the first is indented because you build the first line yourself with textfile.write(".widget1" + line), while the remaining lines are written exactly as they appear in the HTML, leading whitespace included. You can remove the unwanted whitespace with str.strip(): change textfile.write(line) to textfile.write(line.strip() + "\n"). The "\n" is needed because strip() also removes the trailing newline.
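As a rough sketch of that fix in isolation, the loop below strips each line and puts the newline back before writing. The function name write_widget, the sample list, and the in-memory buffer are made up for illustration and stand in for visible_texts and textfile from the question; skipping blank lines is also an assumption, not something the question asks for.

```python
import io


def write_widget(lines, out):
    """Copy lines from the first '.widget1' line through '.endwidget' to out."""
    in_widget = False
    for line in lines:
        text = line.strip()  # drops the indentation and the trailing newline
        if not in_widget and ".widget1" in text:
            in_widget = True
        if in_widget and text:  # assumption: blank lines are not wanted
            out.write(text + u"\n")  # put back the newline that strip() removed
            if ".endwidget" in text:
                in_widget = False


# Hypothetical input that mimics the indented text nodes from the question.
sample = [
    u".widget1 this is a really cool widget\n",
    u"       it does x, y and z \n",
    u"       and also a, b and c\n",
    u"       .endwidget\n",
]

buf = io.StringIO()  # stands in for the real text file
write_widget(sample, buf)
print(buf.getvalue())
```

Stripping once at the top of the loop means every branch sees the same cleaned text, so the first line and the later lines are written the same way.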

Upvotes: 1
