Reputation: 9830
I have the following snippet of code, which takes a URL, opens it, parses out just the text, and then searches for widgets. It detects a widget by looking for the word widget1, which marks the start, and then endwidget, which denotes the end of the widget.
Basically, the code starts writing lines of text to a file as soon as it finds the word widget1 and stops when it reads endwidget. However, my code is indenting all lines after the first widget1 line.
This is my output:
widget1 this is a really cool widget
    it does x, y and z
    and also a, b and c
    endwidget
What I want is:
widget1 this is a really cool widget
it does x, y and z
and also a, b and c
endwidget
Why am I getting this indentation? This is my code...
for url in urls:
    page = mech.open(url)
    html = page.read()
    soup = BeautifulSoup(html)
    text = soup.prettify()
    texts = soup.findAll(text=True)

    def visible(element):
        if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
            # If the parent of the element is any of those, ignore it
            return False
        elif re.match('<!--.*-->', str(element)):
            # If the element is an HTML comment, ignore it
            return False
        else:
            # Otherwise, return True as these are the elements we need
            return True

    visible_texts = filter(visible, texts)
    inwidget = 0
    # open a file for write
    for line in visible_texts:
        # if line doesn't contain .widget1 then ignore it
        if ".widget1" in line and inwidget == 0:
            match = re.search(r'\.widget1 (\w+)', line)
            line = line.split(".widget1")[1]
            # make the next word after .widget1 the name of the file
            filename = "%s" % match.group(1) + ".txt"
            textfile = open(filename, 'w+b')
            textfile.write("source:" + url + "\n\n")
            textfile.write(".widget1" + line)
            inwidget = 1
        elif inwidget == 1 and ".endwidget" not in line:
            print line
            textfile.write(line)
        elif ".endwidget" in line and inwidget == 1:
            textfile.write(line)
            inwidget = 0
        else:
            pass
Upvotes: 0
Views: 339
Reputation: 7600
To go from your output to your wanted output, do this:
# a is your output
a = '\n'.join(line.strip() for line in a.split('\n'))
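As a quick check of that one-liner, here is a small sketch that runs it on text shaped like the output in the question. The sample string and its four-space indent are assumptions made for illustration, and the result is stored in a new name, cleaned, so the original stays available for comparison.

```python
# Sample text mirroring the indented output from the question
# (the exact amount of indentation is an assumption).
a = (
    "widget1 this is a really cool widget\n"
    "    it does x, y and z\n"
    "    and also a, b and c\n"
    "    endwidget"
)

# Strip leading and trailing whitespace from every line, then rejoin.
cleaned = "\n".join(line.strip() for line in a.split("\n"))
print(cleaned)
```

This only repairs the text after the fact; it does not change what the loop writes to the file.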
Upvotes: 0
Reputation: 166
The reason every line except the first is indented is that you rebuild the first line yourself with textfile.write(".widget1" + line), whereas the remaining lines are written exactly as they come from the HTML, leading whitespace included. You can remove the unwanted whitespace by calling str.strip() on each line, that is, by changing textfile.write(line) to textfile.write(line.strip()).
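One thing to watch for: str.strip() with no arguments removes trailing newlines as well as leading spaces, so lines written this way can run together. Below is a minimal sketch of stripping at write time while putting the newline back. The helper name write_stripped and the sample list are hypothetical, skipping whitespace-only entries is a design choice rather than something the question asks for, and the sketch is written for Python 3 with an in-memory buffer standing in for the real file.

```python
import io


def write_stripped(lines, out):
    """Write each line to `out` without surrounding whitespace.

    str.strip() also removes a trailing newline, so one is added back
    explicitly to keep the lines separate.
    """
    for line in lines:
        stripped = line.strip()
        if stripped:  # skip whitespace-only text nodes
            out.write(stripped + "\n")


# Hypothetical text nodes, standing in for visible_texts in the question.
sample = [
    ".widget1 this is a really cool widget\n",
    "    it does x, y and z\n",
    "\n",
    "    .endwidget\n",
]

buf = io.StringIO()  # stands in for the opened text file
write_stripped(sample, buf)
result = buf.getvalue()
print(result)
```

In the original loop the same idea would amount to writing line.strip() + "\n" in the two branches that currently call textfile.write(line).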
Upvotes: 1