Reputation:
I'm trying to scrape a website. I managed to extract all the text I wanted using this template:
nameList = bsObj.findAll("strong")
for text in nameList:
    string = text.get_text()
    if "Title" in string:
        print(string)
And I get the texts in this fashion:
Title 1: textthatineed
Title 2: textthatineed
Title 3: textthatineed
Title 4: textthatineed
Title 5: textthatineed
Title 6: textthatineed
Title 7: textthatineed ....
Is there any way to cut the string in Python, using BeautifulSoup or otherwise, so that I get only "textthatineed" without the "Title (number): " prefix?
Upvotes: 1
Views: 2341
Reputation: 374
In Python, there is a very handy operation that can be done on strings called slicing.
Here is an example taken from the docs:
>>> word = 'Python'
>>> word[0:2] # characters from position 0 (included) to 2 (excluded)
'Py'
>>> word[2:5] # characters from position 2 (included) to 5 (excluded)
'tho'
>>> word[:2] + word[2:]
'Python'
>>> word[:4] + word[4:]
'Python'
>>> word[:2] # characters from the beginning to position 2 (excluded)
'Py'
>>> word[4:] # characters from position 4 (included) to the end
'on'
>>> word[-2:] # characters from the second-last (included) to the end
'on'
So in your case you would do something like this:
text = 'Title 1: important information here'
# 'Title 1: ' is the first 9 characters, i.e. indices 0 through 8,
# so the information you need begins at index 9
info = text[9:]
# For the general case, where the prefix length can vary (e.g. 'Title 10: '),
# slice the original string from two characters past the first colon instead
index = text.find(':') + 2
info = text[index:]
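To apply this to a whole list of scraped strings, something along these lines might work. It is only a sketch: the helper name `strip_label` and the sample strings are made up for illustration, and it swaps in `str.partition` instead of `find()` so that a string without a colon is returned unchanged rather than sliced at the wrong place.

```python
def strip_label(line, sep=":"):
    """Return the text after the first `sep`, or `line` unchanged if `sep` is absent."""
    # partition() splits at the first occurrence only and returns a 3-tuple;
    # the middle element is empty when the separator was not found.
    head, found, tail = line.partition(sep)
    return tail.strip() if found else line


# Hypothetical sample data standing in for the scraped <strong> texts
lines = ["Title 1: textthatineed", "Title 10: more text", "no label here"]
cleaned = [strip_label(line) for line in lines]
print(cleaned)  # ['textthatineed', 'more text', 'no label here']
```

In the scraping loop you would call `strip_label(text.get_text())` in place of printing the raw string.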
Upvotes: 1
Reputation: 270
Say we have
s = 'Title 1: textthatineed'
The text we need starts two characters after the colon, so we find the colon's index, move two characters along, and take the substring from that index to the end:
index = s.find(':') + 2
title = s[index:]
Note that find() only returns the index of the first occurrence, so titles containing colons are unaffected.
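A quick check of that behaviour, with made-up sample strings. It also shows one caveat worth knowing: `find()` returns -1 when there is no colon at all, so the `+ 2` slice would silently drop the first character. The guarded helper `after_first_colon` is a hypothetical name, just one way to handle that case.

```python
s = "Title 3: Python: slicing strings"
index = s.find(":") + 2        # the first colon is at index 7, so index is 9
title = s[index:]
print(title)                   # Python: slicing strings

# Caveat: with no colon, find() returns -1, so the slice starts at index 1
t = "no colon"
broken = t[t.find(":") + 2:]
print(broken)                  # o colon


def after_first_colon(text):
    """Return what follows the first ': ', or the text unchanged if there is no colon."""
    pos = text.find(":")
    if pos == -1:
        return text
    return text[pos + 2:]


print(after_first_colon(t))    # no colon
```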
Upvotes: 1