user4663715

Cutting a part of string variable in python (web scraping)

I'm trying to scrape a website. I managed to extract all the text I wanted using this code:

nameList = bsObj.findAll("strong")
for text in nameList:
    string = text.get_text()
    if "Title" in string:
        print(text.get_text())

And I get the texts in this fashion:

Title 1: textthatineed

Title 2: textthatineed

Title 3: textthatineed

Title 4: textthatineed

Title 5: textthatineed

Title 6: textthatineed

Title 7: textthatineed ....

Is there any way to cut the string in Python, using BeautifulSoup or otherwise, so that I get only "textthatineed" without the "Title (number): " prefix?

Upvotes: 1

Views: 2341

Answers (2)

Apara

Reputation: 374

In Python, there is a very handy operation that can be done on strings called slicing.

An example taken from the docs

>>> word = 'Python'
>>> word[0:2]  # characters from position 0 (included) to 2 (excluded)
'Py'
>>> word[2:5]  # characters from position 2 (included) to 5 (excluded)
'tho'
>>> word[:2] + word[2:]
'Python'
>>> word[:4] + word[4:]
'Python'
>>> word[:2]   # character from the beginning to position 2 (excluded)
'Py'
>>> word[4:]   # characters from position 4 (included) to the end
'on'
>>> word[-2:]  # characters from the second-last (included) to the end
'on'

So in your case you would do something like this

text = 'Title 1: important information here'
# 'Title 1: ' is the first 9 characters, i.e. indices 0 through 8,
# so the information you need begins at index 9
info = text[9:]

# For the general case, find the colon, then skip it and the space after it
index = text.find(':') + 2
info = text[index:]
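
Applied to the loop in the question, this could look roughly like the sketch below. The helper name strip_label is made up for illustration, and the list lines is a stand-in for the strings you get from text.get_text(). It assumes the label always ends at the first colon.

```python
def strip_label(line):
    """Return the part of line after the first colon, without leading spaces."""
    colon = line.find(':')
    if colon == -1:
        # find() returns -1 when there is no colon; leave the line unchanged
        return line
    return line[colon + 1:].lstrip()

# Stand-in for the strings produced by text.get_text() in the question
lines = ['Title 1: textthatineed', 'Title 12: other text: with colon']

results = [strip_label(line) for line in lines if 'Title' in line]
print(results)
```

Using lstrip() instead of a fixed + 2 means the result is the same whether or not a space follows the colon.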

Upvotes: 1

ren

Reputation: 270

Say we have

s = 'Title 1: textthatineed'

The text we need starts two characters after the colon, so we find the colon's index, add two, and take the substring from that index to the end:

index = s.find(':') + 2
title = s[index:]

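Note that find() returns only the index of the first occurrence, so colons later in the text do not affect the result. If the string contains no colon at all, find() returns -1, so index becomes 1 and only the first character is dropped.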
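
If some strings might have no colon, str.partition is a possible alternative to find(). This is only a sketch, and the names label, sep and needed are made up here. partition splits at the first occurrence of the separator and returns an empty separator when it is not found, which makes the missing case easy to detect.

```python
s = 'Title 1: textthatineed'

# partition splits on the first ': ' only and returns a 3-tuple:
# (text before, the separator itself, text after)
label, sep, rest = s.partition(': ')

# sep is '' when ': ' was not found; fall back to the whole string then
needed = rest if sep else s
print(needed)
```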

Upvotes: 1
