miles fryett

Reputation: 23

Python: how to split a string at the "→" character

I'm web scraping with Selenium and Python 3. I'm getting text from a data box and need to split it at the "→" character, but I'm lost on how to do that. The line I have is whole_text.split("→"), and an example value for whole_text is

"He let them in and locked the door, leaving the big, bad wolf outside trying to blow the brick house down. → Climax, One day, a big, bad wolf came along and blew down the straw house! → Rising Action,"

I think it has something to do with the encoding. When I pasted "→" into Wing IDE it told me to pick an encoding, and I picked UTF-8. I ran into a similar problem earlier with the same website: Python raised an exception about encoding when I copied and pasted in a "-" character that was slightly longer than usual. How can I go about converting this value, or even seeing what encoding the character is?

I should also mention that, for what I'm doing, splitting at any non-ASCII character would also work.

Edit: (more code)

def setmatch(self, soup):
    r_soup = soup.find('div', attrs = {'class' : 'rightanswer'})
    hole_text = r_soup.get_text()
    hole_text = hole_text[23:]
    #hole_text = self.make_unicode(hole_text)
    hole_text.split("→")
    for i in range(0, len(hole_text), 2):
        Op = hole_text[i]
        ans = hole_text[i+1]
    Op_ans = zip(op, ans)
    self.options_match = Op_ans

Upvotes: 0

Views: 801

Answers (1)

Panagiotis Kanavos

Reputation: 131180

I have no problem splitting what you posted. If I copy that string into a Python console and split it, the resulting list has 3 elements.

>>> whole_text="He let them in and locked the door, leaving the big, bad wolf outside trying to blow the brick house down. → Climax, One day, a big, bad wolf came along and blew down the straw house! → Rising Action,"
>>> arr=whole_text.split("→")
>>> len(arr)
3
>>> arr[2]
' Rising Action,'
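
One thing worth checking in the code from the question: str.split does not change the string it is called on, it returns a new list, so the result has to be assigned to a name before it can be used. A minimal sketch with a shortened version of the sample text (the variable names here are only for illustration):

```python
text = ("He let them in and locked the door. → Climax, "
        "One day, a big, bad wolf came along! → Rising Action,")

# Calling split without keeping the result changes nothing:
# text is still the same single string afterwards.
text.split("→")

# Assign the returned list to actually work with the pieces.
parts = text.split("→")

print(type(text).__name__, type(parts).__name__, len(parts))
```

If the split result is discarded, indexing the original variable afterwards yields single characters rather than the pieces between the arrows.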

Python 3 strings are Unicode. Almost all web sites use UTF-8, so Unicode and UTF-8 aren't a special case, they're the default. Unicode was always the default and only option in other languages like Java, JavaScript, all .NET languages like C# and F#, and even in the ancient pre-.NET Visual Basic.

In web scraping, UTF-8 is almost never the problem.

Most likely, what you try to split contains an HTML escape sequence like &rarr; instead of the arrow character itself. This is similar to the \n escape sequence in Python strings: it renders as a newline on screen and even results in a newline if you copy the output, but if you try to split the source there is no newline character to find, only the escape sequence.
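
If that turns out to be the case, one option is to decode the entities before splitting. The sketch below uses html.unescape from the standard library; the `raw` string is a made-up, shortened stand-in for whatever markup the page really contains, so treat it as an assumption:

```python
import html

# Hypothetical raw markup: the arrow is stored as the entity &rarr;
raw = ("He let them in and locked the door. &rarr; Climax, "
       "One day, a wolf came along. &rarr; Rising Action,")

# The arrow character is not in the raw string, so split finds nothing.
unsplit = raw.split("→")

# html.unescape turns named and numeric entities into real characters.
decoded = html.unescape(raw)
parts = [part.strip() for part in decoded.split("→")]

print(len(unsplit), len(parts))
```

Splitting the raw string directly on "&rarr;" would also work, but decoding first covers numeric forms of the same character as well.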

Another possibility is that the web page uses FontAwesome or a similar font and CSS classes to render arrows instead of escape sequences, e.g.:

<i class="fas fa-arrow-right"></i>
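
In that case there is no arrow in the text at all, only an empty tag, so the separator has to be recovered from the markup. A rough sketch using only the standard library's html.parser; the class name `ArrowIconText`, the helper `text_with_arrows` and the sample markup are all hypothetical, and the real page may use different tags or class names:

```python
from html.parser import HTMLParser

class ArrowIconText(HTMLParser):
    """Collect the text of a fragment, emitting "→" for arrow icon tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == "i" and "fa-arrow-right" in classes:
            self.chunks.append("→")

    def handle_data(self, data):
        self.chunks.append(data)

def text_with_arrows(markup):
    parser = ArrowIconText()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)

sample = 'Story text <i class="fas fa-arrow-right"></i> Climax,'
parts = [p.strip() for p in text_with_arrows(sample).split("→")]
print(parts)
```

With BeautifulSoup the same idea would be to replace each icon tag with a plain-text marker before extracting the text.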

You can find lists of HTML escape sequences in many places, e.g. a list of arrow characters. Just as there are many arrow characters, there are many different dashes as well. That's why you can find longer or shorter dashes in documents and HTML pages.
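
To illustrate, the dashes that look alike are distinct code points with their own Unicode names. Since the question mentions that splitting at any non-ASCII character would be good enough, the sketch below also shows one way to do that with a regular expression; the helper name `split_on_non_ascii` is made up for this example:

```python
import re
import unicodedata

# Hyphen-minus, en dash and em dash, written as escapes to keep them apart
dashes = ["\u002d", "\u2013", "\u2014"]
names = [unicodedata.name(ch) for ch in dashes]
print(names)

def split_on_non_ascii(text):
    """Split text at every run of characters outside the ASCII range."""
    return [part.strip() for part in re.split(r"[^\x00-\x7f]+", text)]

print(split_on_non_ascii("first → second \u2013 third"))
```

This treats every non-ASCII character as a separator, so it only fits text where such characters never appear inside the pieces you want to keep.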

You'll have to inspect the source of the web page that's causing you trouble to find out what it really contains. Since you use Selenium, you'll have to inspect the actual string returned by get_text() too.
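
A quick way to inspect a string is to print it with ascii(), which shows every non-ASCII character as an escape, and to look up the Unicode name of each such character. This is only a sketch: `describe_non_ascii` is a hypothetical helper and `sample` stands in for the string your scraper actually returns.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (index, character, code point, Unicode name) per non-ASCII character."""
    return [
        (i, ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]

sample = "house down. → Climax,"

# ascii() escapes anything outside ASCII, so look-alike characters stand out
print(ascii(sample))
print(describe_non_ascii(sample))
```

If the list comes back empty for the scraped text, the arrow is not stored as a character in that string, which points back to an escape sequence or an icon tag.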

Upvotes: 1
