Juanse

Reputation: 21

re.search() keeps returning None and I can't find the error

I'm coding an automatic reader for legal documents in Spanish.

By web scraping, I get this string:

'DECAD-2021-368-APN-JGM - Dase por designada Directora de Seguimiento y Evaluación de la Gestión.'

I checked with type() and it's a str, so it's Unicode (as I understand it, it can't be anything else).

The problem is that I keep running this re.search(), which decides whether or not I continue with another process, and it keeps returning None. I don't understand what I'm doing wrong. I also tried with and without the re.UNICODE flag.

if re.search(r"( Dase por designad[o]?[a]?)", str(b), re.UNICODE) is not None:
    return "I'm gonna read it"
else:
    return "I'm not gonna read it"

Note: [o]?[a]? is there to match text that refers to either male or female bureaucrats.

I tried different versions of the regex:

"( Dase por designad[o]?[a]?)" without r before string.
"( Dase por designad)"
"Dase por designad"

I've written a lot of re.search() calls for this project, but for some reason I'm stuck on this one.

I think it must be a simple problem; I just can't see it.

Answering and adding requested information: I'm writing and testing this with Spyder 5, running on Anaconda, on Windows 10. Python 3.7.10

Blacknight: I checked by hardcoding the string, and it works. The problem is that it doesn't work when the string comes from this:

link = "/detalleAviso/primera/243131/20210419"
url = f"https://www.boletinoficial.gob.ar{link}"
req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})

soup = BeautifulSoup(req.content, 'html.parser')

a = soup.find(id="tituloDetalleAviso")

b = a.find('h6').text
b = str(b)

If I print b in the console, copy-paste the output to hardcode it, and re-run, I get a match. But I don't get one when the string comes from the web scraping.

I just ran b == c and it returns False, where b is the value returned by the web scraping and c is the printed output of a previous run of that same scrape.

I tried str(b) and it doesn't work.

Upvotes: 1

Views: 1208

Answers (2)

Glyph

Reputation: 31860

The accepted answer, already provided, answers the question directly. But what you're really asking here is "when this happens, how do I debug it?"

First, take care to detail your requirements. If someone shows up to this answer in 6 months' time, will your imports work?

I did pip install bs4 requests in a virtualenv, and then pip freeze, and I got this:

$ python -V
Python 3.8.7
$ pip freeze
beautifulsoup4==4.9.3
bs4==0.0.1
certifi==2020.12.5
chardet==4.0.0
idna==2.10
requests==2.25.1
soupsieve==2.2.1
urllib3==1.26.4

Second, include a fully runnable example. Include your import lines, to show where you're importing BeautifulSoup, requests, etc. This saves a lot of time for answerers.

Third, you need to preserve the exact string that you're dealing with. Clearly, on your computer, copy/pasting is doing some kind of whitespace normalization. I'm not sure why; on macOS Big Sur and Emacs I can clearly see that the copied/pasted string has funky whitespace in it:

(screenshot: note the red underlines)

Given that, you want to do something like this:

import base64

print(base64.b64encode(b.encode("utf-8")))
print(b)

This ASCII-armors your string value in such a way that it can be reconstructed bit-for-bit accurate, without relying on your operating system clipboard to leave it undamaged. You'll get a value like this:

b'REVDQUQtMjAyMS0zNjgtQVBOLUpHTSAtIERhc2UgcG9ywqBkZXNpZ25hZGEgRGlyZWN0b3JhIGRlwqBTZWd1aW1pZW50byB5wqBFdmFsdWFjacOzbiBkZcKgbGHCoEdlc3Rpw7NuLg=='

which you can then load back up with base64.b64decode(...).decode("utf-8") to ensure people can see the exact same thing, even if the web page being scraped changes.
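As a minimal sketch of that round trip: the `scraped` value below is a shortened stand-in for the real page text (an assumption for illustration, not the actual scrape), with the no-break space written as an escape so that it survives copy/paste.

```python
import base64

# Stand-in for the scraped text: "\xa0" is a no-break space (assumed sample).
scraped = "Dase por\xa0designada"

# Encode: the result is plain ASCII, so a clipboard can't silently alter it.
armored = base64.b64encode(scraped.encode("utf-8"))
print(armored)

# Decode: anyone can rebuild the identical string from the armored form.
restored = base64.b64decode(armored).decode("utf-8")
print(restored == scraped)                # True
print(restored == "Dase por designada")   # False: ordinary space vs. no-break space
```

The last comparison is the same situation as b == c returning False in the question: the two strings look identical when printed but differ in one invisible character.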

Finally, you may want to investigate the string yourself, to understand exactly what these invisible characters are. Here's a quick program that can give you a good view of what is going on with invisibles, control characters, whitespace, etc, in a string, using the built-in unicodedata module:

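import unicodedata

# `text` is the string under inspection (b in the question).
for character in text:
    # The default avoids a ValueError for characters that have no name,
    # such as control characters.
    print(repr(character), "-", unicodedata.name(character, "<unnamed>"))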

A snippet of the output from your string shows:

'p' - LATIN SMALL LETTER P
'o' - LATIN SMALL LETTER O
'r' - LATIN SMALL LETTER R
'\xa0' - NO-BREAK SPACE
'd' - LATIN SMALL LETTER D
'e' - LATIN SMALL LETTER E
's' - LATIN SMALL LETTER S

So you can see that all the funky spaces are no-break spaces.
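Once you know the culprit is a no-break space, one possible fix is to swap it for an ordinary space before searching. This is only a sketch: `scraped` is an assumed, shortened stand-in for the real title, and the pattern is a simplified form of the one in the question.

```python
import re

# Stand-in for the scraped title; "\xa0" is the no-break space (assumed sample).
scraped = "DECAD-2021-368-APN-JGM - Dase por\xa0designada Directora"

pattern = r" Dase por designad[oa]"

# The pattern expects an ordinary space after "por", so this finds nothing.
before = re.search(pattern, scraped)
print(before)   # None

# Replace no-break spaces with ordinary spaces, then search again.
cleaned = scraped.replace("\xa0", " ")
after = re.search(pattern, cleaned)
print(after.group(0) if after else None)
```

Replacing only "\xa0" keeps the change narrow; other unusual whitespace characters, if the page contains any, would need their own handling.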

Upvotes: 0

H. Pope

Reputation: 123

I used difflib to compare the typed string with the scraped one, and it highlighted some sort of difference between the spaces.
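The comparison code isn't shown here, so the following is just one way it might look, using difflib.SequenceMatcher. Both strings are assumed samples: `typed` uses ordinary spaces and `scraped` stands in for the web text.

```python
import difflib

typed = "Dase por designada"        # typed by hand, ordinary spaces only
scraped = "Dase por\xa0designada"   # assumed stand-in for the scraped text

matcher = difflib.SequenceMatcher(None, typed, scraped, autojunk=False)

# Keep only the spans where the two strings differ.
differences = [
    (tag, typed[i1:i2], scraped[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
]

for tag, left, right in differences:
    # ascii() escapes invisible characters so the difference is readable.
    print(tag, ascii(left), "->", ascii(right))
```

Printing with ascii() matters here: a plain print would show two spaces that look the same.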

Changing the regex to recognize any whitespace character instead of just " " seems to have fixed it. The new regex being:

r"(\s*Dase\s*por\s*designad[o]?[a]?)"

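For some reason a single \s didn't fix it for me; the quantifier had to allow more than one character before I got a match. As a quick fix I used *, which matches zero or more, so it also accepts the words run together with no whitespace at all. You may want to tighten that, for example to + (one or more).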
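A stricter variant can be sketched with + in place of *. The sample strings are assumptions for illustration. In Python 3, \s in a str pattern matches Unicode whitespace, which includes the no-break space.

```python
import re

# Assumed stand-in for the scraped title, with a no-break space after "por".
scraped = "DECAD-2021-368-APN-JGM - Dase por\xa0designada Directora"
run_together = "Dasepordesignada"   # no whitespace at all

# The quick fix: * allows zero whitespace characters between the words.
loose = re.compile(r"(\s*Dase\s*por\s*designad[o]?[a]?)")

# Stricter: + requires at least one whitespace character; [oa] needs an ending.
strict = re.compile(r"Dase\s+por\s+designad[oa]")

print(bool(loose.search(scraped)), bool(strict.search(scraped)))
print(bool(loose.search(run_together)), bool(strict.search(run_together)))
```

Both patterns find the scraped title, but only the loose one accepts the run-together text, which is the trade-off in choosing between * and +.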

Upvotes: 2
