barracuda

Reputation: 1058

Exclude whitespace from search pattern

I am trying to use a regular expression with findall(). The issue I'm having is that there is an unknown number of whitespace characters (spaces, tabs, linefeeds, carriage returns) in the patterns.

In the example below I want to use findall() to get the text inside <D> </D> whenever an </A> is found after the </D>. My problem is that there are whitespace characters after </D>.

In the example below I need to retrieve Second Text. The regular expression I have only works if there is no whitespace between </D> and </A>. This is what I tried:

regex = '<D>(.+?)</D></A>'

<A> 
   <B> Text </B> 
   <D> Second Text</D>
</A>

Upvotes: 1

Views: 720

Answers (4)

Aprillion

Reputation: 22340

If you need to match the whitespace between </D> and </A>:

regex = r'<D>(.+?)</D>\s*</A>'

Notice the r'' raw string literal for regular expressions in Python, which avoids the double-escaping that would be needed in normal strings:

regex = '<D>(.+?)</D>\\s*</A>'

And to make . match newlines as well, pass the re.DOTALL flag when matching.
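A quick sketch of that fix applied to the sample input from the question:

```python
import re

html = """<A> 
   <B> Text </B> 
   <D> Second Text</D>
</A>"""

# \s* absorbs any run of spaces, tabs, or newlines between </D> and </A>;
# re.DOTALL additionally lets . match newlines inside <D>...</D>
matches = re.findall(r'<D>(.+?)</D>\s*</A>', html, re.DOTALL)
print(matches)  # → [' Second Text']
```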

Upvotes: 3

Learner

Reputation: 5302

It looks like it is a portion of XML, so it is better not to use regex here; try lxml, bs4, etc. instead. BTW, I tried a hybrid method: first select the A tag, then select the text inside D within that A.

import re
# let's take a string, txt, that is very rough and does not even follow the rules of XML

txt = """<A> 


   <B> Text </B> 


   <D> Second Text</D>


</A> line 3\n
<S> 
   <B> Text </B> 
   <D> Second Text</D>
</A>
<E> 
   <B> Text </B> 
   <D> Second Text</D>
</A>"""

# first grab each <A>...</A> block, then the text inside <D>...</D>
A = re.findall(r'<A>[\w\W]*?</A>', txt)
print(re.findall(r'(?<=<D>)[\w\W]*?(?=</D>)', ''.join(A)))

Output-

[' Second Text']

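If the fragment is well-formed XML, even the standard library parser sidesteps the whitespace problem entirely; a minimal sketch with xml.etree.ElementTree (no third-party install assumed):

```python
import xml.etree.ElementTree as ET

snippet = """<A> 
   <B> Text </B> 
   <D> Second Text</D>
</A>"""

# a real parser handles arbitrary whitespace between tags for free
root = ET.fromstring(snippet)
print(root.find('D').text)  # → ' Second Text'
```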

Upvotes: 0

PaulMcG

Reputation: 63762

Pyparsing is not always recommended when HTML parsing libs like BeautifulSoup do such a nice job creating a document object representing the HTML page. But sometimes, you don't want the whole document, you just want to pick out snippets.

Regex is a really fragile extractor when scraping web pages: whitespace can crop up in surprising places, tags sometimes get attributes when you don't expect them to, upper- and lower-case tag names are both acceptable, and so on. Pyparsing's helper method makeHTMLTags(tagname) does more than just wrap <>s around the input string - it handles all the whitespace, letter-case, and attribute variability, and still gives you a pretty readable program when you are done. Downside: pyparsing is not the snappiest in performance.

See the different examples in the input test, and the matches that are found:

test = """\
<A> 
   <B> Text </B> 
   <D> Second Text</D>
</A>
<A> 
   <B> Text </B> 
   <d extra_attribute='something'> another Text</d>
</A>
<A> 
   <B> Text </B> 
   <D> yet another Text</D>
   \t\t
</A>
"""

from pyparsing import makeHTMLTags, SkipTo, anyOpenTag, lineno, col

# makeHTMLTags will return patterns for both the opening and closing tags
d,d_end = makeHTMLTags('d')
a,a_end = makeHTMLTags('a')

# define the pattern you want to match
pattern = d + SkipTo(d_end, failOn=anyOpenTag)('body') + d_end + a_end

# use scanString to scan the input HTML, and get match,start,end triples
for match, loc, endloc in pattern.scanString(test):
    print('"%s" at line %d, col %d' % (match.body, lineno(loc, test), col(loc, test)))

prints

"Second Text" at line 3, col 4
"another Text" at line 7, col 4
"yet another Text" at line 11, col 4

Upvotes: 1

Pouria

Reputation: 1091

\. \* \\ escaped special characters.

\t \n \r tab, linefeed, carriage return.

\u00A9 unicode escaped ©.

For more information, and testing your regex, try this http://regexr.com/.

For what it's worth, in Python you can also trim tabs from the ends of your text with my_string.strip('\t') (note that strip only removes characters at the start and end, not in the middle), or replace them all with spaces using my_string.replace('\t', ' ').
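For illustration (the string here is made up), the difference between the two:

```python
s = '\tSecond Text\t'
print(s.strip('\t'))         # trims tabs at the ends only → 'Second Text'
print(s.replace('\t', ' '))  # replaces every tab → ' Second Text '
```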

Hope this helps.

Upvotes: 1
