Reputation: 13
I need to extract the text (a header and its paragraphs) that matches a level-1 header string passed to a Python function. Below is an example of the Markdown text I'm working with:
# My first header
## Nec sic igni ad ad aventi
Lorem markdownum quantumque nunc, fine superi sagittis, haut regalis attollo,
ora inferius, mensor deam? Sedili quoque tauri. Quo limite ducem.
1. Arva fecit partes tosta
2. Insignia est ausae ut ut ait
3. O summa saepe
Sic ipsos, Phlegethontide nisi poterat neque quos tum partes rapitur. Filius
utraque: glande, ut exiles terram fiducia coeunt. Et caelo legit multis,
plangorem altoque; et iamque nec. Sanguine corpora prora quicquid insolida in
Parin: stupet est posses nos mater temptat, gemit num.
# My second header
## Primordia metuam his dixerat talaria cognoscenda
Lorem markdownum revulsum dilexit contra. Qui seu supplex Themis profuit quoque
Hyperionis, omnibus aesculus signa medendi. Aspiciunt rigidique finibus ducunt
postquam, huic postera lignum, properent.
- Nostro purgamina capitque longis
- Virtus suo moenibus
- Byblida longum pudibunda referre
- Via in ab vulneribus petita mirantur quamquam
- Et vela
- Nondum sacer meminisse Dircen novas dumque
For example, I need to extract all the text under the header "My second header" from the text above.
I'm trying to do this with a regular expression, but I haven't found a correct pattern that solves my problem.
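import re

def findHeader(header):
    # placeholder: the pattern that should match the section for `header`
    r = re.compile(r"the regular expression")
    print(r.findall(text))

findHeader("My second header")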
findHeader output:
# My second header
## Primordia metuam his dixerat talaria cognoscenda
Lorem markdownum revulsum dilexit contra. Qui seu supplex Themis profuit quoque
Hyperionis, omnibus aesculus signa medendi. Aspiciunt rigidique finibus ducunt
postquam, huic postera lignum, properent.
- Nostro purgamina capitque longis
- Virtus suo moenibus
- Byblida longum pudibunda referre
- Via in ab vulneribus petita mirantur quamquam
- Et vela
- Nondum sacer meminisse Dircen novas dumque
Upvotes: 0
Views: 754
Reputation: 91488
This does the job:
import re
text = """
# My first header
## Nec sic igni ad ad aventi
Lorem markdownum quantumque nunc, fine superi sagittis, haut regalis attollo,
ora inferius, mensor deam? Sedili quoque tauri. Quo limite ducem.
1. Arva fecit partes tosta
2. Insignia est ausae ut ut ait
3. O summa saepe
Sic ipsos, Phlegethontide nisi poterat neque quos tum partes rapitur. Filius
utraque: glande.
# My second header
## Primordia metuam his dixerat talaria cognoscenda
Lorem markdownum revulsum dilexit contra. Qui seu supplex Themis profuit quoque
Hyperionis, omnibus aesculus signa medendi.
- Nostro purgamina capitque longis
- Virtus suo moenibus
# My third header
## Primordia metuam his dixerat talaria cognoscenda
Lorem markdownum revulsum dilexit contra. Qui seu supplex Themis profuit quoque
postquam, huic postera lignum, properent.
"""
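def findHeader(search):
    # re.escape keeps any regex metacharacters in the header text literal.
    # re.DOTALL has the same effect as the inline (?s) flag; Python 3.11+
    # raises an error when (?s) is not at the very start of the pattern.
    r = re.compile(r"(?<!#)# " + re.escape(search) + r"(?:(?!(?<!#)# ).)+", re.DOTALL)
    return r.findall(text)

print(findHeader("My second header"))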
Output:
['# My second header\n\n## Primordia metuam his dixerat talaria cognoscenda\n\nLorem markdownum revulsum dilexit contra. Qui seu supplex Themis profuit quoque\nHyperionis, omnibus aesculus signa medendi.\n\n- Nostro purgamina capitque longis\n- Virtus suo moenibus\n\n']
Explanation:
r"              # raw string
    (?<!#)      # negative lookbehind, make sure there is no # just before
    #           # a # and a space
"               # end string
+               # concat
search          # header to be searched
+               # concat
r"              # raw string
    (?s)        # . matches newline (same effect as the re.DOTALL flag)
    (?:         # non-capturing group (tempered greedy token)
        (?!     # negative lookahead, make sure what follows is not:
            (?<!#)  # negative lookbehind, make sure there is no # just before
            #       # a # and a space
        )       # end lookahead
        .       # any character including newline
    )+          # end group, may appear 1 or more times
"               # end string
Upvotes: 1
Reputation: 301
If I understand correctly, you are trying to capture only one # symbol at the beginning of each line.
The regular expression that helps you solve the issue is: r"(?:^|\s)(?:[#]\ )(.*\n+##\ ([^#]*\n)+)". The parentheses delimit the capturing and non-capturing groups. The first group, (?:^|\s), is non-capturing because it starts with ?:. It requires the match to begin at the start of the text or right after a whitespace character, such as the newline that ends the previous line. In the second group, (?:[#]\ ), [#] matches exactly one # character and \ (a backslash followed by a space) matches the space between the hash and the h1 header text. Finally, to match every character up to the end of the line you use the special character ., which matches any character except a newline, followed by a quantifier (* or +) that repeats it.
This is probably the code snippet you are looking for. I tested it with the same sample text you used.
import re
text = """
# My first header
## Nec sic igni ad ad aventi
Lorem markdownum quantumque nunc, fine superi sagittis, haut regalis attollo,
ora inferius, mensor deam? Sedili quoque tauri. Quo limite ducem.
1. Arva fecit partes tosta
2. Insignia est ausae ut ut ait
3. O summa saepe
Sic ipsos, Phlegethontide nisi poterat neque quos tum partes rapitur. Filius
utraque: glande, ut exiles terram fiducia coeunt. Et caelo legit multis,
plangorem altoque; et iamque nec. Sanguine corpora prora quicquid insolida in
Parin: stupet est posses nos mater temptat, gemit num.
# My second header
## Primordia metuam his dixerat talaria cognoscenda
Lorem markdownum revulsum dilexit contra. Qui seu supplex Themis profuit quoque
Hyperionis, omnibus aesculus signa medendi. Aspiciunt rigidique finibus ducunt
postquam, huic postera lignum, properent.
- Nostro purgamina capitque longis
- Virtus suo moenibus
- Byblida longum pudibunda referre
- Via in ab vulneribus petita mirantur quamquam
- Et vela
- Nondum sacer meminisse Dircen novas dumque
"""
r = re.compile(r"(?:^|\s)(?:[#]\ )(.*\n+##\ ([^#]*\n)+)")
print(r.findall(text))
If you just want to extract the header text itself, you can use this regex: r"(?:^|\s)(?:[#]\ )(.+)"
It is similar to the previous one, but the capturing group stops at the end of the header line and leaves out the # symbol.
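As a rough illustration of extracting only the level-1 header titles, here is a small sketch. The helper name header_titles and the SAMPLE text are hypothetical, and it uses ^ with re.MULTILINE in place of the (?:^|\s) group, which is an assumption about what is wanted and not the exact pattern above.

```python
import re

# Hypothetical sample text with two level-1 headers and one level-2 header.
SAMPLE = (
    "# My first header\n\n## A subsection\n\nSome text.\n\n"
    "# My second header\n\nMore text.\n"
)

def header_titles(text):
    """Return the titles of all level-1 headers, without the leading '# '."""
    # With re.MULTILINE, ^ and $ match at the start and end of every line,
    # so only lines that begin with a single '#' and a space are captured.
    return re.findall(r"^# (.+)$", text, re.MULTILINE)

print(header_titles(SAMPLE))  # ['My first header', 'My second header']
```

The level-2 line is skipped because its second character is another # and not a space.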
Upvotes: 0