Federico Fiore

Reputation: 13

How to extract text for "# Heading level 1" (header and its paragraphs) from markdown string/document with python?

I need to extract the text (the header and its paragraphs) that belongs to a level-1 header whose title is passed to a Python function. Below is an example of the Markdown text I'm working with:

# My first header

## Nec sic igni ad ad aventi

Lorem markdownum quantumque nunc, fine superi sagittis, haut regalis attollo,
ora inferius, mensor deam? Sedili quoque tauri. Quo limite ducem.

1. Arva fecit partes tosta
2. Insignia est ausae ut ut ait
3. O summa saepe

Sic ipsos, Phlegethontide nisi poterat neque quos tum partes rapitur. Filius
utraque: glande, ut exiles terram fiducia coeunt. Et caelo legit multis,
plangorem altoque; et iamque nec. Sanguine corpora prora quicquid insolida in
Parin: stupet est posses nos mater temptat, gemit num.

# My second header

## Primordia metuam his dixerat talaria cognoscenda

Lorem markdownum revulsum dilexit contra. Qui seu supplex Themis profuit quoque
Hyperionis, omnibus aesculus signa medendi. Aspiciunt rigidique finibus ducunt
postquam, huic postera lignum, properent.

- Nostro purgamina capitque longis
- Virtus suo moenibus
- Byblida longum pudibunda referre
- Via in ab vulneribus petita mirantur quamquam
- Et vela
- Nondum sacer meminisse Dircen novas dumque

For example, I need to extract all the text under the header "My second header" from the text above.

I'm trying to use a regular expression, but I haven't found a correct pattern to solve my problem.

import re

def findHeader(header):
    # placeholder: this is the pattern I can't work out
    r = re.compile(r"the regular expression")
    print(r.findall(text))

findHeader("My second header")

Expected findHeader output:

# My second header

## Primordia metuam his dixerat talaria cognoscenda

Lorem markdownum revulsum dilexit contra. Qui seu supplex Themis profuit quoque
Hyperionis, omnibus aesculus signa medendi. Aspiciunt rigidique finibus ducunt
postquam, huic postera lignum, properent.

- Nostro purgamina capitque longis
- Virtus suo moenibus
- Byblida longum pudibunda referre
- Via in ab vulneribus petita mirantur quamquam
- Et vela
- Nondum sacer meminisse Dircen novas dumque

Upvotes: 0

Views: 754

Answers (2)

Toto

Reputation: 91488

This does the job:

import re

text = """
# My first header

## Nec sic igni ad ad aventi

Lorem markdownum quantumque nunc, fine superi sagittis, haut regalis attollo,
ora inferius, mensor deam? Sedili quoque tauri. Quo limite ducem.

1. Arva fecit partes tosta
2. Insignia est ausae ut ut ait
3. O summa saepe

Sic ipsos, Phlegethontide nisi poterat neque quos tum partes rapitur. Filius
utraque: glande.

# My second header

## Primordia metuam his dixerat talaria cognoscenda

Lorem markdownum revulsum dilexit contra. Qui seu supplex Themis profuit quoque
Hyperionis, omnibus aesculus signa medendi.

- Nostro purgamina capitque longis
- Virtus suo moenibus

# My third header

## Primordia metuam his dixerat talaria cognoscenda

Lorem markdownum revulsum dilexit contra. Qui seu supplex Themis profuit quoque
postquam, huic postera lignum, properent.

"""
def findHeader(search):
    # (?s) must lead the pattern: Python 3.11+ rejects global flags placed mid-pattern.
    # re.escape() keeps any regex metacharacters in the header text literal.
    r = re.compile(r"(?s)(?<!#)# " + re.escape(search) + r"(?:(?!(?<!#)# ).)+")
    return r.findall(text)

print(findHeader("My second header"))

Output:

['# My second header\n\n## Primordia metuam his dixerat talaria cognoscenda\n\nLorem markdownum revulsum dilexit contra. Qui seu supplex Themis profuit quoque\nHyperionis, omnibus aesculus signa medendi.\n\n- Nostro purgamina capitque longis\n- Virtus suo moenibus\n\n']

Explanation:

r"          # raw string
    (?s)        # . matches newline
    (?<!#)      # negative lookbehind, make sure there is no # before
    #           # a # and a space
"           # end string
+           # concat
    re.escape(search)   # header to be searched, escaped so it is matched literally
+           # concat
r"          # raw string
    (?:         # non-capturing group (tempered greedy token)
        (?!         # negative lookahead, make sure what follows is not:
            (?<!#)      # negative lookbehind, make sure there is no # before
            #           # a # and a space
        )           # end lookahead
        .           # any character including newline
    )+          # end group, may appear 1 or more times
"           # end string
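
A minimal sketch of the same tempered-greedy-token idea, packaged so that the text is passed in instead of being read from a global. The name `extract_section` and the `sample` string are hypothetical, introduced only for illustration; the lookahead after the header is an added assumption so that a title does not match as a prefix of a longer one.

```python
import re

def extract_section(text, header):
    """Return the level-1 section titled `header`, or None if it is absent."""
    pattern = re.compile(
        r"(?<!#)# " + re.escape(header) + r"(?=\n|\Z)"  # the exact heading line
        r"(?:(?!(?<!#)# ).)*",                          # consume until the next lone "# "
        re.DOTALL,                                      # let . cross line breaks
    )
    match = pattern.search(text)
    return match.group(0) if match else None

# Hypothetical sample, much shorter than the text in the question
sample = "# One\n\n## Sub\n\nfirst body\n\n# Two\n\n## Sub\n\nsecond body\n"
section = extract_section(sample, "Two")
print(section)
```

The "## Sub" lines do not end the section, because the lookbehind rejects a # that is preceded by another # and the lookahead requires a space right after the single #.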

Upvotes: 1

Federico Viscioletti

Reputation: 301

If I understand correctly, you want to match headings that start with exactly one # symbol.

A regular expression that does this is: r"(?:^|\s)(?:[#]\ )(.*\n+##\ ([^#]*\n)+)". The parentheses delimit the capturing and non-capturing groups. The first group, (?:^|\s), is non-capturing because it starts with ?:. It requires the match to begin at the start of the string or after a whitespace character. In the second group, (?:[#]\ ), [#] matches exactly one # character and \ matches the space between the hash and the h1 text. Finally, to match the rest of the heading line you use the special character ., which matches any character except a newline, followed by a quantifier (* or +) that repeats it.

This is probably the code snippet you are looking for; I tested it with the same sample text you used.

import re

text = """
# My first header

## Nec sic igni ad ad aventi

Lorem markdownum quantumque nunc, fine superi sagittis, haut regalis attollo,
ora inferius, mensor deam? Sedili quoque tauri. Quo limite ducem.

1. Arva fecit partes tosta
2. Insignia est ausae ut ut ait
3. O summa saepe

Sic ipsos, Phlegethontide nisi poterat neque quos tum partes rapitur. Filius
utraque: glande, ut exiles terram fiducia coeunt. Et caelo legit multis,
plangorem altoque; et iamque nec. Sanguine corpora prora quicquid insolida in
Parin: stupet est posses nos mater temptat, gemit num.

# My second header

## Primordia metuam his dixerat talaria cognoscenda

Lorem markdownum revulsum dilexit contra. Qui seu supplex Themis profuit quoque
Hyperionis, omnibus aesculus signa medendi. Aspiciunt rigidique finibus ducunt
postquam, huic postera lignum, properent.

- Nostro purgamina capitque longis
- Virtus suo moenibus
- Byblida longum pudibunda referre
- Via in ab vulneribus petita mirantur quamquam
- Et vela
- Nondum sacer meminisse Dircen novas dumque
"""

r = re.compile(r"(?:^|\s)(?:[#]\ )(.*\n+##\ ([^#]*\n)+)")
print(r.findall(text))
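
Since the pattern contains two capturing groups, findall returns a list of tuples rather than plain strings. A small sketch on a shortened sample shows the shape of the result; the `sample` string and the variable names are hypothetical and not part of the question.

```python
import re

pattern = re.compile(r"(?:^|\s)(?:[#]\ )(.*\n+##\ ([^#]*\n)+)")

# Hypothetical sample with two level-1 sections
sample = "# One\n\n## Sub A\n\nfirst body\n\n# Two\n\n## Sub B\n\nsecond body\n"
matches = pattern.findall(sample)

for whole, tail in matches:
    # whole: heading text plus everything up to the next "#"
    # tail:  what the inner group ([^#]*\n) captured on its last repetition
    print(repr(whole))
    print(repr(tail))
```

On this sample only the first section comes back, as far as I can tell because the first match consumes the newline in front of "# Two", which leaves nothing for (?:^|\s) to match there.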

If you only want the text of the level-1 headings themselves, you can use this regex: r"(?:^|\s)(?:[#]\ )(.+)". It is similar to the previous one, but it captures only the rest of the heading line and leaves the # symbol outside the capturing group.
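
A quick sketch of this shorter pattern, again on a hypothetical sample that is not from the question. It collects the text of each level-1 heading and skips the ## lines.

```python
import re

heading_pattern = re.compile(r"(?:^|\s)(?:[#]\ )(.+)")

# Hypothetical sample with two level-1 headings and two level-2 headings
sample = "# One\n\n## Sub A\n\nfirst body\n\n# Two\n\n## Sub B\n\nsecond body\n"
titles = heading_pattern.findall(sample)
print(titles)
```

With a single capturing group, findall returns a flat list of strings, one per matched heading.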

Upvotes: 0
