Donal Rafferty

Reputation: 19826

How to remove text between two double brackets in Python

I am taking some Markdown, converting it to HTML, then extracting the text without tags so that I am left with a clean set of alphanumeric characters only.

The problem is that the Markdown has some custom components in it that I am having trouble parsing out.

Here is an example:

{{< custom type="phase1" >}}
    Some Text in here (I want to keep this)
{{< /custom >}}

I want to delete everything between the {{ and }} brackets (including the brackets themselves), while keeping the text between the first and second instance. Essentially, I just want to be able to remove all instances of {{ *? }} in the file. There can be any number of them in a given file.

Here is what I have tried:

def clean_markdown(self, text_string):
  html = markdown.markdown(text_string)
  soup = BeautifulSoup(html, features="html.parser")
  # to_extract = soup.findAll('script') //Tried to extract via soup but no joy as not tags
  cleaned = re.sub(r'([^-.\s\w])+', '', soup.text)
  return cleaned

This works well for everything in the Markdown, except that it leaves behind the value between the {{ and }}. So, in this case, the word "custom" ends up in my cleaned text, but I don't want it there.

As you can see, I tried to extract it using Beautiful Soup, but that didn't work because these are not tags: the start value ({{) is different from the end value (}}).

Does anyone have any ideas how to efficiently implement a parser in Python that would clean this?

Upvotes: 1

Views: 476

Answers (3)

maor10

Reputation: 1784

Using a regex match should work well:

def clean_markdown(self, text_string):
    html = markdown.markdown(text_string)
    soup = BeautifulSoup(html, features="html.parser")
    # re.match anchors at the start of the text, so this captures a single leading block
    # with one line of text between the opening and closing {{...}} lines
    match = re.match(r"{{.+}}\n(?P<text>.*)\n{{.+}}", soup.text)
    # Return the text unchanged if it does not start with a {{...}} block
    return match.group("text") if match else soup.text
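
If the file contains several blocks, one possible variation is to iterate over every opening/closing pair instead of matching once at the start. This is only a sketch: BLOCK and extract_blocks are hypothetical names, and the pattern assumes the {{< name ... >}} / {{< /name >}} form shown in the question. Note that it keeps only the text inside blocks and drops anything outside them.

```python
import re

# Hypothetical pattern: an opening {{< name ... >}}, the inner text,
# and the matching closing {{< /name >}}. DOTALL lets the inner text span lines.
BLOCK = re.compile(
    r"\{\{<\s*(\w+).*?>\}\}(?P<text>.*?)\{\{<\s*/\1\s*>\}\}",
    re.DOTALL,
)

def extract_blocks(text):
    """Return the stripped inner text of every {{< name >}}...{{< /name >}} block."""
    return [m.group("text").strip() for m in BLOCK.finditer(text)]

sample = (
    '{{< custom type="phase1" >}}\n'
    "    Some Text in here (I want to keep this)\n"
    "{{< /custom >}}\n"
    "\n"
    '{{< custom type="phase2" >}}\n'
    "    More text\n"
    "{{< /custom >}}\n"
)

print(extract_blocks(sample))
# ['Some Text in here (I want to keep this)', 'More text']
```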

Upvotes: 1

Shubham Sharma

Reputation: 71689

If I understand correctly, try this (here string holds the raw Markdown text):

result = re.sub(r"\{\{.*?\}\}", "", string).strip()
print(result)

Output:

Some Text in here (I want to keep this)
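
The ? after .* matters here. As a small illustration (the line variable is a made-up input, not taken from the question), the greedy form swallows everything from the first {{ to the last }} when both brackets sit on one line, while the lazy form stops at the nearest }}:

```python
import re

# Made-up single-line input with an opening and a closing shortcode.
line = "{{< custom >}} keep me {{< /custom >}}"

# Greedy: .* runs to the LAST }} on the line, so the inner text is lost.
greedy = re.sub(r"\{\{.*\}\}", "", line).strip()

# Lazy: .*? stops at the FIRST }} after each {{, so only the shortcodes go.
lazy = re.sub(r"\{\{.*?\}\}", "", line).strip()

print(repr(greedy))  # ''
print(repr(lazy))    # 'keep me'
```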

Upvotes: 1

Alain T.

Reputation: 42133

If I understand what you are trying to do correctly, you should be able to use re.sub to replace all the {{...}} patterns with an empty string directly in the text_string parameter:

def clean_markdown(self, text_string):
    # Non-greedy .*? so that two {{...}} on the same line do not swallow the text between them
    return re.sub(r"{{.*?}}", "", text_string)
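
To show how this could slot in front of the asker's existing cleanup step, here is a minimal standard-library sketch. The names SHORTCODE, NON_TEXT, strip_shortcodes and clean_text are hypothetical; the Markdown-to-HTML and Beautiful Soup steps from the question are left out and would sit between the two substitutions. It assumes a shortcode never contains }} internally, and uses re.DOTALL in case a shortcode spans several lines.

```python
import re

# Lazy match from each {{ to the nearest }}; DOTALL allows multi-line shortcodes.
SHORTCODE = re.compile(r"\{\{.*?\}\}", re.DOTALL)

# The cleanup pattern from the question: drop anything that is not
# a hyphen, a dot, whitespace, or a word character.
NON_TEXT = re.compile(r"[^-.\s\w]+")

def strip_shortcodes(text):
    """Remove every {{...}} span, brackets included."""
    return SHORTCODE.sub("", text)

def clean_text(text):
    """Strip shortcodes first, then apply the question's character cleanup."""
    return NON_TEXT.sub("", strip_shortcodes(text)).strip()

example = (
    '{{< custom type="phase1" >}}\n'
    "    Some Text in here (I want to keep this)\n"
    "{{< /custom >}}\n"
)

print(clean_text(example))
# Some Text in here I want to keep this
```

Removing the shortcodes before any other processing means the word "custom" never reaches the later cleanup stage.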

Upvotes: 1
