Reputation: 19826
I am taking some markdown, turning it into html, then parsing out text without tags to leave me with a clean set of alphanumeric characters only.
The problem is that the markdown has some custom components in it that I am having trouble parsing out.
Here is an example:
{{< custom type="phase1" >}}
Some Text in here (I want to keep this)
{{< /custom >}}
I want to be able to delete everything in between the {{ & }} brackets (including the brackets), while keeping the text in between the first and second instance. Essentially, I just want to be able to remove all instances of {{ *? }} in the file. There can be any number in a given file.
Here is what I have tried:
def clean_markdown(self, text_string):
    html = markdown.markdown(text_string)
    soup = BeautifulSoup(html, features="html.parser")
    # to_extract = soup.findAll('script') //Tried to extract via soup but no joy as not tags
    cleaned = re.sub(r'([^-.\s\w])+', '', soup.text)
    return cleaned
This works well for everything in the markdown except it leaves the value in the text that is between the {{ & }}. So, in this case the word "custom" will be in my cleaned text, but I don't want it to be.
As you can see, I tried to extract using beautiful soup but it didn't work as the start value ({{) is different to the end value (}})
Does anyone have any ideas how to efficiently implement a parser in Python that would clean this?
Upvotes: 1
Views: 476
Reputation: 1784
Using a regex match should work well:
def clean_markdown(self, text_string):
    html = markdown.markdown(text_string)
    soup = BeautifulSoup(html, features="html.parser")
    # re.search (rather than re.match) so the shortcode need not start the text
    match = re.search(r"\{\{.+\}\}\n(?P<text>.*)\n\{\{.+\}\}", soup.text)
    # Fall back to the unmodified text when no shortcode pair is found
    cleaned = match.group("text") if match else soup.text
    return cleaned
Upvotes: 1
Reputation: 71689
IIUC: Try this:
result = re.sub(r"\{\{.*?\}\}", "", string).strip()
print(result)
Output:
Some Text in here (I want to keep this)
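Since the question says there can be any number of these blocks in a file, here is a minimal sketch of the same non-greedy substitution applied to two blocks. The helper name `strip_shortcodes`, the sample text, and the `re.DOTALL` flag (so that a `{{ ... }}` span broken across lines is also removed) are assumptions added for illustration, not part of the original post.

```python
import re

# Non-greedy: each match stops at the first "}}" that follows a "{{".
# re.DOTALL is an added assumption so a shortcode split across lines is removed too.
SHORTCODE = re.compile(r"\{\{.*?\}\}", re.DOTALL)


def strip_shortcodes(text):
    """Remove every {{ ... }} span and trim leading/trailing whitespace."""
    return SHORTCODE.sub("", text).strip()


sample = (
    '{{< custom type="phase1" >}}\n'
    "Some Text in here (I want to keep this)\n"
    "{{< /custom >}}\n"
    '{{< custom type="phase2" >}}\n'
    "More text\n"
    "{{< /custom >}}"
)

result = strip_shortcodes(sample)
print(result)
```

The blank lines left behind where the shortcodes used to be are kept; collapse them separately if they matter for your output.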
Upvotes: 1
Reputation: 42133
If I understand what you are trying to do correctly, you should be able to use re.sub to replace all the {{...}} patterns with an empty string directly in the text_string parameter:
def clean_markdown(self, text_string):
    # Non-greedy .*? so two shortcodes on one line are matched separately
    return re.sub(r"\{\{.*?\}\}", "", text_string)
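One thing to watch with this approach is greediness: if an opening and a closing shortcode ever sit on the same line, a greedy `.*` runs from the first `{{` to the last `}}` and takes the wanted text with it. The sketch below compares the two and then applies the character filter from the question. It skips the markdown/BeautifulSoup step to stay self-contained, and the function name `clean_text` and the one-line sample are assumptions for illustration only.

```python
import re


def clean_text(text_string):
    """Strip {{ ... }} shortcodes, then keep only word chars, whitespace, '-' and '.'."""
    # Remove shortcodes first so words like "custom" never reach the filter
    without_shortcodes = re.sub(r"\{\{.*?\}\}", "", text_string)
    # Same character filter as in the question
    return re.sub(r"[^-.\s\w]+", "", without_shortcodes).strip()


# Hypothetical input with both shortcodes on a single line
line = '{{< custom type="phase1" >}}Some Text in here (I want to keep this){{< /custom >}}'

greedy = re.sub(r"\{\{.*\}\}", "", line)  # greedy: swallows the whole line
cleaned = clean_text(line)                # non-greedy: keeps the text in between

print(repr(greedy))
print(cleaned)
```

Running the substitution on the raw markdown before converting it also means the HTML conversion never sees the shortcodes at all.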
Upvotes: 1