I'm looking for something like this:
data = '''
**this is some markdown**
blah blah blah
![image here](http://somewebsite.com/image1.jpg)
![another image here](http://anotherwebsite.com/image2.jpg)
'''
print(get_images_url_from_markdown(data))
that returns a list of image URLs from the text:
['http://somewebsite.com/image1.jpg', 'http://anotherwebsite.com/image2.jpg']
Is there anything available, or do I have to scrape Markdown myself with BeautifulSoup?
Upvotes: 6
Views: 4058
Python-Markdown has an extensive Extension API. In fact, the Table of Contents Extension does essentially what you want with headings (instead of images) plus a bunch of other stuff you don't need (like adding unique id attributes and building a nested list for the TOC).
After the document is parsed, it is contained in an ElementTree object and you can use a treeprocessor to extract the data you want before the tree is serialized to text. Just be aware that if you have included any images as raw HTML, this will fail to find those images (you would need to parse the HTML output and extract in that case).
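For the raw-HTML case, a minimal sketch using only the standard library's `html.parser` could pull the `src` attributes out of the rendered output. The names `ImgSrcCollector` and `image_urls_from_html` are hypothetical, and the sketch assumes you already have the HTML string returned by the Markdown conversion:

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Called for <img ...>; self-closing <img ... /> tags are routed
        # here too by HTMLParser's default handle_startendtag.
        if tag == 'img':
            src = dict(attrs).get('src')
            if src is not None:
                self.sources.append(src)


def image_urls_from_html(html):
    # Hypothetical helper: feed the rendered HTML and return the collected URLs.
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources
```

Because it works on the final HTML, this picks up images written as Markdown and images written as raw HTML alike, at the cost of parsing the document a second time.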
Start off by following this tutorial, except that you will need to create a treeprocessor rather than an inline Pattern. You should end up with something like this:
import markdown
from markdown.treeprocessors import Treeprocessor
from markdown.extensions import Extension

# First create the treeprocessor
class ImgExtractor(Treeprocessor):
    def run(self, doc):
        "Find all images and append to markdown.images. "
        self.md.images = []
        for image in doc.findall('.//img'):
            self.md.images.append(image.get('src'))

# Then tell markdown about it
class ImgExtExtension(Extension):
    def extendMarkdown(self, md):
        img_ext = ImgExtractor(md)
        md.treeprocessors.register(img_ext, 'img_ext', 15)

# Finally create an instance of the Markdown class with the new extension
md = markdown.Markdown(extensions=[ImgExtExtension()])

# Now let's test it out:
data = '''
**this is some markdown**
blah blah blah
![image here](http://somewebsite.com/image1.jpg)
![another image here](http://anotherwebsite.com/image2.jpg)
'''
html = md.convert(data)
print(md.images)
The above outputs:
['http://somewebsite.com/image1.jpg', 'http://anotherwebsite.com/image2.jpg']
If you really want a function which returns the list, just wrap that all up in one and you're good to go.
Upvotes: 15
Using BeautifulSoup makes this quite simple.
import markdown
from bs4 import BeautifulSoup

filename = "myfile.md"

# Convert the Markdown file to HTML
with open(filename, 'r') as f:
    html = markdown.markdown(f.read())

# Collect the src attribute of every <img> tag in the rendered HTML
soup = BeautifulSoup(html, 'html.parser')
img = []
for link in soup.find_all('img'):
    img.append(link.get('src'))
print(img)
Upvotes: 0
If you only need the URLs and won't benefit from the functionality of a full-blown Markdown parser, I'd suggest implementing a simple parsing algorithm yourself. That way you don't rely on an external dependency, and it will most likely be faster than a full-blown parser (though keep in mind that pure Python is slow).
Something like this:
from collections.abc import Iterable
import urllib.parse


def find_image_urls_in_markdown(s: str) -> Iterable[str]:
    # The following code:
    # - Finds substrings that look like "![...](...)".
    # - Stores the contents of "[...]" in `alt_text_chars`.
    # - Stores the contents of "(...)" in `url_chars`.
    # - Joins `alt_text_chars` together to a string that is used
    #   as the input for a recursive call to the generator,
    #   thus finding nested image URLs in the alt text.
    # - **Joins `url_chars` together and 'yields' the (trimmed)
    #   result as the generator's output**.
    context = ""
    alt_text_chars = []
    url_chars = []
    for char in s:
        if context == "":
            # Ensure substring starts with '!'.
            if char == "!":
                context = "after_exclamation"
        elif context == "after_exclamation":
            # Ensure '!' is followed by '['.
            if char == "[":
                context = "alt_text"
            elif char != "!":
                # Otherwise reset, so that an ordinary link after
                # a stray '!' is not mistaken for an image.
                context = ""
        elif context == "alt_text":
            # Ensure alt text section ends with ']' in '![...]'.
            if char == "]":
                context = "alt_text_end"
            else:
                # Record alt text in `alt_text_chars`.
                alt_text_chars.append(char)
        elif context == "alt_text_end":
            # Ensure alt text section '![...]' is followed by '('.
            if char == "(":
                context = "url"
            else:
                # Otherwise, discard recorded alt text
                # and reset parser to initial state.
                alt_text_chars.clear()
                context = ""
        elif context == "url":
            # Ensure URL section ends with ')' in '![...](...)'.
            if char == ")":
                # 'yield' URLs that were contained in the alt text.
                yield from find_image_urls_in_markdown("".join(alt_text_chars))
                # 'yield' the URL in the current substring.
                yield "".join(url_chars).strip()
                # Use next line instead to escape special chars
                #yield urllib.parse.quote("".join(url_chars).strip().encode('utf-8'), safe="/:")
                # Reset the parser to its initial state.
                alt_text_chars.clear()
                url_chars.clear()
                context = ""
            else:
                # Record URL in `url_chars`.
                url_chars.append(char)


my_markdown_string: str = "..."
image_url_list: list[str] = list(find_image_urls_in_markdown(my_markdown_string))
Gist here: https://gist.github.com/0scarB/ed330536fd098ccdca9b7a8b09d9d4a0.
Note: The code will not work for reference-style images as documented here https://daringfireball.net/projects/markdown/syntax#img.
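If you do need reference-style images, a rough sketch along these lines might be a starting point. It is not part of the gist: the function name `find_reference_image_urls` and both regular expressions are assumptions made for illustration, and they only cover the simple forms `![alt][id]`, `![alt][]` and `[id]: url "optional title"`:

```python
import re

# Matches a definition line such as:  [id]: http://example.com/a.jpg "Title"
# (up to three leading spaces, URL optionally wrapped in angle brackets).
_DEFINITION = re.compile(r'^[ ]{0,3}\[([^\]]+)\]:\s*<?([^\s>]+)>?', re.MULTILINE)

# Matches ![alt][id] and the implicit form ![alt][], where the alt text
# doubles as the label.
_REFERENCE_IMAGE = re.compile(r'!\[([^\]]*)\]\s?\[([^\]]*)\]')


def find_reference_image_urls(text):
    # Labels are not case sensitive, so normalise them to lower case.
    definitions = {label.lower(): url for label, url in _DEFINITION.findall(text)}
    urls = []
    for alt, label in _REFERENCE_IMAGE.findall(text):
        key = (label or alt).lower()
        # Silently skip references that have no matching definition.
        if key in definitions:
            urls.append(definitions[key])
    return urls
```

The URLs come back in the order the images appear in the text, not the order of the definitions, and inline images are ignored, so the results would need to be combined with those of the inline parser.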
(I know this question is pretty old - just recently ran into the same problem myself)
Upvotes: 1