Reputation: 6122
I am extracting Wikipedia pages and writing them to a file using Python. Currently I am doing this:
import wikipedia

keyWords = ["kinetic energy", "gravitational force"]
for word in keyWords:
    # Fetch the page and print its plain-text content
    topic = wikipedia.page(word)
    text = topic.content
    print(text)
But the content is badly formatted wherever it displays formulas. For example,
F = ma becomes something like:
F
m
a
Can you help me figure out how I can get the mathematical formulas cleanly? Thank you!
Upvotes: 1
Views: 1518
Reputation: 1006
The plain-text content returned by the wikipedia module contains no LaTeX. To extract all the equations from a Wikipedia page, you can instead parse the page's HTML with the BeautifulSoup package.
import wikipedia
from bs4 import BeautifulSoup

topic = wikipedia.page('kinetic energy')
# Each <annotation> element holds the LaTeX source of one formula
equations = BeautifulSoup(topic.html(), 'html.parser').find_all('annotation')
You can then extract the source of any given equation via
equations[0].text
#'{\\displaystyle {\\vec {F}}=m{\\vec {a}}}'
or
equations[0].text.split('{\\displaystyle ')[1][:-1]
#'{\\vec {F}}=m{\\vec {a}}'
though this is still not an entirely useful format. Note also that there tend to be many "one-letter" equations that arise from inline references to a variable, so this might not be the best technique. What are you trying to accomplish?
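If the single-symbol entries get in the way, one option is to strip the wrapper and filter them out afterwards. The following is only a rough sketch: the helper names (strip_displaystyle, is_single_symbol, clean_equations) are made up for illustration, and the "single symbol" test is a simple heuristic that assumes the annotation text looks like the examples above. It works on plain strings, so it needs no network access.

```python
import re

# Assumed shape of the annotation text: '{\displaystyle <latex>}'
DISPLAYSTYLE = re.compile(r'^\{\\displaystyle\s*(.*)\}$', re.DOTALL)


def strip_displaystyle(latex):
    """Remove the outer {\\displaystyle ...} wrapper if it is present."""
    latex = latex.strip()
    match = DISPLAYSTYLE.match(latex)
    return match.group(1).strip() if match else latex


def is_single_symbol(latex):
    """Heuristic: a lone letter (m) or a lone command (\\alpha)."""
    return re.fullmatch(r'[A-Za-z]|\\[A-Za-z]+', latex) is not None


def clean_equations(raw_equations):
    """Unwrap each equation and drop the single-symbol ones."""
    cleaned = [strip_displaystyle(eq) for eq in raw_equations]
    return [eq for eq in cleaned if not is_single_symbol(eq)]


# Hand-written sample strings in the same form as equations[i].text
raw = [
    '{\\displaystyle {\\vec {F}}=m{\\vec {a}}}',
    '{\\displaystyle m}',
    '{\\displaystyle \\alpha }',
]
print(clean_equations(raw))
```

With the equations list from the snippet above, this could be called as clean_equations([eq.text for eq in equations]). A regular expression is used instead of split() so that strings without the wrapper pass through unchanged rather than raising an IndexError.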
Upvotes: 2