Dreams

Reputation: 6122

Extracting formulas from Wikipedia pages - Python

I am extracting Wikipedia pages and writing them to a file using Python. Currently I am doing this:

import wikipedia

keyWords = ["kinetic energy", "gravitational force"]

for word in keyWords:
    topic = wikipedia.page(word)  # fetch the page by title
    text = topic.content          # plain-text content of the page
    print(text)

But the content is badly formatted wherever it displays formulas. For example,

F = ma becomes something like:

F

   m

a

Can you help me figure out how to get the mathematical formulas in a clean form? Thank you!

Upvotes: 1

Views: 1518

Answers (1)

John Karasinski

Reputation: 1006

The plain-text content returned by the wikipedia module contains no LaTeX. To extract all the equations from a Wikipedia page, you can parse the page's HTML with the BeautifulSoup package instead:

import wikipedia
from bs4 import BeautifulSoup

topic = wikipedia.page('kinetic energy')
# Name the parser explicitly so BeautifulSoup does not have to guess one
equations = BeautifulSoup(topic.html(), 'html.parser').find_all('annotation')

You can then extract the source of any given equation via

equations[0].text
#'{\\displaystyle {\\vec {F}}=m{\\vec {a}}}'

or

equations[0].text.split('{\\displaystyle ')[1][:-1]
#'{\\vec {F}}=m{\\vec {a}}'

though this is still not an entirely useful format. Note also that a page tends to contain many "one letter" equations that come from references to a single variable, so this might not be the best technique. What are you trying to accomplish?
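If the one-letter entries are the main nuisance, one option is to strip the wrapper and then keep only sources that contain a relation sign. This is a rough sketch, not a definitive filter: the helper names (`strip_displaystyle`, `looks_like_equation`, `clean_equations`) and the list of relation signs are my own assumptions, and `sample` is hand-written input (the first entry is the string shown above, the others are made up) rather than real page output.

```python
PREFIX = '{\\displaystyle '

def strip_displaystyle(latex):
    """Remove the outer displaystyle wrapper if it is present."""
    s = latex.strip()
    if s.startswith(PREFIX) and s.endswith('}'):
        return s[len(PREFIX):-1].strip()
    return s

def looks_like_equation(latex):
    """Heuristic (an assumption): a real equation contains a relation sign."""
    signs = ('=', '\\approx', '\\leq', '\\geq', '\\propto')
    return any(sign in latex for sign in signs)

def clean_equations(sources):
    """Strip the wrapper from each source, then drop bare variable references."""
    cleaned = [strip_displaystyle(s) for s in sources]
    return [s for s in cleaned if looks_like_equation(s)]

# Hand-written sample input, not fetched from Wikipedia
sample = [
    '{\\displaystyle {\\vec {F}}=m{\\vec {a}}}',
    '{\\displaystyle m}',
    '{\\displaystyle v}',
]
print(clean_equations(sample))
```

With the code from the answer you would pass in `[e.text for e in equations]` instead of `sample`. The relation-sign check will also drop standalone expressions that have no `=`, so adjust it to whatever you actually need.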

Upvotes: 2
