Reputation: 3031
Is it possible, using pywikipedia, to get just the text of the page, without any of the internal links or templates & without the pictures etc.?
Upvotes: 1
Views: 1694
Reputation: 333
Pywikibot is able to remove any wikitext or html tags. There are two functions inside textlib:
removeHTMLParts(text: str, keeptags=['tt', 'nowiki', 'small', 'sup']) -> str:
Return text without portions where HTML markup is disabled but keeps text between html tags. For example:
from pywikibot import textlib
text = 'This is <small>small</small> text'
print(textlib.removeHTMLParts(text, keeptags=[]))
this will print:
This is small text
removeDisabledParts(text: str, tags=None, include=[], site=None) -> str:
Return text without portions where wiki markup is disabled. This removes text inside wikitext text. For example:
from pywikibot import textlib
text = 'This is <small>small</small> text'
print(textlib.removeDisabledPartsParts(text, tags=['small']))
this will print:
This is text
There are a lot of predefined tags to be removed or to be kept like
'comment', 'header', 'link', 'template'
;
default for tags parameter is ['comment', 'includeonly', 'nowiki', 'pre', 'syntaxhighlight']
Some other examples:
removeDisabledPartsParts('See [[this link]]', tags=['link'])
gives 'See '
removeDisabledPartsParts('<!-- no comments -->', tags=['comment'])
gives ''
removeDisabledPartsParts('{{Infobox}}', tags=['template'])
gives ''
, but works only for Pywikibot 6.0.0 or higher
Upvotes: 0
Reputation: 1069
You can use wikitextparser. For example:
import pywikibot
import wikitextparser
en_wikipedia = pywikibot.Site('en', 'wikipedia')
text = pywikibot.Page(en_wikipedia,'Bla Bla Bla').get()
print(wikitextparser.parse(text).sections[0].plain_text())
will give you:
"Bla Bla Bla" is a song written and recorded by Italian DJ Gigi D'Agostino. It heavily samples the vocals of "Why did you do it?" by British band Stretch. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. It was sampled in the song "Jump" from Lupe Fiasco's 2017 album Drogas Light.
Upvotes: 0
Reputation: 2616
There is a module called mwparserfromhell on Github that can get you very close to what you want depending on what you need. It has a method called strip_code(), that strips a lot of the markup.
import pywikibot
import mwparserfromhell
test_wikipedia = pywikibot.Site('en', 'test')
text = pywikibot.Page(test_wikipedia, 'Lestat_de_Lioncourt').get()
full = mwparserfromhell.parse(text)
stripped = full.strip_code()
print full
print '*******************'
print stripped
Comparison snippet:
{{db-foreign}}
<!-- Commented out because image was deleted: [[Image:lestat_tom_cruise.jpg|thumb|right|[[Tom Cruise]] as Lestat in the film ''[[Interview With The Vampire: The Vampire Chronicles]]''|{{deletable image-caption|1=Friday, 11 April 2008}}]] -->
[[Image:lestat.jpg|thumb|right|[[Stuart Townsend]] as Lestat in the film ''[[Queen of the Damned (film)|Queen of the Damned]]'']]
[[Image:Lestat IWTV.jpg|thumb|right|[[Tom Cruise]] as Lestat in the 1994 film ''[[Interview with the Vampire (film)|Interview with the Vampire]]'']]
'''Lestat de Lioncourt''' is a [[fictional character]] appearing in several [[novel]]s by [[Anne Rice]], including ''[[The Vampire Lestat]]''. He is a [[vampire]] and the main character in the majority of ''[[The Vampire Chronicles]]'', narrated in first person.
==Publication history==
Lestat de Lioncourt is the narrator and main character of the majority of the novels in Anne Rice's ''The Vampire Chronicles'' series. ''[[The Vampire Lestat]]'', the second book in the series, is presented as Lestat's autobiography, and follows his exploits from his youth in France to his early years as a vampire. Many of the other books in the series are also credited as being written by Lestat.
*******************
thumb|right|Stuart Townsend as Lestat in the film ''Queen of the Damned''
'''Lestat de Lioncourt''' is a fictional character appearing in several novels by Anne Rice, including ''The Vampire Lestat''. He is a vampire and the main character in the majority of ''The Vampire Chronicles'', narrated in first person.
Publication history
Lestat de Lioncourt is the narrator and main character of the majority of the novels in Anne Rice's ''The Vampire Chronicles'' series. ''The Vampire Lestat'', the second book in the series, is presented as Lestat's autobiography, and follows his exploits from his youth in France to his early years as a vampire. Many of the other books in the series are also credited as being written by Lestat.
Upvotes: 1
Reputation: 7231
If you mean "I want to get the wikitext only", then look at the wikipedia.Page
class, and the get
method.
import wikipedia
site = wikipedia.getSite('en', 'wikipedia')
page = wikipedia.Page(site, 'Test')
print page.get() # '''Test''', '''TEST''' or '''Tester''' may refer to:
#==Science and technology==
#* [[Concept inventory]] - an assessment to reveal student thinking on a topic.
# ...
This way you get the complete, raw wikitext from the article.
If you want to strip out the wiki syntax, as is transform [[Concept inventory]]
into Concept inventory and so on, it is going to be a bit more painful.
The main reason for this trouble is that the MediaWiki wiki syntax has no defined grammar. Which makes it really hard to parse, and to strip. I currently know no software that allows you to do this accurately. There's the MediaWiki Parser class of course, but it's PHP, a bit hard to grasp, and its purpose is very very different.
But if you only want to strip out links, or very simple wiki constructs use regexes:
text = re.sub('\[\[([^\]\|]*)\]\]', '\\1', 'Lorem ipsum [[dolor]] sit amet, consectetur adipiscing elit.')
print text #Lorem ipsum dolor sit amet, consectetur adipiscing elit.
and then for piped links:
text = re.sub('\[\[(?:[^\]\|]*)\|([^\]\|]*)\]\]', '\\1', 'Lorem ipsum [[dolor|DOLOR]] sit amet, consectetur adipiscing elit.')
print text #Lorem ipsum DOLOR sit amet, consectetur adipiscing elit.
and so on.
But for example, there is no reliable easy way to strip out nested templates from a page. And the same goes for Images that have links in their comments. It's quite hard, and involves recursively removing the most internal link and replacing it by a marker and start over. Have a look at the templateWithParams
function in wikipedia.py if you want, but it's not pretty.
Upvotes: 5