James Wanchai

Reputation: 3031

Can I use pywikipedia to get just the text of a page?

Is it possible, using pywikipedia, to get just the text of the page, without any of the internal links or templates & without the pictures etc.?

Upvotes: 1

Views: 1694

Answers (4)

xqt

Reputation: 333

Pywikibot is able to remove any wikitext or html tags. There are two functions inside textlib:

  1. removeHTMLParts(text: str, keeptags=['tt', 'nowiki', 'small', 'sup']) -> str:

    Return text with HTML markup removed (except tags listed in keeptags), keeping the text between the tags. For example:

     from pywikibot import textlib
     text = 'This is <small>small</small> text'
     print(textlib.removeHTMLParts(text, keeptags=[]))
    

    this will print:

     This is small text
    
  2. removeDisabledParts(text: str, tags=None, include=[], site=None) -> str:

    Return text without portions where wiki markup is disabled, i.e. remove the text inside the given wikitext tags. For example:

     from pywikibot import textlib
     text = 'This is <small>small</small> text'
     print(textlib.removeDisabledParts(text, tags=['small']))
    

    this will print:

     This is  text
    

    There are many predefined tags that can be removed or kept, such as 'comment', 'header', 'link' and 'template'.

    The default for the tags parameter is ['comment', 'includeonly', 'nowiki', 'pre', 'syntaxhighlight'].

    Some other examples:

     removeDisabledParts('See [[this link]]', tags=['link'])        # gives 'See '
     removeDisabledParts('<!-- no comments -->', tags=['comment'])  # gives ''
     removeDisabledParts('{{Infobox}}', tags=['template'])          # gives '', but works only for Pywikibot 6.0.0 or higher

Upvotes: 0

oktieh

Reputation: 1069

You can use wikitextparser. For example:

import pywikibot
import wikitextparser
en_wikipedia = pywikibot.Site('en', 'wikipedia')
text = pywikibot.Page(en_wikipedia,'Bla Bla Bla').get()
print(wikitextparser.parse(text).sections[0].plain_text())

will give you:

"Bla Bla Bla" is a song written and recorded by Italian DJ Gigi D'Agostino. It heavily samples the vocals of "Why did you do it?" by British band Stretch. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. It was sampled in the song "Jump" from Lupe Fiasco's 2017 album Drogas Light.

Upvotes: 0

notconfusing

Reputation: 2616

There is a module called mwparserfromhell on GitHub that can get you very close to what you want, depending on what you need. It has a method called strip_code() that strips a lot of the markup.

import pywikibot
import mwparserfromhell

test_wikipedia = pywikibot.Site('en', 'test')
text = pywikibot.Page(test_wikipedia, 'Lestat_de_Lioncourt').get()

full = mwparserfromhell.parse(text)
stripped = full.strip_code()

print(full)
print('*******************')
print(stripped)

Comparison snippet:

{{db-foreign}}
<!--  Commented out because image was deleted: [[Image:lestat_tom_cruise.jpg|thumb|right|[[Tom Cruise]] as Lestat in the film ''[[Interview With The Vampire: The Vampire Chronicles]]''|{{deletable image-caption|1=Friday, 11 April 2008}}]] -->

[[Image:lestat.jpg|thumb|right|[[Stuart Townsend]] as Lestat in the film ''[[Queen of the Damned (film)|Queen of the Damned]]'']]

[[Image:Lestat IWTV.jpg|thumb|right|[[Tom Cruise]] as Lestat in the 1994 film ''[[Interview with the Vampire (film)|Interview with the Vampire]]'']]

'''Lestat de Lioncourt''' is a [[fictional character]] appearing in several [[novel]]s by [[Anne Rice]], including ''[[The Vampire Lestat]]''. He is a [[vampire]] and the main character in the majority of ''[[The Vampire Chronicles]]'', narrated in first person.   

==Publication history==
Lestat de Lioncourt is the narrator and main character of the majority of the novels in Anne Rice's ''The Vampire Chronicles'' series. ''[[The Vampire Lestat]]'', the second book in the series, is presented as Lestat's autobiography, and follows his exploits from his youth in France to his early years as a vampire. Many of the other books in the series are also credited as being written by Lestat. 


*******************

thumb|right|Stuart Townsend as Lestat in the film ''Queen of the Damned''

'''Lestat de Lioncourt''' is a fictional character appearing in several novels by Anne Rice, including ''The Vampire Lestat''. He is a vampire and the main character in the majority of ''The Vampire Chronicles'', narrated in first person.   

Publication history
Lestat de Lioncourt is the narrator and main character of the majority of the novels in Anne Rice's ''The Vampire Chronicles'' series. ''The Vampire Lestat'', the second book in the series, is presented as Lestat's autobiography, and follows his exploits from his youth in France to his early years as a vampire. Many of the other books in the series are also credited as being written by Lestat. 

Upvotes: 1

Nicolas Dumazet

Reputation: 7231

If you mean "I want to get the wikitext only", then look at the wikipedia.Page class, and the get method.

import wikipedia

site = wikipedia.getSite('en', 'wikipedia')
page = wikipedia.Page(site, 'Test')

print(page.get())  # '''Test''', '''TEST''' or '''Tester''' may refer to:
#==Science and technology==
#* [[Concept inventory]] - an assessment to reveal student thinking on a topic.
# ...

This way you get the complete, raw wikitext from the article.

If you want to strip out the wiki syntax, that is, transform [[Concept inventory]] into Concept inventory and so on, it is going to be a bit more painful.

The main reason for this trouble is that the MediaWiki wiki syntax has no defined grammar, which makes it really hard to parse and to strip. I currently know of no software that does this accurately. There's the MediaWiki Parser class of course, but it's PHP, a bit hard to grasp, and its purpose is very different.

But if you only want to strip out links, or other very simple wiki constructs, use regexes:

import re

text = re.sub(r'\[\[([^\]\|]*)\]\]', r'\1', 'Lorem ipsum [[dolor]] sit amet, consectetur adipiscing elit.')
print(text)  # Lorem ipsum dolor sit amet, consectetur adipiscing elit.

and then for piped links:

text = re.sub(r'\[\[(?:[^\]\|]*)\|([^\]\|]*)\]\]', r'\1', 'Lorem ipsum [[dolor|DOLOR]] sit amet, consectetur adipiscing elit.')
print(text)  # Lorem ipsum DOLOR sit amet, consectetur adipiscing elit.

and so on.
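
The two substitutions above can be combined into a small helper (a sketch; strip_links is an illustrative name, not a pywikibot function). The plain-link pattern excludes '|', so it never mangles piped links:

```python
import re

def strip_links(text):
    """Replace [[target|label]] with label, then [[target]] with target.
    (Illustrative helper, not part of pywikibot.)"""
    text = re.sub(r'\[\[(?:[^\]\|]*)\|([^\]\|]*)\]\]', r'\1', text)
    text = re.sub(r'\[\[([^\]\|]*)\]\]', r'\1', text)
    return text

print(strip_links('See [[this link]] and [[dolor|DOLOR]].'))
# See this link and DOLOR.
```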

But there is, for example, no reliable easy way to strip out nested templates from a page, and the same goes for images that have links in their captions. It's quite hard, and involves recursively removing the innermost link, replacing it with a marker, and starting over. Have a look at the templateWithParams function in wikipedia.py if you want, but it's not pretty.
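
The recursive innermost-first idea can be sketched with a regex that only matches templates containing no further braces (a rough sketch; real template syntax has edge cases this ignores, and strip_templates is a made-up name):

```python
import re

# Matches a template whose body contains no further braces,
# i.e. an innermost template.
INNER_TEMPLATE = re.compile(r'\{\{[^{}]*\}\}')

def strip_templates(text):
    """Repeatedly delete innermost templates until none remain,
    so nested constructs like {{outer|{{inner}}}} collapse fully.
    (Illustrative sketch, not a pywikibot function.)"""
    while INNER_TEMPLATE.search(text):
        text = INNER_TEMPLATE.sub('', text)
    return text

print(strip_templates('a {{outer|{{inner}}}} b'))
```

The first pass removes {{inner}}, which turns {{outer|{{inner}}}} into the now-innermost {{outer|}}; the second pass removes that too.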

Upvotes: 5
