Vasudev Ram: Using the wikipedia Python library (to search for oranges :)


By Vasudev Ram

Orange and orange juice image from Wikimedia Commons.

I came across the wikipedia Python library some time ago. Note that I said "wikipedia Python library", not "Wikipedia Python API". That's because wikipedia is a Python library that wraps the Wikipedia API, providing a somewhat higher-level, easier-to-use interface to the programmer.

Here are a few basic ways of using this library:

First, install it with this command at your OS command line:

$ pip install wikipedia

(I am using $ as the command line prompt, so don't type it.)

Now the Python code snippets:

Import the library:

import wikipedia

Use the .page() method to search for a Wikipedia page:

print "1: Searching Wikipedia for 'Orange'"
try:
    print wikipedia.page('Orange')
except wikipedia.exceptions.DisambiguationError as e:
    # The title matches several pages; str(e) lists the candidates.
    print str(e)
    print 'DisambiguationError: The page name is ambiguous'
print

The output (partly truncated) is:

1: Searching Wikipedia for 'Orange'
"Orange" may refer to:
Orange (colour)
Orange (fruit)
Some other citrus or citrus-like fruit
Orange (manga)
Orange (2010 film)
Orange (2012 film)
Oranges (film)
The Oranges (film)
Orange Record Label
Orange (band)
Orange (Al Stewart album)
Orange (Jon Spencer Blues Explosion album)
"Orange" (song)
Between the Eyes
"L'Orange" (song)
DisambiguationError: The page name is ambiguous
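The candidate titles in that message are also available as a list on the exception's options attribute, so a program can pick one instead of just printing them. Here is a minimal sketch of that idea, written for Python 3 (unlike the Python 2 snippets in this post). The helper names first_matching_option and page_for are hypothetical, and picking the first option that contains a keyword is just one simple policy, not something the library does for you:

```python
def first_matching_option(options, keyword):
    """Return the first option containing keyword (case-insensitive), or None."""
    keyword = keyword.lower()
    for option in options:
        if keyword in option.lower():
            return option
    return None


def page_for(title, keyword):
    """Fetch a page; on ambiguity, retry with the option matching keyword.

    Hypothetical helper. Needs the wikipedia package and network access,
    so the import is done lazily inside the function.
    """
    import wikipedia

    try:
        return wikipedia.page(title)
    except wikipedia.exceptions.DisambiguationError as e:
        choice = first_matching_option(e.options, keyword)
        if choice is None:
            raise  # nothing suitable; let the caller see the ambiguity
        # auto_suggest=False asks for the exact title we picked.
        return wikipedia.page(choice, auto_suggest=False)


# Offline check of the selection logic, using titles from the output above.
options = ["Orange (colour)", "Orange (fruit)", "Orange (manga)"]
print(first_matching_option(options, "fruit"))
```

With network access, page_for('Orange', 'fruit') should then return the page object for the fruit rather than raising the error.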

Next, use the .page method with one of the results from above, which are actual page titles:

print "2: Searching Wikipedia for 'Orange_(fruit)'"
print wikipedia.page('Orange_(fruit)')

The output may not be what one expects:

2: Searching Wikipedia for 'Orange_(fruit)'
<WikipediaPage 'Orange (fruit)'>

That's because the return value from the above call is a WikipediaPage object, not the page content itself. To get the content we want, we have to access the 'content' attribute of the WikipediaPage object.
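As a rough illustration of reading attributes off the returned object, here is a small sketch written for Python 3. The title and content attribute names are the ones the library's page object exposes; the describe helper and the stand-in object are hypothetical, and are only there so that the example runs without network access:

```python
from types import SimpleNamespace


def describe(page, width=60):
    """Return 'title: start of the first content line' for a page-like object."""
    first_line = page.content.split("\n", 1)[0]
    return "{}: {}".format(page.title, first_line[:width])


# Hypothetical stand-in with the same attribute names as a WikipediaPage,
# so that the example runs offline.
fake_page = SimpleNamespace(
    title="Orange (fruit)",
    content="The orange is the fruit of a citrus species.\nMore text here.",
)
print(describe(fake_page))

# With the real library (needs network access), the call would be:
#     import wikipedia
#     print(describe(wikipedia.page('Orange_(fruit)')))
```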


However, if we access it directly, we may get a Unicode error, so we encode it to UTF-8:

result = wikipedia.page('Orange_(fruit)').content.encode('UTF8')
print "3: Result of searching Wikipedia for 'Orange_(fruit)':"
print result
# Count occurrences of the substring 'orange' (case-sensitive).
orange_count = result.count('orange')
print
print "The Wikipedia page for 'Orange_(fruit)' has " + \
    "{} occurrences of the word 'orange'".format(orange_count)

Here are the first few lines of the output, followed by the count at the end:

3: Result of searching Wikipedia for 'Orange_(fruit)':
The orange (specifically, the sweet orange) is the fruit of the citrus species Citrus × sinensis in the family Rutaceae.
The fruit of the Citrus × sinensis is considered a sweet orange, whereas the fruit of the Citrus × aurantium is considered a bitter orange. The sweet orange reproduces asexually (apomixis through nucellar embryony); varieties of sweet orange arise through mutations.
The orange is a hybrid, between pomelo (Citrus maxima) and mandarin (Citrus reticulata). It has genes that are ~25% pomelo and ~75% mandarin; however, it is not a simple backcrossed BC1 hybrid, but hybridized over multiple generations. The chloroplast genes, and therefore the maternal line, seem to be pomelo. The sweet orange has had its full genome sequenced. Earlier estimates of the percentage of pomelo genes varying from ~50% to 6% have been reported.
Sweet oranges were mentioned in Chinese literature in 314 BC. As of 1987, orange trees were found to be the most cultivated fruit tree in the world. Orange trees are widely grown in tropical and subtropical climates for their sweet fruit. The fruit of the orange tree can be eaten fresh, or processed for its juice or fragrant peel. As of 2012, sweet oranges accounted for approximately 70% of citrus production.
In 2013, 71.4 million metric tons of oranges were grown worldwide, production being highest in Brazil and the U.S. states of Florida and California.

The Wikipedia page for 'Orange_(fruit)' has 172 occurrences of the word 'orange'
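Note that str.count() counts substrings and is case-sensitive, so the figure above includes 'oranges' but misses 'Orange'. If you want whole-word, case-insensitive counts instead, something like the following sketch may help. It is written for Python 3, where the page content is already a Unicode string and needs no encoding before counting. The count_word helper and the sample text are made up for illustration:

```python
import re


def count_word(text, word):
    """Count whole-word, case-insensitive occurrences of word in text."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Made-up sample text, so that the example runs without network access.
sample = ("The orange is a fruit. Orange trees are widely grown. "
          "Sweet oranges and bitter oranges differ.")

# Substring count: matches 'orange' and both 'oranges', but not 'Orange'.
print(sample.count("orange"))        # 3
# Whole-word count: matches 'orange' and 'Orange', but not 'oranges'.
print(count_word(sample, "orange"))  # 2
```

Which of the two counts is "right" depends on what you want to measure; the sketch only shows that they can differ.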

- Enjoy.

- Vasudev Ram - Online Python training and programming