问题描述:

In Python, how would I go about grabbing all the titles, and plane text from a wikipedia article such as: https://en.wikipedia.org/wiki/Amadeus_(film). My current code is this:

 from bs4 import BeautifulSoup

# ---- Definitions ----#

#Amount of documents

amount_of_documents = 1

#Directory of raw HTML documents

directory_of_raw_documents = "raw_documents/"

#Directory of parsed documents

directory_of_parsed_documents = "parsed_documents/"

# ---- Code ----#

def open_document():

for i in range (1, 1+1):

with open(directory_of_raw_documents + str(i), "r") as document:

html = document.read()

soup = BeautifulSoup(html, "html.parser")

body = soup.find('div', id='bodyContent')

for elements in body.find_all('p'):

print(elements.text)

open_document()

I am loading a downloaded HTML file, then using BeautifulSoup to grab all the content between the <p> tags. My goal is to grab all titles and plain text content of this article. How would I go about doing this?

In the example posted above, my desired output would be contains:

  1. All the titles (Amadeus (film), Plot, Cast, Reception, etc)
  2. All the text within this page (between the <p> tags)
  3. IGNORING references

网友答案:

You might be interested in using specialized wikipedia page parsers like wikipedia package. This way you may get the contents easily:

In [1]: import wikipedia

In [2]: page = wikipedia.page("Amadeus (film)")

In [3]: page.summary
Out[3]: u"Amadeus is a 1984 American period drama film directed by Milo\u0161 Forman, written by Peter Shaffer, and adapted from Shaffer's stage play Amadeus (1979). The story, set in Vienna, Austria, during the latter half of the 18th century, is a fictionalized biography of Wolfgang Amadeus Mozart. Mozart's music is heard extensively in the soundtrack of the movie. Its central thesis is that Antonio Salieri, an Italian contemporary of Mozart is so driven by jealousy of the latter and his success as a composer that he plans to kill him and to pass off a Requiem, which he secretly commissioned from Mozart as his own, to be premiered at Mozart's funeral. Historically, the Requiem which was never finished was commissioned by Count von Walsegg and Salieri, far from being jealous of Mozart, was on good terms with him and even tutored his son after Mozart's death.\nThe film was nominated for 53 awards and received 40, which included eight Academy Awards (including Best Picture), four BAFTA Awards, four Golden Globes, and a Directors Guild of America (DGA) award. As of 2016, it is the most recent film to have more than one nomination in the Academy Award for Best Actor category. In 1998, the American Film Institute ranked Amadeus 53rd on its 100 Years... 100 Movies list."

In [4]: page.content
Out[4]: u'Amadeus is a 1984 American period drama film directed by Milo\u0161 Forman, written by Peter Shaffer, and adapted from Shaffer\'s s
...
Amadeus Filming locations at Movieloci.com'

As far as getting the titles, here is a sample code to get them via BeautifulSoup:

In [1]: import requests

In [2]: from bs4 import BeautifulSoup

In [3]: url = "https://en.wikipedia.org/wiki/Amadeus_(film)"

In [4]: response = requests.get(url)

In [5]: soup = BeautifulSoup(response.content, "html.parser")

In [6]: [item.get_text() for item in soup.select("h2 .mw-headline")]
Out[6]: 
[u'Plot',
 u'Cast',
 u'Production',
 u'Reception',
 u'Alternative versions',
 u'Music',
 u'Awards and nominations',
 u'References',
 u'External links']

The h2 .mw-headline is a CSS selector that would match elements with mw-headline class under the h2 parent elements.

相关阅读:
Top