Question:

I'm writing a Python snippet to fix the casing of titles in HTML code. So far, I've come up with this code:

import re

pattern = re.compile("<h1>(.*)</h1>|<h2>(.*)</h2>|<h3>(.*)</h3>|<h4>(.*)</h4>|<h5>(.*)</h5>|<h6>(.*)</h6>")

def replace(m):
    contents = m.group(1)
    replacement = contents[0] + contents[1:].lower()
    return replacement

Then, given a line, the transformation I use is line = pattern.sub(replace, line).

This doesn't work, because m.group(1) is None whenever the match comes from any clause other than the first, whereas I'd like it to be the match corresponding to whichever clause of my regex matched. Since groups can't share a name in Python, I'm somewhat at a loss.

An obvious solution is to merge all the patterns into one group, but then <h1>bla</h2> would be recognized. That's not good, since <h1><a href="...">Bla</a></h1> <h2>Bla</h2> should yield two matches (<a href="...">Bla</a>, and Bla).

Ideas?
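
As a sketch of one regex-only route (not taken from the answers below): a backreference lets a single pattern cover all six levels while still requiring the closing tag to match the opening one. The names HEADING and fix_case are made up for this illustration, and the sketch assumes headings without attributes that sit on a single line.

```python
import re

# \1 must repeat the digit captured by ([1-6]), so <h1>...</h2> cannot match.
# .*? is non-greedy so that two headings on one line give two matches.
HEADING = re.compile(r"<h([1-6])>(.*?)</h\1>")

def fix_case(m):
    level, contents = m.group(1), m.group(2)
    # Keep the first character, lowercase the rest (safe on empty contents).
    fixed = contents[:1] + contents[1:].lower()
    # Re-emit the tags, otherwise sub() would drop them from the line.
    return "<h{0}>{1}</h{0}>".format(level, fixed)

line = "<h1>HELLO World</h1> <h2>BLA</h2>"
print(HEADING.sub(fix_case, line))
```

Note that lower() is applied to everything between the tags, so nested markup such as an href value would be lowercased as well. That limitation is one reason a parser is usually the safer choice.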

Answer:

From what I understand you just want to capitalize all of the headings. You can use lxml which would make this fairly painless:

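import lxml.html

doc = lxml.html.parse(your_html)  # your_html: a filename, URL, or file-like object
for i in range(1, 7):
    for h in doc.xpath('//h%d' % i):
        if h.text:  # h.text is None when the heading starts with a child element
            h.text = h.text.capitalize()

print(lxml.html.tostring(doc, encoding='unicode'))

Answer: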

Why do you care about that? HTML tags are not case sensitive. If you need a proper solution, then use a tool like BeautifulSoup. Parsing HTML with regular expressions is nonsense and never advisable (this has been discussed often enough).

Answer:

You may want to have a look at this question and all the tons of comments and answers to it. :-)

Use

  • lxml or
  • BeautifulSoup

to parse HTML.
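
To illustrate the parser-based idea without assuming either library is installed, here is a minimal sketch that swaps in the standard library's html.parser instead. The names HeadingCaseFixer and fix_headings are invented for this example. The sketch re-emits tags, text, and entities only, so comments and the doctype are dropped, and closing tags come out in lowercase.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCaseFixer(HTMLParser):
    """Re-emit the input, fixing the case of text inside h1..h6."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.depth = 0  # greater than 0 while inside a heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in HEADINGS and self.depth:
            self.depth -= 1
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.depth:
            # Keep the first character, lowercase the rest of this text run.
            data = data[:1] + data[1:].lower()
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def fix_headings(html):
    fixer = HeadingCaseFixer()
    fixer.feed(html)
    fixer.close()
    return "".join(fixer.out)


print(fix_headings('<h1><a href="x">BLA Bla</a></h1> <h2>BLA</h2><p>KEEP</p>'))
```

Because only text runs are rewritten, attribute values such as the href stay untouched, which a plain regex over the heading contents would not guarantee.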

Answer:

The following XPath expression selects all the wanted text nodes:

//*[starts-with(name(), 'h')
    and substring(name(), 2) >= 1
    and not(substring(name(), 2) > 6)
   ]
   //text()
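
A rough sketch of how that expression might be used from Python, assuming lxml is installed. The names HEADING_TEXT and fix_heading_case are hypothetical. lxml returns each text node as a string that remembers its parent element and whether it was that element's text or tail, which is what makes writing the result back possible.

```python
import lxml.html

# The XPath from the answer: every text node under h1..h6.
HEADING_TEXT = (
    "//*[starts-with(name(), 'h')"
    " and substring(name(), 2) >= 1"
    " and not(substring(name(), 2) > 6)]"
    "//text()"
)


def fix_heading_case(html):
    root = lxml.html.fromstring(html)
    for text in root.xpath(HEADING_TEXT):
        parent = text.getparent()
        # Keep the first character, lowercase the rest of this text node.
        fixed = text[:1] + text[1:].lower()
        if text.is_text:
            parent.text = fixed
        else:
            # The node was the tail following the element `parent`.
            parent.tail = fixed
    return lxml.html.tostring(root, encoding="unicode")


print(fix_heading_case("<div><h1>HELLO World</h1><p>KEEP This</p></div>"))
```

The casing is applied per text node, not per heading, so a heading with nested elements gets each piece of text adjusted separately.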