Question:

I am using this pattern to remove all HTML tags (Java code):

String html="text <a href=#>link</a> <b>b</b> pic<img src=#>";
html=html.replaceAll("\\<.*?\\>", "");
System.out.println(html);

Now I want to keep the <a ...> tag (together with its closing </a>) and the <img ...> tag.

I want the result to be:

text <a href=#>link</a> b pic<img src=#>

How can I do this?

I don't want to use an HTML parser for this, because I need the pattern to filter a large number of HTML fragments, so I am looking for a regex-based solution.

Answer:

You could do this using a negative lookahead:

"<(?!(?:a|/a|img)\\b).*?>"

Rubular
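
As a minimal sketch, this is how the lookahead pattern might be applied to the string from the question. The class name KeepTagsRegex and the method name stripExceptLinksAndImages are made up for illustration; only the pattern itself comes from the answer above.

```java
import java.util.regex.Pattern;

final class KeepTagsRegex {
    // Matches any tag whose name is not "a", "/a" or "img".
    // The \b stops a prefix from counting as a kept tag,
    // so <abbr> or <image> would still be removed.
    private static final Pattern UNWANTED_TAG =
            Pattern.compile("<(?!(?:a|/a|img)\\b).*?>");

    // Hypothetical helper: removes every tag matched by UNWANTED_TAG.
    static String stripExceptLinksAndImages(String html) {
        return UNWANTED_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "text <a href=#>link</a> <b>b</b> pic<img src=#>";
        // Prints: text <a href=#>link</a> b pic<img src=#>
        System.out.println(stripExceptLinksAndImages(html));
    }
}
```

Compiling the pattern once and reusing it avoids recompiling on every call, which matters if a lot of fragments are filtered.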

However, this has a number of problems (for example, it breaks on a ">" inside an attribute value and on tags that span several lines), and I would recommend instead that you use an HTML parser if you want a robust solution.
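
To illustrate two of the ways the pattern can go wrong, here is a small sketch with inputs I made up (they are not from the question). The class name RegexPitfalls and the method strip are hypothetical.

```java
import java.util.regex.Pattern;

final class RegexPitfalls {
    private static final Pattern UNWANTED_TAG =
            Pattern.compile("<(?!(?:a|/a|img)\\b).*?>");

    // Hypothetical helper: applies the lookahead pattern as-is.
    static String strip(String html) {
        return UNWANTED_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // A '>' inside an attribute value ends the match too early,
        // so the tail of the opening tag is left in the output.
        System.out.println(strip("<b title=\"1 > 0\">bold</b>"));

        // '.' does not match a line break by default, so an opening tag
        // that spans two lines is never matched and stays in the output.
        System.out.println(strip("<b\nclass=x>bold</b>"));
    }
}
```

The first call leaves ` 0">bold` behind, and the second leaves the whole multi-line `<b ...>` tag in place. Compiling with Pattern.DOTALL would address the second case but not the first.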

For more information see this question:

  • What HTML parsing libraries do you recommend in Java
Answer:

Check out http://sourceforge.net/projects/regexcreator/. It is a very handy GUI regex editor.

Answer:

You can’t parse [X]HTML with regex.

Answer:

Use a proper HTML parser, for example htmlparser, Jericho or the validator.nu HTML parser. Then use the parser’s API, SAX or DOM to pull out the stuff you’re interested in.

If you insist on using regular expressions, you’re almost certain to make some small mistake that will lead to breakage, and possibly to cross-site scripting attacks, depending on what you’re doing with the markup.
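
As a rough illustration of the scripting risk, assuming the negative-lookahead pattern suggested in another answer: a tag that is kept is kept with all of its attributes, and removing a tag does not remove the text inside it. The class name KeptTagAttributes, the method strip and the input string are made up for this sketch.

```java
import java.util.regex.Pattern;

final class KeptTagAttributes {
    private static final Pattern UNWANTED_TAG =
            Pattern.compile("<(?!(?:a|/a|img)\\b).*?>");

    // Hypothetical helper: applies the lookahead pattern as-is.
    static String strip(String html) {
        return UNWANTED_TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html =
                "<a href=# onclick=\"alert(1)\">x</a><script>alert(2)</script>";
        // The <a> tag survives together with its onclick handler, and the
        // script tags are removed while their content stays as plain text.
        System.out.println(strip(html));
    }
}
```

This prints `<a href=# onclick="alert(1)">x</a>alert(2)`, so if the output is shown to other users the attributes would still have to be filtered separately.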

See also this answer.

Answer:

I recommend strip_tags (a PHP function). Its optional second argument lists the tags to keep:

string strip_tags ( string $str [, string $allowable_tags ] )

<?php
$text = '<p>Test paragraph.</p><!-- Comment --> <a href="#fragment">Other text</a>';

// Strip every tag (and the comment)
echo strip_tags($text);
echo "\n";

// Allow <p> and <a>
echo strip_tags($text, '<p><a>');
?>

OUTPUT

Test paragraph. Other text
<p>Test paragraph.</p> <a href="#fragment">Other text</a>