Problem description:

I am using a regular expression to capture both text1 and text2 from the HTML below. Here is the pattern I am using:

/<div\s?class="right-col">[\s\n\S]*<a[\s\n]?[^>]*>@(.*)<\/a>/

but apparently I missed text1 and only got text2 (here is the link to my problem).

<div class="right-col">

<h1>

<a href="url-link-here" title="title-here">title1</a>

</h1>

<p>some text here</p>

<div class="some-class">

<div class="left">

<span><a href="url-link-here" class="breaking" target="_blank">some text here </a></span>

</div>

<div class="postmeta"><a href="url-link-here" >@text1</a> </div>

</div>

<div class="right-col">

<h1>

<a href="url-link-here" title="title-here">title2</a>

</h1>

<p>some text here</p>

<div class="some-class">

<div class="left">

<span><a href="url-link-here" class="breaking" target="_blank">some text here </a></span>

</div>

<div class="postmeta"><a href="url-link-here" >@text2</a> </div>

</div>

Can you tell me what is wrong with my regular expression? Is there a better way to capture title1 and title2 as well as text1 and text2?

Answer:

This is a fairly common issue with regular expressions: quantifiers are greedy by default. [\s\S]* (the \n is redundant, since \s already matches newlines) consumes as much input as it can, so it runs past the first <a ...>@ link and the engine only backtracks as far as the last one. That is why you get text2 but not text1. Adding a ? makes the quantifier lazy (non-greedy), and with that change your link returns both text1 and text2.

The short answer is to replace [\s\n\S]* with [\s\S]*? but, as others have mentioned, this is probably not a good use of regular expressions.
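To make the greedy vs. lazy difference concrete, here is a minimal sketch that runs both patterns through preg_match_all. The markup is a trimmed-down, well-formed stand-in for the HTML in the question (an assumption for illustration), and the variable names are hypothetical.

```php
<?php
// Assumption: a shortened version of the question's markup, two blocks only.
$html = <<<HTML
<div class="right-col">
<h1><a href="url-link-here" title="title-here">title1</a></h1>
<div class="postmeta"><a href="url-link-here" >@text1</a> </div>
</div>
<div class="right-col">
<h1><a href="url-link-here" title="title-here">title2</a></h1>
<div class="postmeta"><a href="url-link-here" >@text2</a> </div>
</div>
HTML;

// Greedy: [\s\S]* runs to the end of the input, then backtracks to the LAST "@" link.
$greedy = '/<div\s?class="right-col">[\s\S]*<a[\s\n]?[^>]*>@(.*)<\/a>/';

// Lazy: [\s\S]*? expands only until the FIRST "@" link after each opening div.
$lazy = '/<div\s?class="right-col">[\s\S]*?<a[\s\n]?[^>]*>@(.*)<\/a>/';

preg_match_all($greedy, $html, $greedyMatches);
preg_match_all($lazy, $html, $lazyMatches);

print_r($greedyMatches[1]); // a single capture: text2
print_r($lazyMatches[1]);   // two captures: text1 and text2
```

The only difference between the two patterns is the ? after [\s\S]*, so this isolates the effect of greediness from everything else in the expression.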

Answer:

Using a regular expression here is not the best way to do it; parsing HTML with regex is generally considered bad practice. You should use a DOM/XML parser instead.

I like using PHP's DOMDocument class. Using XPath, we can quickly find the elements you want:

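// $html is assumed to hold the markup from the question.
$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);

// Select every <a> inside div.some-class whose text starts with "@".
$aTags = $xPath->query('//div[@class="some-class"]//a[starts-with(text(), "@")]');

foreach ($aTags as $a) {
    echo $a->nodeValue;
}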

DEMO: http://codepad.viper-7.com/QHOXzH
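The question also asks for title1 and title2. One way to extend the same DOMXPath approach is to loop over each right-col block and run relative queries inside it, as in the sketch below. It assumes every right-col div is properly closed (the markup in the question is not) and that the title is the link inside the h1; the sample markup and the $pairs variable are hypothetical.

```php
<?php
// Assumption: a shortened, well-formed version of the question's markup.
$html = <<<HTML
<div class="right-col">
<h1><a href="url-link-here" title="title-here">title1</a></h1>
<div class="some-class">
<div class="postmeta"><a href="url-link-here" >@text1</a> </div>
</div>
</div>
<div class="right-col">
<h1><a href="url-link-here" title="title-here">title2</a></h1>
<div class="some-class">
<div class="postmeta"><a href="url-link-here" >@text2</a> </div>
</div>
</div>
HTML;

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();
$xPath = new DOMXPath($dom);

$pairs = [];
foreach ($xPath->query('//div[@class="right-col"]') as $block) {
    // The leading "." keeps each query relative to the current block.
    $title = $xPath->query('.//h1/a', $block)->item(0);
    $meta  = $xPath->query('.//a[starts-with(text(), "@")]', $block)->item(0);
    if ($title !== null && $meta !== null) {
        // Strip surrounding whitespace and the leading "@" from the meta link.
        $pairs[] = [trim($title->nodeValue), ltrim(trim($meta->nodeValue), '@')];
    }
}

print_r($pairs);
```

Pairing the two values per block keeps each title attached to its own @text even if some block is missing one of the links.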
