Question:

I'm trying to write a script (preferably PHP) that crawls a site and creates a sitemap. In addition to the standard listing of pages, I'd like the script to keep track of which pages link to which other pages.

Example pages

A

B

C

D

I'd like the output to give me something like the following.

Page Name: A

Pages linking to Page A:

  • B
  • C
  • D

Page Name: B

Pages linking to Page B:

  • A
  • C

etc...

I've come across multiple standard sitemap scripts, but nothing that really accomplishes what I am looking for.


EDIT

It seems I didn't give enough information; sorry for the lack of clarity. Here is the code I currently have. I've used simple_html_dom.php to handle parsing and searching the HTML for me.

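<?php

include("simple_html_dom.php");

$url = 'page_url';

$html = new simple_html_dom();
$html->load_file($url);

$linkmap = array();

foreach ($html->find('a') as $link):
    // strpos() is assumed here in place of the contains() helper, which is not shown
    if (strpos($link->href, "cms/education") !== false):
        // $linkmap is keyed by URL, so test the key rather than the values
        if (!isset($linkmap[$link->href])):
            $linkmap[$link->href] = array();
        endif;
    endif;
endforeach;

?>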

Note: My little foreach loop just filters links based on a specific substring in the URL.

So, I have the necessary first level pages. Where I am stuck is in creating a loop that will not run indefinitely, while keeping track of the pages you have already visited.

Answer:

Basically, you need two arrays to control the flow here. The first will keep track of the pages you need to look at and the second will track the pages you have already looked at. Then you just run your existing code on each page until there are none left:

<?php

include("simple_html_dom.php");

$urlsToCheck = array('page_url'); // pages still to visit
$urlsChecked = array();           // pages already visited
$linkmap = array();               // created once, outside the loop, so it survives across pages

while (count($urlsToCheck) > 0)
{
   $url = array_pop($urlsToCheck);
   if (!in_array($url, $urlsChecked))
   {
      $urlsChecked[] = $url;

      $html = new simple_html_dom();
      $html->load_file($url);

      foreach ($html->find('a') as $link)
      {
         $href = $link->href; // compare URL strings, not element objects

         // strpos() is assumed here in place of the contains() helper, which is not shown
         if (strpos($href, "cms/education") !== false)
         {
            if (!in_array($href, $urlsToCheck) && !in_array($href, $urlsChecked))
               $urlsToCheck[] = $href;

            if (!isset($linkmap[$href]))
               $linkmap[$href] = array();
         }
      }
   }
}

?>
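
The loop above keeps track of visited pages, but the question also asks for the pages that link to each page. A minimal sketch of one way to record that follows. The function names buildInboundLinkMap and formatInboundLinkMap, the $getLinks callback, and the toy $site array are all hypothetical and only there for illustration. In a real crawl, the callback would presumably wrap the load_file() and find('a') calls shown above and return the filtered href strings.

```php
<?php
/**
 * Crawl from $startUrl and return a map of page => pages that link to it.
 *
 * $getLinks is a hypothetical callback: given a URL, it returns an array
 * of href strings found on that page.
 */
function buildInboundLinkMap(string $startUrl, callable $getLinks): array
{
    $queue   = [$startUrl];        // pages still to visit
    $visited = [];                 // url => true, so lookups do not scan an array
    $inbound = [$startUrl => []];  // page => list of pages linking to it

    while ($queue) {
        $url = array_shift($queue); // take from the front: breadth-first order
        if (isset($visited[$url])) {
            continue;               // already crawled, so the loop cannot run forever
        }
        $visited[$url] = true;

        foreach ($getLinks($url) as $href) {
            if (!isset($inbound[$href])) {
                $inbound[$href] = [];
            }
            // Record $url as a referrer of $href, once.
            if (!in_array($url, $inbound[$href], true)) {
                $inbound[$href][] = $url;
            }
            if (!isset($visited[$href])) {
                $queue[] = $href;
            }
        }
    }

    return $inbound;
}

/** Render the map in the "Page Name / Pages linking to" layout from the question. */
function formatInboundLinkMap(array $inbound): string
{
    $out = '';
    foreach ($inbound as $page => $sources) {
        $out .= "Page Name: $page\n\nPages linking to Page $page:\n\n";
        foreach ($sources as $source) {
            $out .= "  • $source\n";
        }
        $out .= "\n";
    }
    return $out;
}

// Toy site standing in for real pages, so the sketch runs without network access.
$site = [
    'A' => ['B', 'C'],
    'B' => ['A'],
    'C' => ['A', 'B', 'D'],
    'D' => ['A'],
];

$inbound = buildInboundLinkMap('A', function (string $url) use ($site): array {
    return $site[$url] ?? [];
});

echo formatInboundLinkMap($inbound);
```

With this toy data, page A ends up with B, C and D as referrers and page B with A and C, which matches the layout asked for in the question. Only pages reachable from the start URL appear in the result. Keying $visited by URL is a design choice: isset() on a key avoids the repeated in_array() scans, which get slow as the number of pages grows.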