Crawl a webpage for all URLs

Method – 1

<?php
// Recursively crawl a page and collect the URLs of every <a> link found on it.
// $depth limits how many levels of links are followed; already-seen URLs are skipped.
function crawl_page($url, $depth = 5)
{
    $arr = array();
    static $seen = array();

    // Stop if this URL was already crawled or the depth limit is reached.
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }

    $seen[$url] = true;

    // Load the page into a DOM tree; @ suppresses warnings from malformed HTML.
    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);

    // Walk every <a> element and resolve relative hrefs against the current URL.
    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        if (0 !== strpos($href, 'http')) {
            // Relative link: rebuild an absolute URL.
            $path = '/' . ltrim($href, '/');
            if (extension_loaded('http')) {
                // The pecl_http extension provides http_build_url() for merging URL parts.
                $href = http_build_url($url, array('path' => $path));
            } else {
                // Fallback: reassemble the URL manually from its components.
                $parts = parse_url($url);
                $href = $parts['scheme'] . '://';
                if (isset($parts['user']) && isset($parts['pass'])) {
                    $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                }
                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
                $href .= dirname($parts['path'], 1) . $path;
            }
        }
        crawl_page($href, $depth - 1);
        $arr[] = $href;
    }
    //echo "URL:", $url, PHP_EOL, "CONTENT:", PHP_EOL, $dom->saveHTML(), PHP_EOL, PHP_EOL;
    return $arr;
}

$a = crawl_page("http://79.127.126.110/Serial/Agents%20of%20S.H.I.E.L.D/S02/480p/", 2);
print_r($a);

?>
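
Note that crawl_page() as written only returns the links found on the page you pass in; the arrays returned by the recursive calls are discarded. If you also want the links discovered on deeper pages collected into one flat list, a minimal sketch of one way to do it (a hypothetical crawl_page_flat() variant, not part of the original code, with the relative-URL handling omitted for brevity) is to merge the recursive return values:

<?php
// Hypothetical variant: collect links from every crawled page into one flat array.
function crawl_page_flat($url, $depth = 5)
{
    static $seen = array();
    $arr = array();

    if (isset($seen[$url]) || $depth === 0) {
        return $arr;   // always return an array, never null, so array_merge() is safe
    }
    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);

    foreach ($dom->getElementsByTagName('a') as $element) {
        $href = $element->getAttribute('href');
        // (Resolve relative hrefs here exactly as in Method 1 before following them.)
        $arr[] = $href;
        // Merge the links found on the linked page into this page's result.
        $arr = array_merge($arr, crawl_page_flat($href, $depth - 1));
    }
    return $arr;
}
?>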


Method – 2

In its simplest form:

function crawl_page($url, $depth = 5) {
    if ($depth > 0) {
        // Fetch the raw HTML of the page.
        $html = file_get_contents($url);

        // Extract the href value of every <a> tag with a simple regex.
        preg_match_all('~<a.*?href="(.*?)".*?>~', $html, $matches);

        // Recursively crawl each link found on this page.
        foreach ($matches[1] as $newurl) {
            crawl_page($newurl, $depth - 1);
        }

        // Append the crawled URL and its contents to the results file.
        file_put_contents('results.txt', $url."\n\n".$html."\n\n", FILE_APPEND);
    }
}

crawl_page('http://www.domain.com/index.php', 5);

That function fetches the contents of a page, crawls every link it finds, and appends the URL and its contents to 'results.txt'. The function accepts a second parameter, $depth, which defines how many levels of links are followed. Pass 1 if you only want to parse the links on the given page.
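
One caveat with this simple version: it keeps no record of which URLs it has already visited, so pages that link to each other can be fetched repeatedly within the depth limit. A minimal sketch of one way to guard against that (a hypothetical crawl_page_once() variant using a static $seen array, as Method 1 does) is:

// Hypothetical variant of Method 2 with a visited-URL guard, so the same page
// is not fetched and written to results.txt more than once.
function crawl_page_once($url, $depth = 5) {
    static $seen = array();

    if ($depth <= 0 || isset($seen[$url])) {
        return;
    }
    $seen[$url] = true;

    $html = file_get_contents($url);

    preg_match_all('~<a.*?href="(.*?)".*?>~', $html, $matches);

    foreach ($matches[1] as $newurl) {
        crawl_page_once($newurl, $depth - 1);
    }

    file_put_contents('results.txt', $url."\n\n".$html."\n\n", FILE_APPEND);
}

crawl_page_once('http://www.domain.com/index.php', 5);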