≡ Menu
Deep in the Code

Using PHP to Scrape the Report Card from a Code School Profile – Part 1

I have been manually placing my Master badges from Code School onto my blog which, being a WordPress blog, runs on PHP. I don’t have the script running that shows the fraction completed on Paths that I haven’t yet mastered, but I can get the ones that I’ve completed.

The PHP script I’ve built so far is below. Due to some CSS I want to change, I haven’t implemented it yet. But I can say that it does indeed scrape the page. The jQuery required to display it in the sidebar I’ll share once I get the CSS issues worked out. This is very similar to the code I used to get the CodeEval profile, but it’s been refactored and modified for use with the Code School page.

<?php
    function getClass($classname, $htmltext)
    {
        $dom = new DOMDocument;
        $dom->loadHTML($htmltext);
        $xpath = new DOMXPath($dom);
        $results = $xpath->query("//*[@class='" . $classname . "']");
        return $results;
    }
    
    
    function buildContent($results)
    {
        $content = "";
        foreach ($results as $node) {
            $partial_content = innerHTML($node);
            $content = $content . $partial_content;
        }
        return $content;
    }
    
    
    /* this function preserves the inner content of the scraped element. 
    ** http://stackoverflow.com/questions/5349310/how-to-scrape-web-page-data-without-losing-tags
    ** So be sure to go and give that post an uptick too 🙂
    **/
    function innerHTML(DOMNode $node)
    {
      $doc = new DOMDocument();
      foreach ($node->childNodes as $child) {
        $doc->appendChild($doc->importNode($child, true));
      }
      return $doc->saveHTML();
    }
    
    
    $previous_value = libxml_use_internal_errors(TRUE);
    $profilename = $_GET['nick'];
    $profile_url =  'https://www.codeschool.com/users/' . $profilename . '/';
    $context = stream_context_create(array(
        'https' => array('ignore_errors' => true),
    ));
    $html = file_get_contents($profile_url, false, $context);  

    $class = 'bucket';
    $resultsBucket = getClass($class,$html);
    
    $class = 'mbl tac';
    $resultsMaster = getClass($class,$html);
    
    $class = 'pr-pathStatus';
    $resultsPath = getClass($class,$html);
    
    libxml_clear_errors();
    libxml_use_internal_errors($previous_value);
?>

<a href="<?php echo $profile_url; ?>" target="_blank">
    <div class="wrapper-scores">
        <?php
        $full_content = "";
        
        $full_content = $full_content . buildContent($resultsBucket);
        $full_content = $full_content . buildContent($resultsMaster);
        $full_content = $full_content . buildContent($resultsPath);
        
        /* changing h2 tags to h1 tags and inserting line breaks */
        $full_content = str_replace("<h2","<br /><h1",$full_content);
        $full_content = str_replace("</h2>","</h1><br />",$full_content);
        
        /* disabling the anchor tags on each badge by changing to divs */
        $full_content = str_replace("<a href","<div class",$full_content);
        $full_content = str_replace("</a>","</div>",$full_content);
        
        /* changing text on heading of Path Status */
        $full_content = str_replace("Path Status","Paths In Progress",$full_content);
        
        /* return the html */
        echo $full_content;
        ?>
    </div>
</a>

About the author: I solve problems. Solutions Architect / Senior Software Engineer / Business Analyst / Full-Stack Developer / IT Generalist