Using PHP to Scrape the Report Card from a DataCamp Profile

UPDATE: This does not currently work, as DataCamp has changed the structure of profile pages. I will revise this as soon as is feasible. (7/26/17)

Just as I have written scripts for displaying report cards from Code School, CodeEval, and Duolingo on my blog, I have written the script below for displaying a DataCamp profile.

datacamp.php:

<style>
#datacamp {
   border: 1px solid blue;
   text-align: center;
   vertical-align: middle;
   width: 100%;
}

#datacamp li {
   list-style-type: none;
}

.image-centered {
   display:block;
   margin-left: auto;
   margin-right: auto;
   margin-bottom: 10px;
}

.image-rounded {
    border-radius: 50%;
}

.progress-bar {
    position: relative;
    border: 1px solid #33aacc;
    width: 100%;
    height: 18px;
    margin-bottom: 1rem;
}

.progress-bar .inner {
   position: absolute;
   left: 0;
   top: 0;
   bottom: 0;
   background-color: #33aacc;
   min-width: 5px;
}

.wrapper-scores .container {
   min-width: 150px;
   max-width: 260px;
   width: 100%;
}

.course-block__completed, .course-block__certificate-download, .btn-linkedin-share,
.course-block__description, .course-block__author {
   display: none;
}

.col-sm-4 img {
   display: inline;
   padding: 1px;
}
</style>
<?php
    date_default_timezone_set('America/Los_Angeles');

    function getClass($classname, $htmltext)
    {
        $dom = new DOMDocument;
        $dom->loadHTML($htmltext);
        $xpath = new DOMXPath($dom);
        $results = $xpath->query("//*[@class='" . $classname . "']");
        return $results;
    }

    function buildContent($results)
    {
        $content = "";
        foreach ($results as $node) {
            $partial_content = innerHTML($node);
            $content = $content . $partial_content;
        }
        return $content;
    }

    /* this function preserves the inner content of the scraped element.
    ** http://stackoverflow.com/questions/5349310/how-to-scrape-web-page-data-without-losing-tags
    ** So be sure to go and give that post an uptick too 🙂
    **/
    function innerHTML(DOMNode $node)
    {
      $doc = new DOMDocument();
      foreach ($node->childNodes as $child) {
        $doc->appendChild($doc->importNode($child, true));
      }
      return $doc->saveHTML();
    }

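    /* the nickname ends up in a URL and a file name, so reject anything unexpected;
    ** the allowed character set is an assumption about DataCamp usernames */
    $profilename = isset($_GET['nick']) ? $_GET['nick'] : '';
    if (!preg_match('/^[A-Za-z0-9_.-]+$/', $profilename))
	exit(1);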
    $profile_url =  'https://www.datacamp.com/profile/' . $profilename . '/';
    $filename = "datacamp_" . $profilename . ".txt";
    $full_content = '';
    $norefresh = FALSE;
    $days = 1;
    $updated = 'no date';

    /* checks to see if file exists and is current */
    if (file_exists($filename)) {
	    $stats = stat($filename);
	    /* 86400 seconds in one day */
	    if ($stats[9] > (time() - (86400 * $days))) {
	    	$norefresh = TRUE;
	    	$updated = date("Y-m-d H:i:s", $stats[9]);
	    }
    }

    /* if $norefresh is still FALSE, file will be created or updated; otherwise, it will be loaded */
    if ($norefresh) {
    	$full_content = file_get_contents($filename);
    } else {
	$previous_value = libxml_use_internal_errors(TRUE);
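	/* HTTP context options live under the 'http' key, even for https:// URLs */
	$context = stream_context_create(array(
	'http' => array('ignore_errors' => true),
	));
	$html = file_get_contents($profile_url, false, $context);
	/* stop here if the request failed, so an existing cache file is not overwritten */
	if ($html === false || $html === '')
		exit(1);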

	$class = 'profile-page';
	$resultsBucket = getClass($class,$html);

	libxml_clear_errors();
	libxml_use_internal_errors($previous_value);

	$full_content = $full_content . buildContent($resultsBucket);

	/* making root-relative image paths absolute, on the same host as $profile_url */
	$full_content = str_replace("src=\"/","src=\"https://www.datacamp.com/",$full_content);

	/* changing h2 tags to h1 tags and inserting line breaks */
	$full_content = str_replace("<h2","<br /><h1",$full_content);
	$full_content = str_replace("</h2>","</h1><br />",$full_content);

	/* disabling the anchor tags on each badge by changing to divs */
	$full_content = str_replace("<a href","<div class",$full_content);
	$full_content = str_replace("<a class","<div class",$full_content);
	$full_content = str_replace("</a>","</div>",$full_content);

	/* adding line breaks */
	$full_content = str_replace("<div class=\"stats\">","<br /><div class=\"stats\">",$full_content);
	$full_content = str_replace("Earned</span>","Earned</span><br />",$full_content);
	$full_content = str_replace("Completed</span>","Completed</span><br />",$full_content);
	$full_content = str_replace("Aced</span>","Aced</span><br /><br />",$full_content);

	file_put_contents($filename,$full_content);
	$updated = date("Y-m-d H:i:s");
    }
?>
<a href="<?php echo htmlspecialchars($profile_url); ?>" target="_blank">
	<div class="wrapper-scores">
	<!-- <?php echo "Last updated: $updated" ?> -->
	<?php
	    	/* return the html */
		echo $full_content;
	?>
	</div>
</a>
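
One fragile spot in the script above is that getClass() compares the whole class attribute, so an element carrying class="profile-page" plus any other class is not found. A minimal sketch of a token-based match follows, assuming the wanted class name contains no quote characters. The helper name classTokenQuery() and the sample markup are made up for illustration and are not taken from DataCamp's pages.

```php
<?php
// Hypothetical helper: builds an XPath expression that matches any element
// whose class attribute contains $classname as one whitespace-separated token.
// Assumes $classname contains no quote characters.
function classTokenQuery(string $classname): string
{
    return "//*[contains(concat(' ', normalize-space(@class), ' '), ' "
        . $classname . " ')]";
}

// Made-up markup, for illustration only.
$sample = '<html><body>'
    . '<div class="profile-page dark">A</div>'
    . '<div class="profile-page-footer">B</div>'
    . '<div class="profile-page">C</div>'
    . '</body></html>';

// Silence libxml warnings while parsing, then restore the previous setting.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($sample);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
$texts = [];
foreach ($xpath->query(classTokenQuery('profile-page')) as $node) {
    $texts[] = $node->textContent;
}

// Prints "A,C": the first and third divs match, the footer div does not.
echo implode(',', $texts), "\n";
```

With the exact comparison used in getClass(), only the third div in this sample would match.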

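If you wish to display this in a WordPress widget, create a Text widget and add the following code, replacing “NICKNAME” with a DataCamp username. The path assumes that datacamp.php has been uploaded to a /datacamp/ directory at the root of the site, and that jQuery is loaded on the page.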

<div id="datacamp"></div>
<script>
(function($) {
$("#datacamp").load("/datacamp/datacamp.php?nick=NICKNAME");
})(jQuery);
</script>
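
Because the widget requests datacamp.php on every page view, the one-day cache file is what keeps the blog from fetching the DataCamp profile on each visit. The freshness test can be sketched on its own as below. This is only an illustration: isCacheFresh() is a hypothetical helper that uses filemtime() where the script reads $stats[9], and the temporary file stands in for the real cache file.

```php
<?php
// Hypothetical helper, not part of the original script: reports whether
// $filename exists and was modified less than $maxAgeSeconds before $now.
function isCacheFresh(string $filename, int $maxAgeSeconds, ?int $now = null): bool
{
    $now = $now ?? time();
    if (!is_file($filename)) {
        return false;
    }
    $modified = filemtime($filename);
    return $modified !== false && $modified > ($now - $maxAgeSeconds);
}

// A temporary file stands in for datacamp_NICKNAME.txt.
$cacheFile = tempnam(sys_get_temp_dir(), 'datacamp_');
file_put_contents($cacheFile, '<div>cached profile</div>');

$oneDay = 86400; // seconds in one day, as in the script above

$fresh   = isCacheFresh($cacheFile, $oneDay);                        // just written
$stale   = isCacheFresh($cacheFile, $oneDay, time() + 2 * $oneDay);  // two days later
$missing = isCacheFresh($cacheFile . '.missing', $oneDay);           // no such file

var_dump($fresh, $stale, $missing);

unlink($cacheFile);
```

Passing the current time as a parameter makes the stale case easy to try without waiting a day.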

Since my primary reason for writing these has been to populate widgets on this WordPress blog, at some point I’ll probably incorporate these into WP plugins.
