Using Ruby to Get All Links from a Sitemap XML File

I was looking at the cached pages on the Wayback Machine and decided to find out if that site had an API. (It does.) I wanted to find a way to use this API to submit links to the Wayback Machine database. As it turns out, a Ruby gem called WaybackArchiver has already been written for just this purpose!

Now that I had an easy way to submit multiple URLs, I got to thinking about how this could be more easily automated. What if I had a sitemap that contained all of the links I wanted to submit? What if there is already a gem to parse a sitemaps.org-compliant sitemap, such as the one on WordPress sites that use the Google Sitemap Generator Plugin?

a sitemap.xml file
My WordPress sitemap, generated by the Google Sitemap Generator Plugin

Even though all of these pieces exist, I could not find anywhere they had been combined, so I did just that. The Ruby script I wrote relies on the following libraries:
1. WaybackArchiver, for submitting links, sitemaps, or pages to the Wayback Machine
2. Sitemap-Parser, for parsing sitemaps
3. OpenURI, for opening URLs (it ships with Ruby, so it needs no separate install)
4. Nokogiri, for parsing XML (it also parses HTML and offers SAX and Reader interfaces)

If these gems are not installed already, you can install them from the command line with “gem install” followed by the name of each gem (wayback_archiver, sitemap-parser, and nokogiri).

The script, which I named map.rb, is below.

require 'wayback_archiver'
require 'sitemap-parser'
require 'open-uri'
require 'nokogiri'

main_sitemap_url = ARGV[0]
unless main_sitemap_url.nil?
  puts 'Running...'

  # Kernel#open no longer opens URLs as of Ruby 3.0, so use URI.open from open-uri.
  # Nokogiri::HTML is used on purpose: the HTML parser does not apply the
  # sitemap's XML namespace, so the plain XPath below matches.
  main_sitemap = Nokogiri::HTML(URI.open(main_sitemap_url))

  # Each <sitemap><loc> node in the index points to a sub-sitemap.
  main_sitemap.xpath('//sitemap/loc').each do |node|
    sub_sitemap = SitemapParser.new(node.content)

    # Submit every URL listed in the sub-sitemap to the Wayback Machine.
    sub_sitemap.to_a.each do |url|
      WaybackArchiver.archive(url, :url)
    end
  end
end
puts 'Finished.'

This script works unaltered with WordPress sitemaps created by the plugin mentioned above. The plugin actually produces a sitemap index that links to individual sitemaps, so each of the linked sitemaps is parsed as well. The script can be run from the command line with “ruby map.rb URL”, substituting the URL of the sitemap.
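To make the two-level structure concrete, here is a minimal sketch that walks a sitemap index and then each sitemap it links to. It is not the script above: it uses REXML, which is installed with Ruby, instead of Nokogiri and SitemapParser so that it runs without extra gems, and it walks child elements by tag name rather than using XPath. It also reads from a hypothetical in-memory hash (DOCUMENTS, with made-up example.com URLs) instead of fetching anything over the network. The helper names locs and all_page_urls are my own.

```ruby
require 'rexml/document'

# Hypothetical stand-in for the network: maps a URL to the XML served there.
DOCUMENTS = {
  'https://example.com/sitemap.xml' => <<~XML,
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
      <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
    </sitemapindex>
  XML
  'https://example.com/sitemap-posts.xml' => <<~XML,
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>https://example.com/post-1/</loc></url>
      <url><loc>https://example.com/post-2/</loc></url>
    </urlset>
  XML
  'https://example.com/sitemap-pages.xml' => <<~XML
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>https://example.com/about/</loc></url>
    </urlset>
  XML
}.freeze

# Returns the text of every <loc> found under a direct child of the root
# element whose tag name is parent_tag (for example 'sitemap' or 'url').
def locs(xml, parent_tag)
  root = REXML::Document.new(xml).root
  root.elements.to_a
      .select { |child| child.name == parent_tag }
      .flat_map { |child| child.elements.to_a.select { |e| e.name == 'loc' } }
      .map { |loc| loc.text.strip }
end

# First level: sub-sitemap URLs from the index.
# Second level: page URLs from each sub-sitemap.
def all_page_urls(index_url)
  locs(DOCUMENTS.fetch(index_url), 'sitemap').flat_map do |sub_url|
    locs(DOCUMENTS.fetch(sub_url), 'url')
  end
end

puts all_page_urls('https://example.com/sitemap.xml')
```

In the real script, each URL collected this way is what gets handed to WaybackArchiver.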

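Should your sitemap be organized differently, this code may not work as-is. For example, if your sitemap lists pages directly instead of linking to sub-sitemaps, the XPath expression will need to change to match the node tag names your sitemap uses.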
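As a sketch of the kind of change that might be needed, the snippet below looks at the root element to decide which tag names to read. Under the sitemaps.org protocol, a file is either a sitemapindex (with sitemap/loc entries) or a urlset (with url/loc entries). Again, this uses REXML and a made-up sample document rather than the gems from the script, and the names CHILD_TAG_FOR_ROOT and classify_sitemap are hypothetical.

```ruby
require 'rexml/document'

# Maps each sitemaps.org root element to the child tag that wraps <loc>.
CHILD_TAG_FOR_ROOT = { 'sitemapindex' => 'sitemap', 'urlset' => 'url' }.freeze

# Returns [kind, locs]. kind is :index when the locs are further sitemaps
# to parse, or :pages when the locs are the page URLs themselves.
def classify_sitemap(xml)
  root = REXML::Document.new(xml).root
  child_tag = CHILD_TAG_FOR_ROOT.fetch(root.name) do
    raise ArgumentError, "unexpected root element <#{root.name}>"
  end

  locs = root.elements.to_a
             .select { |child| child.name == child_tag }
             .flat_map { |child| child.elements.to_a }
             .select { |element| element.name == 'loc' }
             .map { |element| element.text.strip }

  [root.name == 'sitemapindex' ? :index : :pages, locs]
end

# A flat sitemap that lists pages directly, with no index in front of it.
flat = <<~XML
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://example.com/</loc><lastmod>2016-01-01</lastmod></url>
    <url><loc>https://example.com/about/</loc></url>
  </urlset>
XML

kind, urls = classify_sitemap(flat)
puts "#{kind}: #{urls.join(', ')}"
```

With a check like this, a script could archive the URLs immediately when the kind is :pages and descend one more level only when it is :index.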

Other pages that were helpful in developing this tool:
Looping through each xml node (on Stack Overflow)
Web Scraping with Ruby and Nokogiri for Beginners
Parsing an HTML/XML Document

Some Good Resources on Database Performance Optimization and Tuning

Database performance optimization is an important skill for both database administrators and database developers. I have collected a few links here on database tuning and query optimization for various database systems, both relational and non-relational. I’ll update this list from time to time as I come across new resources.

Microsoft SQL Server:

Optimizing Databases
SQL Database Performance Tuning for Developers

Oracle Database:

Database SQL Tuning Guide

PostgreSQL:

Tuning Your PostgreSQL Server

MySQL:

How to optimize a MySQL database

MongoDB:

Optimization Strategies for MongoDB

Fun with CFAJAXPROXY on ColdFusion 11

In the course of moving an application from ColdFusion 8 to ColdFusion 11, I came across some strange behavior caused by the CFAJAXPROXY tag.

In CF8, a particular CFCOMPONENT called ProjectBeanService had proxies set up for its methods in the rendered JavaScript like this:

var _cf_ProjectBeanService = ColdFusion.AjaxProxy.init('/components/ProjectBeanService.cfc', 'ProjectBeanService');
_cf_ProjectBeanService.prototype.get = function(sPropertyName, sBeanType, nID, sSection, nRevision) { return ColdFusion.AjaxProxy.invoke(this, "get",  {sPropertyName:sPropertyName, sBeanType:sBeanType, nID:nID, sSection:sSection, nRevision:nRevision});};
_cf_ProjectBeanService.prototype.getAll = function(sBeanType, nID, sSection, nRevision) { return ColdFusion.AjaxProxy.invoke(this, "getAll", {sBeanType:sBeanType, nID:nID, sSection:sSection, nRevision:nRevision});};
_cf_ProjectBeanService.prototype.set = function(sPropertyName, oPropertyValue, sBeanType, nID, sSection, nRevision) { return ColdFusion.AjaxProxy.invoke(this, "set", {sPropertyName:sPropertyName, oPropertyValue:oPropertyValue, sBeanType:sBeanType, nID:nID, sSection:sSection, nRevision:nRevision});};

However, in CF11, more proxies were created:

/* <![CDATA[ */
var _cf_ProjectBeanService = ColdFusion.AjaxProxy.init('/components/ProjectBeanService.cfc', 'ProjectBeanService');
_cf_ProjectBeanService.prototype.get = function(sPropertyName, sBeanType, nID, sSection, nRevision) { return ColdFusion.AjaxProxy.invoke(this, "get",  {sPropertyName:sPropertyName, sBeanType:sBeanType, nID:nID, sSection:sSection, nRevision:nRevision});};
_cf_ProjectBeanService.prototype.getAll = function(sBeanType, nID, sSection, nRevision) { return ColdFusion.AjaxProxy.invoke(this, "getAll", {sBeanType:sBeanType, nID:nID, sSection:sSection, nRevision:nRevision});};
_cf_ProjectBeanService.prototype.set = function(sPropertyName, oPropertyValue, sBeanType, nID, sSection, nRevision) { return ColdFusion.AjaxProxy.invoke(this, "set", {sPropertyName:sPropertyName, oPropertyValue:oPropertyValue, sBeanType:sBeanType, nID:nID, sSection:sSection, nRevision:nRevision});};
_cf_ProjectBeanService.prototype.get = function(sBeanName, sPropertyName) { return ColdFusion.AjaxProxy.invoke(this, "get","4789898A8974AC60", {sBeanName:sBeanName, sPropertyName:sPropertyName});};
_cf_ProjectBeanService.prototype.destroySessionBean = function(sBeanName) { return ColdFusion.AjaxProxy.invoke(this, "destroySessionBean", "4789898A8974AC60", {sBeanName:sBeanName});};
_cf_ProjectBeanService.prototype.createSessionBean = function(sBeanName, sBeanType, sDAOName) { return ColdFusion.AjaxProxy.invoke(this, "createSessionBean", "4789898A8974AC60", {sBeanName:sBeanName, sBeanType:sBeanType, sDAOName:sDAOName});};
_cf_ProjectBeanService.prototype.getAll = function(sBeanName) { return ColdFusion.AjaxProxy.invoke(this, "getAll", "4789898A8974AC60", {sBeanName:sBeanName});};
_cf_ProjectBeanService.prototype.getSessionBean = function(sBeanName) { return ColdFusion.AjaxProxy.invoke(this, "getSessionBean","4789898A8974AC60", {sBeanName:sBeanName});};
_cf_ProjectBeanService.prototype.set = function(sBeanName, sPropertyName, oPropertyValue) { return ColdFusion.AjaxProxy.invoke(this, "set", "4789898A8974AC60", {sBeanName:sBeanName, sPropertyName:sPropertyName, oPropertyValue:oPropertyValue});};
_cf_ProjectBeanService.prototype.reInitSessionBean = function(sBeanName, argument1, argument2, argument3, argument4) { return ColdFusion.AjaxProxy.invoke(this, "reInitSessionBean", "4789898A8974AC60", {sBeanName:sBeanName, argument1:argument1, argument2:argument2, argument3:argument3, argument4:argument4});};
/* ]]> */

This made no sense to me, as the ProjectBeanService class declared only the three remote methods that were proxied in CF8. I looked at the ProjectBeanService.cfc file:

<cfcomponent displayname = "ProjectBeanService" extends = "com.AjaxBeanService">

    <cffunction name = "getBean" access = "private" returntype = "any">
        <cfargument name = "sBeanType" type = "string" required = "yes">
        <cfargument name = "nID" type = "numeric" required = "yes" hint = "ProjectID or ImpactID">
        <cfargument name = "sSection" type = "string" required = "no" hint = "ProjectSection or ImpactSection" default = "">
        <cfargument name = "nRevision" type = "numeric" required = "no" hint = "Commitment Revision" default = "0">

        <cfset var oBean = createObject("component","com." & sBeanType).init(nID,sSection,nRevision)  />

        <cfreturn oBean />

    </cffunction>

    <cffunction name = "set" access = "remote" returntype = "void">
        <cfargument name = "sPropertyName" type = "string" required = "yes">
        <cfargument name = "oPropertyValue" type = "string" required = "yes">
        <cfargument name = "sBeanType" type = "string" required = "yes">
        <cfargument name = "nID" type = "numeric" required = "yes">
        <cfargument name = "sSection" type = "string" required = "no" default = "">
        <cfargument name = "nRevision" type = "numeric" required = "no" default = "0">
        <cfset var oBean = StructNew() />
        <cftry>

            <cfset oBean = getBean(sBeanType, nID, sSection,nRevision) />
            <cfset oBean.set(sPropertyName,oPropertyValue) />
            <cfcatch type = "any">
                 <cfset sendError(cfcatch.ErrorCode,cfcatch.message) />
            </cfcatch>
        </cftry>
     </cffunction>

    <cffunction name = "get" access = "remote" returntype = "any">
        <cfargument name = "sPropertyName" type = "string" required = "yes">
        <cfargument name = "sBeanType" type = "string" required = "yes">
        <cfargument name = "nID" type = "numeric" required = "yes">
        <cfargument name = "sSection" type = "string" required = "no" default = "">
        <cfargument name = "nRevision" type = "numeric" required = "no" default = "0">

        <cfset var value = "" />
        <cfset var oBean = StructNew() />

        <cftry>
            <cfset oBean = getBean(sBeanType,nID,sSection,nRevision) />
            <cfset value = oBean.get(sPropertyName) />
            <cfreturn value />

            <cfcatch type = "any">
                    <cfset sendError(cfcatch.ErrorCode,cfcatch.message) />
            </cfcatch>
        </cftry>
     </cffunction>

     <cffunction name = "getAll" access = "remote" returntype = "struct">
        <cfargument name = "sBeanType" type = "string" required = "no" default = "ProjectBean">
        <cfargument name = "nID" type = "numeric" required = "yes">
        <cfargument name = "sSection" type = "string" required = "no" default = "">
        <cfargument name = "nRevision" type = "numeric" required = "no" default = "0">

        <cfset var oBean = StructNew() />
        <cfset var oStruct = structNew() />

        <cftry>
            <cfset oBean = getBean(sBeanType,nID,sSection,nRevision) />
            <cfset oStruct = oBean.getAll() />
            <cfreturn oStruct />

            <cfcatch type = "any">
                  <cfset sendError(cfcatch.ErrorCode,cfcatch.message) />
            </cfcatch>
        </cftry>
     </cffunction>

</cfcomponent>

I saw that this class extended another class, AjaxBeanService:

<cfcomponent displayname = "AjaxBeanService" extends = "com.AbstractAjax">

    <cffunction name = "createSessionBean" access = "remote" returntype = "struct">
        <cfargument name = "sBeanName" type = "string" required = "yes">
        <cfargument name = "sBeanType" type = "string" required = "yes">
        <cfargument name = "sDAOName" type = "string" required = "yes">
        <cfset var oBean = StructNew() />
        <cfset var oBeanArguments = ARGUMENTS />
        <cfset var oDAO = application[sDAOName] />

        <cftry>
            <cfset oBean = createObject("component","com." & sBeanType) />

             <!--- delete first 3 elements from arguments array --->
            <cfset ArrayDeleteAt(oBeanArguments,1) />
            <cfset ArrayDeleteAt(oBeanArguments,1) />
            <cfset ArrayDeleteAt(oBeanArguments,1) />

            <!--- make the DAO object the first argument --->
            <cfset ArrayPrepend(oBeanArguments,oDAO) />

            <cfset oBean.init.apply(oBean,oBeanArguments) />
            <cfset SESSION.beans[sBeanName] = oBean />
            <cfreturn oBean.getAll() />
            <cfcatch type = "any">
                <cfset sendError(cfcatch.ErrorCode,cfcatch.message) />
            </cfcatch>
        </cftry>
    </cffunction>

    <cffunction name = "destroySessionBean" access = "remote" returntype = "struct">
        <cfargument name = "sBeanName" type = "string" required = "yes">
        <cfset rc = StructDelete(SESSION.beans, "#sBeanName#", "True")>
    </cffunction>

    <cffunction name = "reInitSessionBean" access = "remote" returntype = "struct">
        <cfargument name = "sBeanName" type = "string" required = "yes">
        <cfargument name = "argument1" type = "any" required = "no" default = "">
        <cfargument name = "argument2" type = "any" required = "no" default = "">
        <cfargument name = "argument3" type = "any" required = "no" default = "">
        <cfargument name = "argument4" type = "any" required = "no" default = "">
       <cfset var oBean = StructNew() />
        <cftry>
            <cfset oBean = getSessionBean(sBeanName) />
            <cfset oBean.init(oBean.getDAO(),argument1,argument2,argument3,argument4) />
            <cfset SESSION.beans[sBeanName] = oBean />
            <cfreturn oBean.getAll() />
            <cfcatch type = "any">
                <cfset sendError(cfcatch.ErrorCode,cfcatch.message) />
            </cfcatch>
        </cftry>
    </cffunction>

    <cffunction name = "getSessionBean" access = "remote" returntype = "any">
        <cfargument name = "sBeanName" type = "string" required = "yes">
        <cfset var oBean = StructNew() />
        <cfif StructKeyExists(SESSION.beans,sBeanName) >
            <cflock scope = "session" type = "readonly" timeout = "5" throwontimeout = "yes">
                <cfset oBean = Duplicate(SESSION.beans[sBeanName]) />
            </cflock>
        </cfif>
        <cfif StructIsEmpty(oBean)>
            <cfthrow errorcode = "500" message = "No bean found by the name '#sBeanname#'" />
        <cfelse>
            <cfreturn oBean />
        </cfif>
    </cffunction>

    <cffunction name = "set" access = "remote" returntype = "void">
        <cfargument name = "sBeanName" type = "string" required = "yes">
        <cfargument name = "sPropertyName" type = "string" required = "yes">
        <cfargument name = "oPropertyValue" type = "string" required = "yes">
        <cfset var oBean = StructNew() />
        <cftry>
            <cfset oBean = getSessionBean(sBeanName) />
            <cfset oBean.set(sPropertyName,oPropertyValue) />
            <cfcatch type = "any">
                 <cfset sendError(cfcatch.ErrorCode,cfcatch.message) />
            </cfcatch>
        </cftry>
     </cffunction>

    <cffunction name = "get" access = "remote" returntype = "any">
        <cfargument name = "sBeanName" type = "string" required = "yes">
        <cfargument name = "sPropertyName" type = "string" required = "yes">

        <cfset var value = "" />
        <cfset var oBean = StructNew() />

        <cftry>
            <cfset oBean = getSessionBean(sBeanName) />
            <cfset value = oBean.get(sPropertyName) />
            <cfreturn value />

            <cfcatch type = "any">
                    <cfset sendError(cfcatch.ErrorCode,cfcatch.message) />
            </cfcatch>
        </cftry>
     </cffunction>

     <cffunction name = "getAll" access = "remote" returntype = "struct">
        <cfargument name = "sBeanName" type = "string" required = "yes">

        <cfset var oBean = StructNew() />
        <cfset var oStruct = structNew() />

        <cftry>
            <cfset oBean = getSessionBean(sBeanName) />
            <cfset oStruct = oBean.getAll() />
            <cfreturn oStruct />

            <cfcatch type = "any">
                  <cfset sendError(cfcatch.ErrorCode,cfcatch.message) />
            </cfcatch>
        </cftry>
     </cffunction>

</cfcomponent>

This class contained the methods that were showing up in the rendered code on CF11.

I found that at least part of this issue was apparently a bug in CF8 that was corrected in CF9. In CF8, CFAJAXPROXY did not give a child class access to the methods of its parent class, even when the parent class methods were marked as having “remote” access. CF9 and subsequent versions do expose them. Source (in comments): Ask a Jedi: ColdFusion Ajax example of retrieving fields of data (2)

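However, that did not explain why the parent class method proxies were being rendered AFTER the child class method proxies. Each proxy is a plain assignment to _cf_ProjectBeanService.prototype, so a later assignment to the same name (get, getAll, or set) replaces the earlier one. As a result, the parent's versions overwrote the child's, preventing the child from overriding the parent class methods.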

So far, the only fix I have found is to change the access attribute of the parent class methods from “remote” to “public”. This seems like a poor workaround and may yet have unintended consequences, but short of writing new JavaScript to avoid CFAJAXPROXY altogether, it may be the best solution for now.

Incidentally, an article on doing that very thing is here, in case anyone needs it: Creating A Remote AJAX Proxy In Javascript Without ColdFusion 8’s CFAjaxProxy

If anyone has any better suggestions on how to fix this problem, I’ve submitted it as a question on Stack Overflow.

my Stack Overflow question