Using Ruby to Get All Links from a Sitemap XML File

I was looking at the cached pages on the Wayback Machine and decided to find out if that site had an API. (It does.) I wanted to find a way to use this API to submit links to the Wayback Machine database. As it turns out, a Ruby gem called WaybackArchiver has already been written for just this purpose!

Now that I had an easy way to submit multiple URLs, I got to thinking about how this could be automated more easily. What if I had a sitemap that contained all of the links I wanted to submit? And what if there were already a gem to parse a sitemaps.org-compliant sitemap, such as the one on WordPress sites that use the Google Sitemap Generator Plugin?

[Image: a sitemap.xml file. Caption: My WordPress sitemap, generated by the Google Sitemap Generator Plugin]

Even though all of these things exist, I had not found anything that combines them, so I did just that. The Ruby script I wrote requires several libraries:
1. WaybackArchiver, for submitting links, sitemaps, or pages to the Wayback Machine
2. Sitemap-Parser, for parsing sitemaps
3. OpenURI, for opening URLs (this one ships with Ruby's standard library, so it does not need to be installed separately)
4. Nokogiri, for parsing XML and HTML documents (it also offers SAX and Reader interfaces)

If these gems are not installed already, you can install them from the command line with “gem install” followed by the name of each gem, for example “gem install wayback_archiver sitemap-parser nokogiri”.

The script, which I named map.rb, is below.

require 'wayback_archiver'
require 'sitemap-parser'
require 'open-uri'
require 'nokogiri'

main_sitemap_url = ARGV[0]
abort 'Usage: ruby map.rb SITEMAP_URL' if main_sitemap_url.nil?

puts 'Running...'

# Fetch and parse the sitemap index as XML. URI.open is used because
# Kernel#open no longer accepts URLs on Ruby 3.0 and later.
main_sitemap = Nokogiri::XML(URI.open(main_sitemap_url))

# Drop the sitemaps.org default namespace so the plain XPath below matches.
main_sitemap.remove_namespaces!

# Each <sitemap><loc> entry in the index points to a sub-sitemap.
main_sitemap.xpath('//sitemap/loc').each do |node|
  sub_sitemap = SitemapParser.new(node.content.strip)

  # Submit every URL listed in the sub-sitemap to the Wayback Machine.
  sub_sitemap.to_a.each do |url|
    WaybackArchiver.archive(url, :url)
  end
end

puts 'Finished.'

This script works unaltered with WordPress sitemaps created by the plugin mentioned above. The plugin actually produces a sitemap index that links to individual sitemaps, so each of the linked sitemaps is parsed as well. The script can be run from the command line with “ruby map.rb URL”, substituting the URL of the sitemap.
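To make the sitemap index structure concrete, here is a small sketch that pulls the sub-sitemap URLs out of an index. It is not part of map.rb: the sample XML and the example.com URLs are made up, the helper name sub_sitemap_urls is hypothetical, and it uses REXML (which ships with Ruby) instead of Nokogiri so it can run without installing anything.

```ruby
require 'rexml/document'

# Made-up sample data standing in for a real sitemap index.
SAMPLE_INDEX = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
    <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
  </sitemapindex>
XML

# Hypothetical helper: return the <loc> value of every <sitemap> entry
# found directly under the root of a sitemap index document.
def sub_sitemap_urls(xml)
  doc = REXML::Document.new(xml)
  urls = []
  doc.root.each_element do |entry|
    # Element#name is the tag name without any namespace prefix.
    next unless entry.name == 'sitemap'

    entry.each_element do |child|
      urls << child.text.to_s.strip if child.name == 'loc'
    end
  end
  urls
end

puts sub_sitemap_urls(SAMPLE_INDEX)
```

Each URL this returns is what map.rb hands to SitemapParser in its inner loop.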

If your sitemap is organized differently, this code may not work as-is; the XPath expression may need to change to match the node tag names in your sitemap.
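One way to cope with a differently organized sitemap is to look at the root tag first and then choose which entries to read. The sketch below assumes only the two layouts defined by sitemaps.org (a sitemapindex containing sitemap entries, or a urlset containing url entries). The helper name sitemap_locs, the sample XML, and the example.com URLs are all made up, and REXML is again used in place of Nokogiri so the example runs with a stock Ruby install.

```ruby
require 'rexml/document'

# Maps each sitemaps.org root tag to the tag that wraps its <loc> entries.
ENTRY_TAG_FOR_ROOT = { 'sitemapindex' => 'sitemap', 'urlset' => 'url' }.freeze

# Hypothetical helper: return [root_tag_name, array_of_loc_values].
# Raises ArgumentError when the root tag is not a known sitemap layout.
def sitemap_locs(xml)
  root = REXML::Document.new(xml).root
  entry_tag = ENTRY_TAG_FOR_ROOT.fetch(root.name) do
    raise ArgumentError, "Unrecognized sitemap root: <#{root.name}>"
  end

  locs = []
  root.each_element do |entry|
    next unless entry.name == entry_tag

    # Other children, such as <lastmod>, are ignored.
    entry.each_element do |child|
      locs << child.text.to_s.strip if child.name == 'loc'
    end
  end
  [root.name, locs]
end

# Made-up sample data: a plain sitemap with no index in front of it.
SAMPLE_URLSET = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://example.com/</loc><lastmod>2015-01-01</lastmod></url>
    <url><loc>https://example.com/about/</loc></url>
  </urlset>
XML

kind, locs = sitemap_locs(SAMPLE_URLSET)
puts "#{kind}: #{locs.join(', ')}"
```

With a check like this, a urlset could be submitted directly, while a sitemapindex would go through the nested loop used in map.rb.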

Other pages that were helpful in developing this tool:
Looping through each xml node (on Stack Overflow)
Web Scraping with Ruby and Nokogiri for Beginners
Parsing an HTML/XML Document
