Evergreen Crawling Using XML Sitemaps

An approach to updating a crawl file by crawling only new and/or updated URLs.
Tags: crawling, scraping, advertools, sitemaps
Author: Elias Dabbas

Published: July 26, 2024

This is a way to keep an updated crawl of a website: crawl the whole website once, and afterwards crawl only the pages that have changed. The approach uses XML sitemaps and assumes that they have accurate lastmod tags, which provide one of the two conditions for re-crawling a URL; the other is new URLs appearing in the sitemap.

The Plan

  1. Download a sitemap
  2. Crawl its URLs
  3. Save the sitemap to a CSV file
  4. After a month/week/day, download the same sitemap again
  5. Find URLs to crawl:
    1. New URLs not found in the last_sitemap
    2. URLs that exist in both, but with a different lastmod
  6. Crawl the new URLs
  7. Save the current_sitemap and overwrite (optionally rename) the last_sitemap
  8. Repeat, or
  9. Scale the process across multiple sitemaps and websites

Download a sitemap

The steps listed here are done only once, the first time we crawl the full list of URLs found in the XML sitemap.

import advertools as adv
import pandas as pd

sitemap_url = 'YOUR_SITEMAP_URL'
last_sitemap = adv.sitemap_to_df(sitemap_url)
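
Since the whole approach hinges on accurate lastmod values, it can be worth a quick look at whether that column is actually populated before relying on it (an optional check, not part of the original steps):

# Optional sanity check: is the lastmod column present and populated?
print(last_sitemap.shape[0], 'URLs in the sitemap')
if 'lastmod' in last_sitemap.columns:
    print(last_sitemap['lastmod'].isna().sum(), 'URLs without a lastmod value')
else:
    print('No lastmod tags found; only new URLs will trigger re-crawling')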

Crawl the sitemap’s URLs

The example here shows a simple list-mode crawl with no options. You can of course customize it as you want.

adv.crawl(last_sitemap['loc'], 'output_file.jsonl')
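
As an example of such customization, adv.crawl also accepts a follow_links argument and a custom_settings dictionary of Scrapy settings. A sketch with a few commonly used options (the values are placeholders to adapt):

# The same crawl with a few illustrative settings
adv.crawl(
    last_sitemap['loc'],
    'output_file.jsonl',
    follow_links=False,  # list mode: crawl only the supplied URLs
    custom_settings={
        'LOG_FILE': 'output_file.log',    # keep a log of the crawl
        'AUTOTHROTTLE_ENABLED': True,     # slow down based on server responses
        'USER_AGENT': 'YOUR_USER_AGENT',  # identify your crawler
    })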

Save the sitemap to disk

last_sitemap.to_csv('last_sitemap.csv', index=False)

Periodic process

The following steps are repeated periodically, on a monthly, weekly, or even daily basis. First, we read the last_sitemap into a DataFrame.

last_sitemap = pd.read_csv('last_sitemap.csv')

Re-download the sitemap

Note that last_sitemap is read from disk, so its lastmod column comes back as strings rather than datetime objects. We could use the parse_dates option of the pandas.read_csv function, but the parsed dates might have a different representation from the one in the freshly downloaded sitemap, so the easiest approach here is to compare both columns as strings.

So, we convert the lastmod column of the newly downloaded sitemap to str.

current_sitemap = adv.sitemap_to_df(sitemap_url)
current_sitemap['lastmod'] = current_sitemap['lastmod'].astype(str)
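
If you want to confirm that the two columns are now directly comparable, a quick optional check (not part of the original steps):

# Both lastmod columns should now be plain strings (dtype 'object'),
# so the equality comparison below behaves predictably
print(last_sitemap['lastmod'].dtype)
print(current_sitemap['lastmod'].dtype)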

Find new URLs to crawl

  1. Find URLs in current_sitemap that are not in last_sitemap:

new_urls = set(current_sitemap['loc']).difference(last_sitemap['loc'])

  2. Get URLs present in both DataFrames but with a different lastmod:
# Inner join on the shared 'loc' column keeps only URLs present in both
# sitemaps, then flag the ones whose lastmod value has changed
lastmod_changed_urls = (
    pd.merge(
        last_sitemap[['loc', 'lastmod']],
        current_sitemap[['loc', 'lastmod']].rename(columns={'lastmod': 'lastmod2'}))
    .assign(lastmod_changed=lambda df: df['lastmod2'].ne(df['lastmod']))
    .query('lastmod_changed == True')
    ['loc'].tolist())
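
If you prefer plain Python over a merge, the same list can be built with a dictionary lookup; an equivalent sketch, assuming loc values are unique within each sitemap:

# Map each previously seen URL to its old lastmod, then compare
last_lastmod = dict(zip(last_sitemap['loc'], last_sitemap['lastmod']))
lastmod_changed_urls_alt = [
    loc for loc, lastmod in zip(current_sitemap['loc'], current_sitemap['lastmod'])
    if loc in last_lastmod and lastmod != last_lastmod[loc]
]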

Combine the URLs to crawl and crawl them

urls_to_crawl = lastmod_changed_urls + list(new_urls)
adv.crawl(urls_to_crawl, 'output_file.jsonl')

Save the current_sitemap to disk, overwriting the old sitemap

current_sitemap.to_csv('last_sitemap.csv', index=False)
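
Step 7 of the plan mentions optionally renaming the old sitemap instead of simply overwriting it. One way to keep dated copies, using only the standard library (a minimal sketch, not part of the original code):

from datetime import date
from pathlib import Path

# Optional: keep a dated copy of the previous sitemap before overwriting it
old_sitemap_file = Path('last_sitemap.csv')
if old_sitemap_file.exists():
    old_sitemap_file.rename(f'last_sitemap_{date.today().isoformat()}.csv')
current_sitemap.to_csv('last_sitemap.csv', index=False)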

Putting it all together

One-off sitemap and crawl code:

import advertools as adv

sitemap_url = 'YOUR_SITEMAP_URL'
last_sitemap = adv.sitemap_to_df(sitemap_url)
adv.crawl(last_sitemap['loc'], 'output_file.jsonl')
last_sitemap.to_csv('last_sitemap.csv', index=False)

Periodic update

evergreen_crawling.py
import advertools as adv
import pandas as pd

sitemap_url = 'YOUR_SITEMAP_URL'
last_sitemap = pd.read_csv('last_sitemap.csv')
current_sitemap = adv.sitemap_to_df(sitemap_url)
current_sitemap['lastmod'] = current_sitemap['lastmod'].astype(str)
new_urls = set(current_sitemap['loc']).difference(last_sitemap['loc'])

# Inner join on the shared 'loc' column keeps only URLs present in both
# sitemaps, then flag the ones whose lastmod value has changed
lastmod_changed_urls = (
    pd.merge(
        last_sitemap[['loc', 'lastmod']],
        current_sitemap[['loc', 'lastmod']].rename(columns={'lastmod': 'lastmod2'}))
    .assign(lastmod_changed=lambda df: df['lastmod2'].ne(df['lastmod']))
    .query('lastmod_changed == True')
    ['loc'].tolist())

urls_to_crawl = lastmod_changed_urls + list(new_urls)
adv.crawl(urls_to_crawl, 'output_file.jsonl')
current_sitemap.to_csv('last_sitemap.csv', index=False)

Automating the process

Assuming you’ve done the first part and now have a last_sitemap.csv file as well as an output_file.jsonl, you can create a cron job to run the second part periodically, for example:

crontab -e

Then add the following line to the end of the file:

@weekly /path/to/python evergreen_crawling.py

Learn more about automating Python scripts with cron if you are interested.

Scaling the process across multiple sitemaps and websites

The two main variables here are the XML sitemap URL and the name of the website. The name is something you assign yourself to differentiate each website’s files. So, if your sitemap URL is https://example.com/sitemap.xml, the name could be “example”. It will be used in the website’s file names, {name}_crawl.jsonl for example.

import advertools as adv
import pandas as pd

# Create a list of tuples of sitemap URLs and website names
sitemap_name = [
    ('https://example_1.com/sitemap.xml', 'example_1'),
    ('https://example_2.com/sitemap.xml', 'example_2'),
    ('https://example_3.com/sitemap.xml', 'example_3'),
]

# Define the full path where the files are going to be stored
path = '/home/full/path/to/evergreen_crawling'

# Run a for-loop to go through all combinations and run the same process
for sitemap_url, name in sitemap_name:
    print(sitemap_url, name)
    try:
        last_sitemap = pd.read_csv(f'{path}/{name}_last_sitemap.csv')

        current_sitemap = adv.sitemap_to_df(sitemap_url)
        current_sitemap['lastmod'] = current_sitemap['lastmod'].astype(str)
        new_urls = set(current_sitemap['loc']).difference(last_sitemap['loc'])

        # Inner join on the shared 'loc' column keeps only URLs present in both
        # sitemaps, then flag the ones whose lastmod value has changed
        lastmod_changed_urls = (
            pd.merge(
                last_sitemap[['loc', 'lastmod']],
                current_sitemap[['loc', 'lastmod']].rename(columns={'lastmod': 'lastmod2'}))
            .assign(lastmod_changed=lambda df: df['lastmod2'].ne(df['lastmod']))
            .query('lastmod_changed == True')
            ['loc'].tolist())

        urls_to_crawl = lastmod_changed_urls + list(new_urls)
        if urls_to_crawl:
            adv.crawl(
                urls_to_crawl,
                f'{path}/{name}_crawl.jsonl',
                custom_settings={
                    'LOG_FILE': f'{name}_crawl.log',
                    'AUTOTHROTTLE_ENABLED': True,
                })
        current_sitemap.to_csv(f'{path}/{name}_last_sitemap.csv', index=False)
    except Exception as e:
        with open(f'{path}/evergreen_{name}_errors.txt', 'at') as file:
            print(str(e), file=file)
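
To check how a given website’s crawl file is growing, the JSON lines output can be read back with pandas. A small sketch, using the placeholder name example_1 from the list above:

import pandas as pd

crawl_df = pd.read_json(f'{path}/example_1_crawl.jsonl', lines=True)
print(crawl_df.shape[0], 'rows (crawled pages, including re-crawls)')
print(crawl_df['url'].nunique(), 'unique URLs')

# Because updated pages are appended to the same file, a URL can appear more
# than once; the last occurrence in the file is the most recent crawl
latest = crawl_df.drop_duplicates(subset='url', keep='last')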

Automation

Save this code to a file, for example evergreen_crawling.py, and you can run it on a daily basis:

@daily PATH=/path/to/env/bin; /path/to/env/bin/python /path/to/evergreen_crawling.py