This is a way to keep an updated crawl of a website by crawling the whole website once, and then only crawling pages that have changed. This approach uses XML sitemaps and assumes that they have accurate lastmod tags. A changed lastmod is one of the two conditions for re-crawling a URL, the other being a new URL appearing in the sitemap.
The Plan
- Download a sitemap
- Crawl its URLs
- Save the sitemap to a CSV file
- After a month/week/day, download the same sitemap again
- Find URLs to crawl:
  - New URLs not found in the last_sitemap
  - URLs that exist in both, but with a different lastmod
- Crawl the new URLs
- Save the current_sitemap and overwrite (optionally rename) the last_sitemap
- Repeat OR
- Scale
Download a sitemap
The steps listed here are to be done only once, the first time we crawl the full list of URLs found in the XML sitemap.
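A minimal sketch of this step, assuming advertools is installed and imported as adv, and that sitemap_url holds your sitemap's URL (the value below is a placeholder):

import advertools as adv

# Download the XML sitemap into a DataFrame with one row per URL
sitemap_url = 'YOUR_SITEMAP_URL'
sitemap = adv.sitemap_to_df(sitemap_url)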
Crawl the sitemap’s URLs
The example below shows a simple list-mode crawl with no options. You can, of course, customize it as you want.
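A sketch of this step, reusing the sitemap DataFrame from above; the output file name is just an example, and it should end with .jl or .jsonl:

# Crawl all sitemap URLs in list mode (links are not followed by default)
adv.crawl(sitemap['loc'].tolist(), 'output_file.jsonl')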
Save the sitemap to disk
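We save the same DataFrame under the name that the periodic script below expects:

# Save the sitemap so the next run can compare against it
sitemap.to_csv('last_sitemap.csv', index=False)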
Periodic process
The following are the steps that are to be repeated periodically on a monthly, weekly, or even daily basis. First we read the last_sitemap into a DataFrame.
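This is the same file saved at the end of the one-off run:

import pandas as pd

# Read the sitemap saved at the end of the previous run
last_sitemap = pd.read_csv('last_sitemap.csv')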
Re-download the sitemap
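Here sitemap_url is the same placeholder used above:

# Download the current version of the same sitemap
current_sitemap = adv.sitemap_to_df(sitemap_url)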
Note that last_sitemap is read from disk, so its lastmod column is read as a string, not as a datetime object. We could use the parse_dates option of the pandas.read_csv function, but the parsed dates might be represented differently from those in the sitemap we just downloaded, so the easiest approach here is to compare both as strings.
So, now we convert the lastmod column’s type to str.
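This matches the conversion used in the combined script further down:

# Make the downloaded lastmod values comparable to the ones read from CSV
current_sitemap['lastmod'] = current_sitemap['lastmod'].astype(str)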
Find new URLs to crawl
- Find URLs in current_sitemap that are not in last_sitemap
- Get URLs present in both DataFrames but with a different lastmod
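Both steps are shown below, using the loc and lastmod columns returned by sitemap_to_df:

# 1. URLs in the current sitemap that were not in the last one
new_urls = set(current_sitemap['loc']).difference(last_sitemap['loc'])

# 2. URLs present in both sitemaps whose lastmod value has changed
lastmod_changed_urls = (pd.merge(
    last_sitemap[['loc', 'lastmod']],
    current_sitemap[['loc', 'lastmod']]
    .rename(columns={'lastmod': 'lastmod2'}))
    .assign(lastmod_changed=lambda df: df['lastmod2'].ne(df['lastmod']))
    .query('lastmod_changed==True')
    ['loc'].tolist())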
Combine the URLs to crawl and crawl them
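The two lists are concatenated and passed to the crawler; the output file name is again just an example:

urls_to_crawl = lastmod_changed_urls + list(new_urls)
adv.crawl(urls_to_crawl, 'output_file.jsonl')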
Save the current_sitemap to disk, overwriting the old sitemap
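This is the same file name the next run will read:

# The current sitemap becomes the "last" sitemap for the next run
current_sitemap.to_csv('last_sitemap.csv', index=False)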
Putting it all together
One-off sitemap and crawl code:
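A sketch combining the one-off steps above (the sitemap URL and file names are placeholders you would replace):

import advertools as adv

sitemap_url = 'YOUR_SITEMAP_URL'

# Download the sitemap, crawl its URLs, and save it for the next run
sitemap = adv.sitemap_to_df(sitemap_url)
adv.crawl(sitemap['loc'].tolist(), 'output_file.jsonl')
sitemap.to_csv('last_sitemap.csv', index=False)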
Periodic update
evergreen_crawling.py
import advertools as adv
import pandas as pd

sitemap_url = 'YOUR_SITEMAP_URL'

# Read the sitemap saved during the previous run
last_sitemap = pd.read_csv('last_sitemap.csv')

# Re-download the current version of the sitemap
current_sitemap = adv.sitemap_to_df(sitemap_url)
current_sitemap['lastmod'] = current_sitemap['lastmod'].astype(str)

# URLs in the current sitemap that were not in the last one
new_urls = set(current_sitemap['loc']).difference(last_sitemap['loc'])

# URLs present in both sitemaps whose lastmod value has changed
lastmod_changed_urls = (pd.merge(
    last_sitemap[['loc', 'lastmod']],
    current_sitemap[['loc', 'lastmod']]
    .rename(columns={'lastmod': 'lastmod2'}))
    .assign(lastmod_changed=lambda df: df['lastmod2'].ne(df['lastmod']))
    .query('lastmod_changed==True')
    ['loc'].tolist())

urls_to_crawl = lastmod_changed_urls + list(new_urls)
adv.crawl(urls_to_crawl, 'output_file.jsonl')

# Overwrite the old sitemap with the current one
current_sitemap.to_csv('last_sitemap.csv', index=False)
Automating the process
Assuming you’ve done the first part, and now have a last_sitemap.csv file as well as an output_file.jsonl file, you can create a cron job to run the second part periodically, for example:
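On a Unix-like system with cron available, one way is to open your crontab for editing:

crontab -e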
Then add the following line to the end of the file:
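A hypothetical entry that runs the script every day at midnight; adjust the schedule, the Python interpreter path, and the script path to your setup:

0 0 * * * /usr/bin/python3 /home/full/path/to/evergreen_crawling/evergreen_crawling.py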
Learn more here about automating Python scripts if you are interested.
Scaling the process across multiple sitemaps and websites
The two main variables that we have here are the XML sitemap URL and the name of the website. The name is something you assign yourself just to differentiate the files. So, if your sitemap URL is https://example.com/sitemap.xml, the name could be “example”. This will be used to name this website’s files, for example {name}_crawl.jsonl.
import advertools as adv
import pandas as pd

# Create a list of tuples of sitemap URLs and website names
sitemap_name = [
    ('https://example_1.com/sitemap.xml', 'example_1'),
    ('https://example_2.com/sitemap.xml', 'example_2'),
    ('https://example_3.com/sitemap.xml', 'example_3'),
]

# Name the full path of where the files are going to be stored
path = '/home/full/path/to/evergreen_crawling'

# Run a for-loop to go through all combinations and run the same process
for sitemap_url, name in sitemap_name:
    print(sitemap_url, name)
    try:
        # Read this website's previously saved sitemap
        last_sitemap = pd.read_csv(f'{path}/{name}_last_sitemap.csv')
        # Re-download the current sitemap and normalize lastmod to str
        current_sitemap = adv.sitemap_to_df(sitemap_url)
        current_sitemap['lastmod'] = current_sitemap['lastmod'].astype(str)
        # New URLs and URLs whose lastmod has changed
        new_urls = set(current_sitemap['loc']).difference(last_sitemap['loc'])
        lastmod_changed_urls = (pd.merge(
            last_sitemap[['loc', 'lastmod']],
            current_sitemap[['loc', 'lastmod']]
            .rename(columns={'lastmod': 'lastmod2'}))
            .assign(lastmod_changed=lambda df: df['lastmod2'].ne(df['lastmod']))
            .query('lastmod_changed==True')
            ['loc'].tolist())
        urls_to_crawl = lastmod_changed_urls + list(new_urls)
        if urls_to_crawl:
            adv.crawl(
                urls_to_crawl,
                f'{path}/{name}_crawl.jsonl',
                custom_settings={
                    'LOG_FILE': f'{name}_crawl.log',
                    'AUTOTHROTTLE_ENABLED': True,
                })
        # The current sitemap becomes the last sitemap for the next run
        current_sitemap.to_csv(f'{path}/{name}_last_sitemap.csv', index=False)
    except Exception as e:
        # Log any failure per website and continue with the next one
        with open(f'{path}/evergreen_{name}_errors.txt', 'at') as file:
            print(str(e), file=file)
Automation
Save this code to a file, for example evergreen_crawling.py, and you can run it on a daily basis:
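For example, a hypothetical crontab entry that runs it every day at 2 AM (the interpreter and script paths are placeholders you would adjust):

0 2 * * * /usr/bin/python3 /home/full/path/to/evergreen_crawling/evergreen_crawling.py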