XML Sitemap Request Headers

Setting custom request headers while fetching and parsing XML sitemaps.
sitemaps
advertools
v0.15
Author

Elias Dabbas

Published

July 20, 2024

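Setting custom request headers can be useful in several scenarios. I will highlight three of them here: setting a custom User-Agent, fetching a sitemap only if its ETag has changed, and fetching it only if its Last-Modified date has changed.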

The syntax is simple: supply a dictionary of headers to the request_headers parameter.

import advertools as adv
adv.sitemap_to_df(
    "https://example.com",
    request_headers={"User-Agent": "YOUR CUSTOM USER AGENT"})

Setting a custom User-agent when fetching XML sitemaps

Sometimes a server might not accept your request, and you might want to retry with a different user agent.

import advertools as adv
import pandas as pd
win_chrome_ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36'

ft = adv.sitemap_to_df(
    sitemap_url='https://www.ft.com/sitemaps/news.xml',
    request_headers={'User-Agent': win_chrome_ua})
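
If you are not sure which user agent a server accepts, one option is to try a few in order. The following is a minimal sketch and not part of advertools: fetch_with_user_agents is a hypothetical helper, and it assumes that a refused request surfaces as an exception from the fetching function (adv.sitemap_to_df would be passed in as fetch).

```python
def fetch_with_user_agents(fetch, sitemap_url, user_agents):
    """Try each user agent in order and return the first successful result.

    `fetch` is any callable with the calling convention used in this post
    for adv.sitemap_to_df: URL first, `request_headers` as a keyword argument.
    """
    if not user_agents:
        raise ValueError("Provide at least one user agent string")
    last_error = None
    for user_agent in user_agents:
        try:
            return fetch(sitemap_url, request_headers={"User-Agent": user_agent})
        except Exception as error:  # assumption: a refused request raises
            last_error = error
    # Every user agent failed, so surface the last error to the caller.
    raise last_error


# Possible usage (not executed here):
# ft = fetch_with_user_agents(
#     adv.sitemap_to_df,
#     'https://www.ft.com/sitemaps/news.xml',
#     [win_chrome_ua, 'ANOTHER USER AGENT'])
```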

Fetching the sitemap only if its ETag was changed

An ETag (entity tag) is an identifier for a specific version of a resource, typically a hash of its content, and can therefore be a very efficient way of checking whether a page has changed. Essentially, you are reducing a full resource (an XML sitemap) that could be several megabytes in size to a tiny string like this: "42e16f95b870be16905efbeb68c28a87".

The flow:

  1. Fetch an XML sitemap with sitemap_to_df
  2. Save its ETag from the etag column
  3. After a while fetch the same sitemap using the saved ETag and the header {"If-None-Match": "<SAVED_ETAG>"}
  4. If the ETag was not modified, you will get a 304: Not Modified status code, and the (unchanged) sitemap will not be fetched. Otherwise you will get a modified one.
# ft = adv.sitemap_to_df('https://www.ft.com/sitemaps/news.xml')
ft.head()
loc news publication_name publication_language news_publication_date news_title news_keywords news_genres sitemap etag sitemap_last_modified sitemap_size_mb download_date
0 https://www.ft.com/content/1fa1a199-a52d-4a97-... \n Financial Times en 2024-07-20T09:00:18.067Z What happens if Joe Biden drops out of the rac... US politics & policy, Joe Biden, US presidenti... NaN https://www.ft.com/sitemaps/news.xml "42e16f95b870be16905efbeb68c28a87" 2024-07-20 09:04:38 0.180689 2024-07-20 09:51:59.780749+00:00
1 https://www.ft.com/content/e1a48ae5-300f-4d1c-... \n Financial Times en 2024-07-20T09:00:18.061Z Why ‘no tax on tips’ has become a Trump electi... Republican Party US, US tax, US politics & pol... NaN https://www.ft.com/sitemaps/news.xml "42e16f95b870be16905efbeb68c28a87" 2024-07-20 09:04:38 0.180689 2024-07-20 09:51:59.780749+00:00
2 https://www.ft.com/content/65b74036-de3f-441d-... \n Financial Times en 2024-07-20T08:00:17.928Z The $75bn rights battle reaches overtime NBA, Comcast Corp, Paris Olympics, Warner Bros... NaN https://www.ft.com/sitemaps/news.xml "42e16f95b870be16905efbeb68c28a87" 2024-07-20 09:04:38 0.180689 2024-07-20 09:51:59.780749+00:00
3 https://www.ft.com/content/3f26878d-b582-4e92-... \n Financial Times en 2024-07-20T07:00:17.853Z Large parts of the US set to experience ‘dange... Climate change, Environment, Science, Jana Tau... NaN https://www.ft.com/sitemaps/news.xml "42e16f95b870be16905efbeb68c28a87" 2024-07-20 09:04:38 0.180689 2024-07-20 09:51:59.780749+00:00
4 https://www.ft.com/content/4b5f63e5-f962-4276-... \n Financial Times en 2024-07-20T04:35:23.516Z Transcript: Swamp Notes — Trump pushes unity a... NaN NaN https://www.ft.com/sitemaps/news.xml "42e16f95b870be16905efbeb68c28a87" 2024-07-20 09:04:38 0.180689 2024-07-20 09:51:59.780749+00:00
etag = ft['etag'][0]
print(etag)
"42e16f95b870be16905efbeb68c28a87"
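
Since the next check might happen in a different session, it can help to store the ETag on disk. This is a minimal sketch using only the standard library; the helper names (load_etags, save_etag) and the file name are hypothetical. Note that the surrounding double quotes are part of the ETag value, so they are stored as they are.

```python
import json
from pathlib import Path


def load_etags(path):
    """Return the saved {sitemap_url: etag} mapping, or {} if none exists yet."""
    file = Path(path)
    if not file.exists():
        return {}
    return json.loads(file.read_text(encoding="utf-8"))


def save_etag(path, sitemap_url, etag):
    """Store (or overwrite) the ETag of one sitemap URL in a JSON file."""
    etags = load_etags(path)
    etags[sitemap_url] = etag
    Path(path).write_text(json.dumps(etags, indent=2), encoding="utf-8")


# Possible usage (not executed here):
# save_etag('etags.json', 'https://www.ft.com/sitemaps/news.xml', etag)
# etag = load_etags('etags.json').get('https://www.ft.com/sitemaps/news.xml')
```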

Trying again after a while:

try:
    ft2 = adv.sitemap_to_df(
        sitemap_url='https://www.ft.com/sitemaps/news.xml',
        request_headers={'If-None-Match': etag})
    etag = ft2['etag'][0]
except Exception as e:
    print(etag, str(e))
"42e16f95b870be16905efbeb68c28a87" HTTP Error 304: Not Modified

Trying again later, the sitemap has changed, so we get the more recent version, with a brand new ETag that we also save for the next check.

try:
    ft2 = adv.sitemap_to_df(
        sitemap_url='https://www.ft.com/sitemaps/news.xml',
        request_headers={'If-None-Match': etag})
    etag = ft2['etag'][0]
    print(etag)
except Exception as e:
    print(etag, str(e))
2024-07-20 19:14:13,988 | INFO | sitemaps.py:616 | sitemap_to_df | Getting https://www.ft.com/sitemaps/news.xml
"0acd7ce6203ba1a1991349f74d57b14d"
ft['sitemap_last_modified']
0     2024-07-20 09:04:38
1     2024-07-20 09:04:38
2     2024-07-20 09:04:38
3     2024-07-20 09:04:38
4     2024-07-20 09:04:38
              ...        
359   2024-07-20 09:04:38
360   2024-07-20 09:04:38
361   2024-07-20 09:04:38
362   2024-07-20 09:04:38
363   2024-07-20 09:04:38
Name: sitemap_last_modified, Length: 364, dtype: datetime64[ns]
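
The ETag flow above can be wrapped in a small function so that it can run on a schedule. This is a sketch under stated assumptions rather than an advertools feature: fetch_if_changed is a hypothetical helper, adv.sitemap_to_df would be passed in as fetch, and a 304 response is assumed to surface as an exception whose message starts with "HTTP Error 304", as in the output shown earlier.

```python
def fetch_if_changed(fetch, sitemap_url, etag=None):
    """Return (sitemap, etag). `sitemap` is None when nothing has changed."""
    try:
        if etag is None:
            # First run: no saved ETag yet, so make an unconditional request.
            sitemap = fetch(sitemap_url)
        else:
            sitemap = fetch(sitemap_url, request_headers={"If-None-Match": etag})
    except Exception as error:
        # Assumption: 304 arrives as an exception, either with a `code`
        # attribute or with a message like "HTTP Error 304: Not Modified".
        not_modified = (
            getattr(error, "code", None) == 304
            or str(error).startswith("HTTP Error 304")
        )
        if not_modified:
            return None, etag  # keep the saved ETag for the next check
        raise  # any other error is a real problem
    # The sitemap changed (or this is the first fetch): return the new ETag.
    return sitemap, sitemap["etag"][0]


# Possible usage (not executed here):
# ft2, etag = fetch_if_changed(
#     adv.sitemap_to_df, 'https://www.ft.com/sitemaps/news.xml', etag)
# if ft2 is None:
#     print('Not modified')
```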

Fetching the sitemap only if its Last-Modified date was changed

The Last-Modified response header can serve as another way to check whether a sitemap has changed. The difference is that it is based on a date, and not on whether the sitemap's content differs from the copy you have (which is what the ETag checks). It is a good idea to verify that the website you are working with implements this header accurately. Often it is not accurate, and cannot be reliably used in this scenario.

The flow:

  1. Fetch an XML sitemap with sitemap_to_df
  2. Save its last modified header from the sitemap_last_modified column
  3. After a while fetch the same sitemap using the saved sitemap_last_modified and the header {"If-Modified-Since": "<SAVED_LAST_MODIFIED_DATE>"}
  4. If the sitemap was not modified since that date, you will get a 304: Not Modified status code, and the (unchanged) sitemap will not be fetched. Otherwise you will get a modified one.
sitemap_last_modified = str(ft['sitemap_last_modified'][0])
sitemap_last_modified
'2024-07-20 09:04:38'

Trying again after a while.

try:
    ft3 = adv.sitemap_to_df(
        'https://www.ft.com/sitemaps/news.xml',
        request_headers={'If-Modified-Since': sitemap_last_modified})
    sitemap_last_modified = str(ft3['sitemap_last_modified'][0])
except Exception as e:
    print(sitemap_last_modified, str(e))
2024-07-20 09:04:38 HTTP Error 304: Not Modified

Trying again. This time we get the updated sitemap.

try:
    ft3 = adv.sitemap_to_df(
        'https://www.ft.com/sitemaps/news.xml',
        request_headers={'If-Modified-Since': sitemap_last_modified})
    sitemap_last_modified = str(ft3['sitemap_last_modified'][0])
    print(sitemap_last_modified)
except Exception as e:
    print(sitemap_last_modified, str(e))
2024-07-20 19:14:57,737 | INFO | sitemaps.py:616 | sitemap_to_df | Getting https://www.ft.com/sitemaps/news.xml
2024-07-20 16:04:39
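
The plain timestamp string worked in the run above, but the HTTP specification defines the If-Modified-Since value as an HTTP-date (for example "Sat, 20 Jul 2024 09:04:38 GMT"), and a server is allowed to ignore other formats. Here is a small sketch of converting the saved value with the standard library. The function name to_http_date is hypothetical, and the sketch assumes that a timestamp without a timezone is in UTC, since Last-Modified headers are expressed in GMT.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def to_http_date(value):
    """Convert a timestamp such as '2024-07-20 09:04:38' to an HTTP-date."""
    moment = datetime.fromisoformat(str(value))
    if moment.tzinfo is None:
        # Assumption: naive timestamps are already in UTC.
        moment = moment.replace(tzinfo=timezone.utc)
    # usegmt=True produces the "GMT" suffix that HTTP headers use.
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)


print(to_http_date('2024-07-20 09:04:38'))
# Sat, 20 Jul 2024 09:04:38 GMT
```

The result can then be supplied as {"If-Modified-Since": to_http_date(sitemap_last_modified)}.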
Note

Please keep in mind that the sitemap_last_modified column shows the Last-Modified response header that was given by the server, and refers to the whole sitemap as a resource. This is not to be confused with the <lastmod> tag that is unique to each URL listed in the sitemap.

These were three ways of using the request_headers parameter. You can combine them, and of course use any other headers as well.
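
As an illustration of combining headers, the following sketch builds one dictionary from whichever values you have saved. The helper build_request_headers is hypothetical and only assembles a plain dictionary for the request_headers parameter. Keep in mind that when both If-None-Match and If-Modified-Since are sent, the HTTP specification gives precedence to If-None-Match.

```python
def build_request_headers(user_agent=None, etag=None, last_modified=None):
    """Assemble a request_headers dictionary, skipping values that are not set."""
    headers = {}
    if user_agent:
        headers["User-Agent"] = user_agent
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


# Possible usage (not executed here):
# adv.sitemap_to_df(
#     'https://www.ft.com/sitemaps/news.xml',
#     request_headers=build_request_headers(user_agent=win_chrome_ua, etag=etag))
```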