Setting custom request headers might be useful in a few scenarios, and I will highlight three of them here:
- Setting a custom User-agent
- Fetching the sitemap if it has a different ETag from the last time it was fetched
- Fetching the sitemap if it has a different last modified date from the last time it was fetched
The syntax is simple: pass a dictionary of headers to the request_headers parameter.
import advertools as adv

adv.sitemap_to_df(
    "https://example.com",
    request_headers={"User-Agent": "YOUR CUSTOM USER AGENT"})
Setting a custom User-agent when fetching XML sitemaps
Sometimes a server might not accept your request, and you might want to retry with a different user agent.
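One way to approach this is to try a short list of user agents in turn and keep the first result that succeeds. The sketch below is not part of advertools: fetch_with_user_agents and its fetch argument are hypothetical names, and fetch stands for any callable that accepts a URL and a request_headers dictionary, as sitemap_to_df does.

```python
def fetch_with_user_agents(fetch, url, user_agents):
    """Call fetch(url, request_headers=...) once per user agent until one succeeds.

    Hypothetical helper: `fetch` is any callable that accepts a URL and a
    `request_headers` dictionary, for example advertools' sitemap_to_df.
    """
    last_error = None
    for user_agent in user_agents:
        try:
            return fetch(url, request_headers={"User-Agent": user_agent})
        except Exception as error:  # broad on purpose: any failure moves on to the next agent
            last_error = error
    if last_error is None:
        raise ValueError("user_agents must not be empty")
    # Every user agent failed, so surface the last error to the caller.
    raise last_error
```

With advertools this could be called as fetch_with_user_agents(adv.sitemap_to_df, "https://example.com", ["FIRST USER AGENT", "SECOND USER AGENT"]). Catching every exception is a simplification; in practice you may prefer to retry only on specific HTTP errors.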
Fetching the sitemap only if its ETag was changed
An ETag (entity tag) is an identifier for a specific version of a resource, often a hash of its content, so it can be a very efficient way of checking whether a resource has changed. Essentially, you are reducing a full resource (an XML sitemap) that could contain megabytes of data to a tiny string like this: "42e16f95b870be16905efbeb68c28a87".
The flow:
- Fetch an XML sitemap with sitemap_to_df
- Save its ETag from the etag column
- After a while, fetch the same sitemap using the saved ETag and the header {"If-None-Match": "<SAVED_ETAG>"}
- If the ETag was not modified, you will get a 304: Not Modified status code, and the (unchanged) sitemap will not be fetched. Otherwise you will get the modified sitemap.
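The steps above can be wrapped in a small helper. This is a minimal sketch, not an advertools function: fetch_if_etag_changed and its fetch argument are hypothetical names, and the code assumes that a 304 response surfaces as urllib.error.HTTPError, which is what the "HTTP Error 304: Not Modified" message format suggests.

```python
from urllib.error import HTTPError


def fetch_if_etag_changed(fetch, url, saved_etag):
    """Return (result, etag); result is None when the server answers 304.

    Hypothetical helper: `fetch` is any callable taking a URL and a
    `request_headers` dictionary and returning a DataFrame-like object
    with an "etag" column, such as advertools' sitemap_to_df.
    Assumption: a 304 response is raised as urllib.error.HTTPError.
    """
    try:
        result = fetch(url, request_headers={"If-None-Match": saved_etag})
    except HTTPError as error:
        if error.code == 304:
            # Not modified: keep the ETag we already have.
            return None, saved_etag
        raise  # any other HTTP error is a real failure
    # Modified: return the new ETag so it can be saved for the next check.
    return result, result["etag"][0]
```

Returning the ETag in both cases means the caller can always overwrite its saved value with the second element of the result, whether or not the sitemap changed.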
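The first fetch returns a DataFrame like the following (first five rows shown), which includes the etag column:
 | loc | news | publication_name | publication_language | news_publication_date | news_title | news_keywords | news_genres | sitemap | etag | sitemap_last_modified | sitemap_size_mb | download_date |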
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | https://www.ft.com/content/1fa1a199-a52d-4a97-... | \n | Financial Times | en | 2024-07-20T09:00:18.067Z | What happens if Joe Biden drops out of the rac... | US politics & policy, Joe Biden, US presidenti... | NaN | https://www.ft.com/sitemaps/news.xml | "42e16f95b870be16905efbeb68c28a87" | 2024-07-20 09:04:38 | 0.180689 | 2024-07-20 09:51:59.780749+00:00 |
1 | https://www.ft.com/content/e1a48ae5-300f-4d1c-... | \n | Financial Times | en | 2024-07-20T09:00:18.061Z | Why ‘no tax on tips’ has become a Trump electi... | Republican Party US, US tax, US politics & pol... | NaN | https://www.ft.com/sitemaps/news.xml | "42e16f95b870be16905efbeb68c28a87" | 2024-07-20 09:04:38 | 0.180689 | 2024-07-20 09:51:59.780749+00:00 |
2 | https://www.ft.com/content/65b74036-de3f-441d-... | \n | Financial Times | en | 2024-07-20T08:00:17.928Z | The $75bn rights battle reaches overtime | NBA, Comcast Corp, Paris Olympics, Warner Bros... | NaN | https://www.ft.com/sitemaps/news.xml | "42e16f95b870be16905efbeb68c28a87" | 2024-07-20 09:04:38 | 0.180689 | 2024-07-20 09:51:59.780749+00:00 |
3 | https://www.ft.com/content/3f26878d-b582-4e92-... | \n | Financial Times | en | 2024-07-20T07:00:17.853Z | Large parts of the US set to experience ‘dange... | Climate change, Environment, Science, Jana Tau... | NaN | https://www.ft.com/sitemaps/news.xml | "42e16f95b870be16905efbeb68c28a87" | 2024-07-20 09:04:38 | 0.180689 | 2024-07-20 09:51:59.780749+00:00 |
4 | https://www.ft.com/content/4b5f63e5-f962-4276-... | \n | Financial Times | en | 2024-07-20T04:35:23.516Z | Transcript: Swamp Notes — Trump pushes unity a... | NaN | NaN | https://www.ft.com/sitemaps/news.xml | "42e16f95b870be16905efbeb68c28a87" | 2024-07-20 09:04:38 | 0.180689 | 2024-07-20 09:51:59.780749+00:00 |
Trying again after a while:
try:
    ft2 = adv.sitemap_to_df(
        sitemap_url='https://www.ft.com/sitemaps/news.xml',
        request_headers={'If-None-Match': etag})
    # Only reached if the sitemap changed: save the new ETag
    etag = ft2['etag'][0]
except Exception as e:
    print(etag, str(e))
"42e16f95b870be16905efbeb68c28a87" HTTP Error 304: Not Modified
Trying again later, the sitemap has changed, so we get the more recent version, along with a brand new ETag that we save for the next check.
try:
    ft2 = adv.sitemap_to_df(
        sitemap_url='https://www.ft.com/sitemaps/news.xml',
        request_headers={'If-None-Match': etag})
    # Only reached if the sitemap changed: save the new ETag
    etag = ft2['etag'][0]
    print(etag)
except Exception as e:
    print(etag, str(e))
2024-07-20 19:14:13,988 | INFO | sitemaps.py:616 | sitemap_to_df | Getting https://www.ft.com/sitemaps/news.xml
"0acd7ce6203ba1a1991349f74d57b14d"
0 2024-07-20 09:04:38
1 2024-07-20 09:04:38
2 2024-07-20 09:04:38
3 2024-07-20 09:04:38
4 2024-07-20 09:04:38
...
359 2024-07-20 09:04:38
360 2024-07-20 09:04:38
361 2024-07-20 09:04:38
362 2024-07-20 09:04:38
363 2024-07-20 09:04:38
Name: sitemap_last_modified, Length: 364, dtype: datetime64[ns]
Fetching the sitemap only if its Last-Modified date was changed
The Last-Modified response header can serve as another way to check whether a sitemap has changed. The difference is that it is based on the date, and not on whether the sitemap is different from the one you have (which is what the ETag checks). Generally, it is good to check whether the website you are working with implements this accurately. Often it is not accurate, and then it cannot be reliably used in this scenario.
The flow:
- Fetch an XML sitemap with sitemap_to_df
- Save its last modified header from the sitemap_last_modified column
- After a while, fetch the same sitemap using the saved sitemap_last_modified and the header {"If-Modified-Since": "<SAVED_LAST_MODIFIED_DATE>"}
- If the sitemap was not modified since that date, you will get a 304: Not Modified status code, and the (unchanged) sitemap will not be fetched. Otherwise you will get the modified sitemap.
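The HTTP specification defines the If-Modified-Since value as an HTTP-date, for example Sat, 20 Jul 2024 09:04:38 GMT, and servers may ignore values in other formats. If a server does not respond as expected to the saved value, converting it first may help. The sketch below uses only the standard library; to_http_date is a hypothetical helper name, and it assumes the saved value is a string of the form YYYY-MM-DD HH:MM:SS expressed in UTC.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def to_http_date(last_modified):
    """Convert 'YYYY-MM-DD HH:MM:SS' (assumed to be UTC) to an HTTP-date string."""
    parsed = datetime.strptime(last_modified, "%Y-%m-%d %H:%M:%S")
    # format_datetime(..., usegmt=True) requires a timezone-aware UTC datetime.
    return format_datetime(parsed.replace(tzinfo=timezone.utc), usegmt=True)


headers = {"If-Modified-Since": to_http_date("2024-07-20 09:04:38")}
print(headers)  # {'If-Modified-Since': 'Sat, 20 Jul 2024 09:04:38 GMT'}
```

The resulting dictionary can be passed as request_headers in the same way as the other examples.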
'2024-07-20 09:04:38'
Trying again after a while.
try:
    ft3 = adv.sitemap_to_df(
        'https://www.ft.com/sitemaps/news.xml',
        request_headers={'If-Modified-Since': sitemap_last_modified})
    # Only reached if the sitemap changed: save the new date from ft3
    sitemap_last_modified = ft3['sitemap_last_modified'][0]
except Exception as e:
    print(sitemap_last_modified, str(e))
2024-07-20 09:04:38 HTTP Error 304: Not Modified
Trying again later, this time we get the updated sitemap.
try:
    ft3 = adv.sitemap_to_df(
        'https://www.ft.com/sitemaps/news.xml',
        request_headers={'If-Modified-Since': sitemap_last_modified})
    # Only reached if the sitemap changed: save the new date from ft3
    sitemap_last_modified = ft3['sitemap_last_modified'][0]
    print(sitemap_last_modified)
except Exception as e:
    print(sitemap_last_modified, str(e))
2024-07-20 19:14:57,737 | INFO | sitemaps.py:616 | sitemap_to_df | Getting https://www.ft.com/sitemaps/news.xml
2024-07-20 16:04:39
Please keep in mind that the sitemap_last_modified column shows the Last-Modified response header that was given by the server, and refers to the whole sitemap as a resource. This is not to be confused with the <lastmod> tag that is unique to each URL listed in the sitemap.
These were three ways of using the request_headers parameter. You can combine them, and of course use any other headers as well.
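As an illustration of combining headers, the sketch below builds one dictionary from whichever values you have saved. build_request_headers is a hypothetical helper, not an advertools function, and the values shown are placeholders.

```python
def build_request_headers(user_agent=None, etag=None, last_modified=None):
    """Build a request_headers dictionary from whichever values are available."""
    candidates = {
        "User-Agent": user_agent,
        "If-None-Match": etag,
        "If-Modified-Since": last_modified,
    }
    # Leave out headers that have no saved value yet (e.g. on the first fetch).
    return {name: value for name, value in candidates.items() if value is not None}


headers = build_request_headers(
    user_agent="YOUR CUSTOM USER AGENT",
    etag='"42e16f95b870be16905efbeb68c28a87"',
)
print(headers)
```

Note that when a request carries both If-None-Match and If-Modified-Since, the HTTP specification tells servers to evaluate If-None-Match and ignore If-Modified-Since, so sending both mainly helps with servers that support only one of them.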