Using the crawl meta parameter while crawling

An overview of how to use the new meta parameter while crawling with advertools: set arbitrary metadata, set custom request headers per URL, and get limited support for crawling JavaScript websites.
crawling
scraping
advertools
scrapy
Author

Elias Dabbas

Published

August 19, 2024

Using the meta parameter while crawling - new in v0.16.0

adv.crawl(
    url_list="https://example.com",
    output_file="output_file.jsonl",
    meta={"foo": "bar"}) 
import advertools as adv
import pandas as pd
pd.options.display.max_columns = None
adv.__version__
'0.16.1'

What does this parameter do?

This parameter takes a simple dictionary and can be used for the following purposes (a combined example is sketched after the list):

  • Arbitrary metadata

    • Tracking/context information about the crawl (device used, timing, country, etc.)
  • Custom request headers per URL

    • Set different headers per URL, using the special key custom_headers, with a dictionary as its value:

      meta={'custom_headers': {'URL_1': {'header': 'value'}, 'URL_2': {'header': 'value'}}}

  • Integration with some 3rd party packages like playwright

    Currently this provides some limited support for crawling JavaScript websites.

  • Using a proxy while crawling
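
Putting the first two purposes together, a single call might look like this (a sketch: the URLs, header values, and metadata values are placeholders, not from a real crawl):

adv.crawl(
    url_list=['https://example.com/A', 'https://example.com/B'],
    output_file='example_crawl.jsonl',
    meta={
        # arbitrary metadata: each key becomes a column in the crawl DataFrame
        'crawl_country': 'DE',
        'device': 'iPhone',
        # per-URL request headers go under the special custom_headers key
        'custom_headers': {
            'https://example.com/A': {'If-None-Match': 'Etag A'},
            'https://example.com/B': {'If-None-Match': 'Etag B'},
        },
    })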

Arbitrary metadata

You can create your own custom columns with any data you want for tracking, context, reporting, and so on. The keys will become column names, and their values will fill those columns.

For example running this code:

adv.crawl("https://example.com", "output.jsonl", meta={"color": "blue", "day": "Tuesday"})

will end up like this in the crawl DataFrame:

url                  color  day
https://example.com  blue   Tuesday

If you crawl multiple URLs, every value under “color” will be “blue”, and every value under “day” will be “Tuesday”.

Some examples for using this feature:

  • User-agent human name (write “iPhone” instead of the full user agent string)
  • Crawl country when using proxies
  • Crawl timing (before/after launch/redirect/migration/new content etc.)
  • Custom settings you used like stopping the spider after a certain number of pages, or crawling only a certain section of the website, and so on.
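
To illustrate the first example, here is a hypothetical sketch: two crawls of the same site with different user agents, written to the same output file (advertools appends to an existing .jsonl file), each tagged with a human-readable ua_name column:

# placeholder user agent strings; substitute the full real ones
user_agents = {
    'iPhone': 'Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) ...',
    'Chrome Desktop': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...',
}

for ua_name, ua in user_agents.items():
    adv.crawl(
        'https://example.com',
        'ua_comparison.jsonl',  # both crawls append to this file
        meta={'ua_name': ua_name},  # readable name instead of the full UA string
        custom_settings={'USER_AGENT': ua})

crawldf = pd.read_json('ua_comparison.jsonl', lines=True)
# compare responses per device, e.g. status codes per user agent
crawldf.groupby('ua_name')['status'].value_counts()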

Custom request headers per URL

Although you can easily set custom request headers with the custom_settings parameter, those headers apply to all crawled URLs. With meta, you can set different headers per URL, which lets you customize your crawling at a more granular level.

Some interesting use-cases for this feature

  • Re-crawl only pages that have changed since the last crawl. This requires the server to use the Etag response header, for which you can use the If-None-Match request header. With this approach you can keep a fresh and updated copy of the current website’s pages, without having to re-crawl every single page.
  • Re-crawl only pages that have changed using Last-Modified. This is the same as the previous option, but using a different header. In this case, you can use the If-Modified-Since request header.
  • Set a different user-agent per URL.

Some examples of how this can be set

# generic example
meta={'custom_headers': {
    'URL_A': {"Header 1": "Value 1"},
    'URL_B': {"Header 1": "Value 2"},
    'URL_C': {"Header 1": "Value 3"},
}}

# practical simple example, using one request header
meta={'custom_headers': {
    'https://example.com/A': {"If-None-Match": "Etag 1"},
    'https://example.com/B': {"If-None-Match": "Etag 2"},
    'https://example.com/C': {"If-None-Match": "Etag 3"},
}}

# practical example showing multiple and different headers per URL:
meta={'custom_headers': {
    'https://example.com/A': {"User-Agent": "iPhone UA", "If-Modified-Since": "Mon, 19 Aug 2024 07:28:00 GMT"},
    'https://example.com/B': {"User-Agent": "Chrome Desktop UA", "If-None-Match": "Etag B"},
    'https://example.com/C': {"If-None-Match": "Etag C"},
}}

Custom headers examples

adv.crawl('https://blog.adver.tools', 'blog.jsonl')
crawldf = pd.read_json('blog.jsonl', lines=True)
crawldf[['url', 'resp_headers_Etag']]
                        url   resp_headers_Etag
0  https://blog.adver.tools  W/"66be08d3-12726"

Let’s crawl again, using this Etag in the If-None-Match request header.

adv.crawl(
    'https://blog.adver.tools',
    'blog.jsonl', meta={
        'custom_headers': {
            'https://blog.adver.tools': {
                'If-None-Match': crawldf['resp_headers_Etag'][0]
            }}})
crawldf = pd.read_json('blog.jsonl', lines=True)
crawldf[['url', 'status', 'resp_headers_Etag', 'request_headers_If-None-Match', 'body_text']]
                        url  status   resp_headers_Etag  request_headers_If-None-Match                   body_text
0  https://blog.adver.tools     200  W/"66be08d3-12726"                            NaN  \n \n \n \n \n Blog \n ...
1  https://blog.adver.tools     304    "66be08d3-12726"             W/"66be08d3-12726"

We can also use the If-Modified-Since header and only crawl the page if it was modified after that date.

adv.crawl(
    'https://blog.adver.tools',
    'blog.jsonl', meta={
        'custom_headers': {
            'https://blog.adver.tools': {
                'If-Modified-Since': crawldf['resp_headers_Last-Modified'][0]
            }}})
crawldf = pd.read_json('blog.jsonl', lines=True)
crawldf[['url', 'status', 'resp_headers_Etag', 'resp_headers_Last-Modified', 'request_headers_If-None-Match', 'request_headers_If-Modified-Since']]
                        url  status   resp_headers_Etag     resp_headers_Last-Modified  request_headers_If-None-Match  request_headers_If-Modified-Since
0  https://blog.adver.tools     200  W/"66be08d3-12726"  Thu, 15 Aug 2024 13:55:31 GMT                            NaN                                NaN
1  https://blog.adver.tools     304    "66be08d3-12726"  Thu, 15 Aug 2024 13:55:31 GMT             W/"66be08d3-12726"                                NaN
2  https://blog.adver.tools     304    "66be08d3-12726"  Thu, 15 Aug 2024 13:55:31 GMT                            NaN      Thu, 15 Aug 2024 13:55:31 GMT

Let’s crawl the whole blog, and see how this can be used in bulk.

The initial crawl doesn’t need any special settings. It is in the subsequent crawls that we start to set those, based on what we received from previous ones.

I’ll add arbitrary metadata as well just to demonstrate.

adv.crawl(
    'https://blog.adver.tools',
    'blog_full.jsonl',
    follow_links=True,
    meta={'crawl_type': 'initial/discovery'}
)
crawldf = pd.read_json('blog_full.jsonl', lines=True)
crawldf[['url', 'status', 'crawl_type', 'resp_headers_Etag', 'resp_headers_Last-Modified']].head()
                                                 url  status         crawl_type   resp_headers_Etag     resp_headers_Last-Modified
0                           https://blog.adver.tools     200  initial/discovery  W/"66be08d3-12726"  Thu, 15 Aug 2024 13:55:31 GMT
1  https://blog.adver.tools/posts/logfile-analysi...     200  initial/discovery   W/"66be08c1-9f3f"  Thu, 15 Aug 2024 13:55:13 GMT
2  https://blog.adver.tools/posts/sitemap-request...     200  initial/discovery   W/"66be08a1-b243"  Thu, 15 Aug 2024 13:54:41 GMT
3  https://blog.adver.tools/posts/evergreen-crawl...     200  initial/discovery   W/"66be08a3-c222"  Thu, 15 Aug 2024 13:54:43 GMT
4  https://blog.adver.tools/posts/work-using-code...     200  initial/discovery   W/"66be08bd-9506"  Thu, 15 Aug 2024 13:55:09 GMT

Since this server uses both an Etag and a Last-Modified header, we can use either option for subsequent crawls. Let’s crawl it again using the Etag.

custom_headers = {}
for url, etag in crawldf[['url', 'resp_headers_Etag']].values:
    custom_headers[url] = {'If-None-Match': etag}
custom_headers
{'https://blog.adver.tools': {'If-None-Match': 'W/"66be08d3-12726"'},
 'https://blog.adver.tools/posts/logfile-analysis-cli/index.html': {'If-None-Match': 'W/"66be08c1-9f3f"'},
 'https://blog.adver.tools/posts/sitemap-request-headers/index.html': {'If-None-Match': 'W/"66be08a1-b243"'},
 'https://blog.adver.tools/posts/evergreen-crawling-xml-sitemaps/index.html': {'If-None-Match': 'W/"66be08a3-c222"'},
 'https://blog.adver.tools/posts/work-using-code/index.html': {'If-None-Match': 'W/"66be08bd-9506"'},
 'https://blog.adver.tools/about.html': {'If-None-Match': 'W/"66be08c6-58b8"'},
 'https://blog.adver.tools/index.html': {'If-None-Match': 'W/"66be08d3-12726"'},
 'https://blog.adver.tools/posts/prompt-engineering/index.html': {'If-None-Match': 'W/"66be08b0-12ae2"'},
 'https://blog.adver.tools/posts/internal-link-analysis/index.html': {'If-None-Match': 'W/"66be08ba-521583"'},
 'https://blog.adver.tools/posts/bulk-prompting/index.html': {'If-None-Match': 'W/"66be08bc-a12f"'},
 'https://blog.adver.tools/posts/word-co-occurrence-matrix/index.html': {'If-None-Match': 'W/"66be08e3-388c3b"'},
 'https://blog.adver.tools/posts/llm-content-evaluation/index.html': {'If-None-Match': 'W/"66be08bb-11899"'},
 'https://blog.adver.tools/posts/auditing-vs-analysis/index.html': {'If-None-Match': 'W/"66be08b1-64e4"'},
 'https://blog.adver.tools/posts/compare-crawls/index.html': {'If-None-Match': 'W/"66be08b7-38ad9f"'},
 'https://blog.adver.tools/posts/gsc-audit-analysis/slides/index.html': {'If-None-Match': 'W/"66be08c3-3b7a93"'},
 'https://blog.adver.tools/posts/programming-vs-software-dev/index.html': {'If-None-Match': 'W/"66be08b1-6cad"'},
 'https://blog.adver.tools/posts/invert-mapping/index.html': {'If-None-Match': 'W/"66be08b7-b498"'},
 'https://blog.adver.tools/posts/serp-analysis/index.html': {'If-None-Match': 'W/"66be08b5-399232"'},
 'https://blog.adver.tools/posts/llm-app-guidelines/index.html': {'If-None-Match': 'W/"66be08c0-836d"'},
 'https://blog.adver.tools/posts/automating-python-scripts/': {'If-None-Match': 'W/"66be08c2-af9c"'},
 'https://blog.adver.tools/posts/chart-zoom-in/index.html': {'If-None-Match': 'W/"66be08a6-39330e"'},
 'https://blog.adver.tools/posts/bodytext-xpath-selector/index.html': {'If-None-Match': 'W/"66be08a2-bef4"'},
 'https://blog.adver.tools/posts/automating-python-scripts/index.html': {'If-None-Match': 'W/"66be08c2-af9c"'},
 'https://blog.adver.tools/posts/sitemap-dashboard/index.html': {'If-None-Match': 'W/"66be08a8-49c989"'},
 'https://blog.adver.tools/posts/gsc-audit-analysis/index.html': {'If-None-Match': 'W/"66be08c5-3f3288"'},
 'https://blog.adver.tools/posts/risk-of-people-trusting-ai/index.html': {'If-None-Match': 'W/"66be08c5-676b"'},
 'https://blog.adver.tools/posts/crawl-file-structure/index.html': {'If-None-Match': 'W/"66be08a4-aedf"'},
 'https://blog.adver.tools/posts/bulk-prompting/': {'If-None-Match': 'W/"66be08bc-a12f"'},
 'https://blog.adver.tools/posts/external-links/index.html': {'If-None-Match': 'W/"66be08af-3cb9f7"'},
 'https://blog.adver.tools/_files/content_guidelines.csv': {'If-None-Match': '"66771410-15ba"'},
 'https://blog.adver.tools/posts/gsc-audit-analysis/': {'If-None-Match': 'W/"66be08c5-3f3288"'}}
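
As a side note, the same dictionary can be built with a dict comprehension; calling dropna first guards against URLs whose responses didn’t include an Etag (a defensive assumption; this particular server returned one for every page):

custom_headers = {
    url: {'If-None-Match': etag}
    # drop rows with a missing Etag before building the headers dict
    for url, etag in crawldf[['url', 'resp_headers_Etag']].dropna().values
}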

This time, we will use both list mode and spider mode.

If we only supplied the home page with follow_links=True, and that page hadn’t changed, the crawler would get a 304 response with no body, find no links to follow, and stop. If other pages had been modified or added, we wouldn’t know about them. This is why we supply all available URLs, together with their respective Etags, and also set follow_links=True.

adv.crawl(
    crawldf['url'],
    'blog_full.jsonl',
    follow_links=True,
    meta={
        'crawl_type': 'update',
        'custom_headers': custom_headers
    })
crawldf = pd.read_json('blog_full.jsonl', lines=True)
crawldf['status'].value_counts()
status
200    31
304    31
Name: count, dtype: int64

We ran a second crawl that took a negligible amount of time, consumed almost no resources, and we now have a better idea about which pages were updated, and how many. Since we ran the two crawls within minutes of each other, obviously no pages were changed, and that’s why we have the same number of 200 and 304 (Not Modified) status codes (the 200 rows come from the initial crawl, since both crawls were written to the same output file).
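
Because both crawls share the output file, the crawl_type column we set makes it easy to isolate the second crawl and list any pages that actually changed (a sketch; in this run the list would be empty):

# keep only the rows produced by the second ("update") crawl
update = crawldf[crawldf['crawl_type'] == 'update']
# a page that changed since the first crawl would come back with 200 here
update.loc[update['status'] == 200, 'url'].tolist()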

Setting custom headers for a very long list of URLs

The dictionary we used was fairly small, and we encountered no issues. If we have several thousand custom headers, we might get an error. The solution is simple: instead of a dictionary, we supply the path of a Python script that contains the dictionary. The only condition is that the variable (dictionary) be named custom_headers. This can also help when we want the custom headers to be generated dynamically, as part of a larger workflow, and so on.

my_custom_headers.py
import pandas as pd

crawldf = pd.read_json('blog_full.jsonl', lines=True)
custom_headers = {}
for url, etag in crawldf[['url', 'resp_headers_Etag']].values:
    custom_headers[url] = {'If-None-Match': etag}

The code would have to be run as follows:

adv.crawl(
    crawldf['url'],
    'blog_full.jsonl',
    follow_links=True,
    meta={
        'crawl_type': 'update',
        'custom_headers': "my_custom_headers.py" 
    })

Basic JavaScript rendering using 3rd party plugins like playwright

There is limited support for rendering JS in some cases, and the setup can be done as follows.

Let’s first try normal crawling, attempting to extract the quote text from this page.

adv.crawl(
    "https://quotes.toscrape.com/js/",
    "quotes_js.jsonl",
    xpath_selectors={'quote_text': "//span[@class='text']/text()"},
)
crawldf = pd.read_json('quotes_js.jsonl', lines=True)
'quote_text' in crawldf
False

No data was extracted, and the desired column was not created.

Trying again with scrapy-playwright, but first we need to install it:

pip install scrapy-playwright
playwright install

adv.crawl(
    "https://quotes.toscrape.com/js/",
    "quotes_js.jsonl",
    meta={'playwright': True},
    xpath_selectors={'quote_text': "//span[@class='text']/text()"},
    custom_settings={
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }
)
crawldf = pd.read_json('quotes_js.jsonl', lines=True)
crawldf['quote_text']
0                                                  NaN
1    “The world as we have created it is a process ...
Name: quote_text, dtype: object
crawldf['quote_text'].str.split('@@').explode().tolist()
[nan,
 '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 '“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
 '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
 '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
 "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
 '“Try not to become a man of success. Rather become a man of value.”',
 '“It is better to be hated for what you are than to be loved for what you are not.”',
 "“I have not failed. I've just found 10,000 ways that won't work.”",
 "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
 '“A day without sunshine is like, you know, night.”']

Using a proxy while crawling

You can also supply the URL of a proxy server through the meta parameter’s proxy key, which is as simple as this:

adv.crawl(
    "https://example.com",
    "example_proxy.jsonl",
    meta={"proxy": "https://username:password@someproxy.com:10000"})

crawldf = pd.read_json('example_proxy.jsonl', lines=True)
crawldf[['proxy', 'download_slot', '_auth_proxy', 'request_headers_Proxy-Authorization']]
                         proxy  download_slot                  _auth_proxy             request_headers_Proxy-Authorization
0  https://someproxy.com:10000    example.com  https://someproxy.com:10000  Basic b4N3ZnJrxUI1Zjo3ADY3Z3qsb2k9cU5QTkE3Y4U=

Using rotating proxies

The above solution uses a single proxy, but sometimes you might want to use a random proxy from a list. You can check out the recipe for rotating proxies, which depends on the separate library scrapy-rotating-proxies.
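
A rough sketch of that setup, based on the scrapy-rotating-proxies README (the middleware paths and the ROTATING_PROXY_LIST setting come from that library; the proxy addresses are placeholders):

adv.crawl(
    'https://example.com',
    'output_file.jsonl',
    custom_settings={
        # enable the middlewares provided by scrapy-rotating-proxies
        'DOWNLOADER_MIDDLEWARES': {
            'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
            'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
        },
        # a placeholder list of proxies to rotate through
        'ROTATING_PROXY_LIST': [
            'proxy1.com:8031',
            'proxy2.com:8032',
        ],
    })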