Using the meta parameter while crawling - new in v0.16.0
This guide was written with advertools version 0.16.1.
What does this parameter do?
This parameter takes a simple dictionary and can be used for the following purposes:
- Arbitrary metadata: tracking/context information about the crawl (device used, timing, country, etc.)
- Custom request headers per URL: set different headers per URL using the special key custom_headers, with a dictionary as its value: meta={'custom_headers': {'URL_1': {'header': 'value'}, 'URL_2': {'header': 'value'}}}
- Integration with some 3rd party packages like playwright: currently this provides some limited support for crawling JavaScript websites.
- Using a proxy while crawling
Arbitrary metadata
You can create your own custom columns with any data you want, for tracking, context, reporting, and so on. The keys become column names, and their values populate those columns.
For example, running code along these lines (the original cell isn’t shown here, so this is a reconstruction; the output filename is illustrative):
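import advertools as adv
import pandas as pd

# crawl one URL and attach two arbitrary metadata fields
adv.crawl('https://example.com', 'output.jsonl',
          meta={'color': 'blue', 'day': 'Tuesday'})
crawldf = pd.read_json('output.jsonl', lines=True)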
will end up like this in the crawl DataFrame:
url | color | day |
---|---|---|
https://example.com | blue | Tuesday |
If you crawl multiple URLs, every value under “color” will be “blue”, and every value under “day” will be “Tuesday”.
Some examples for using this feature:
- User-agent human name (write “iPhone” instead of the full user agent string)
- Crawl country when using proxies
- Crawl timing (before/after launch/redirect/migration/new content etc.)
- Custom settings you used like stopping the spider after a certain number of pages, or crawling only a certain section of the website, and so on.
Custom request headers per URL
Although you can easily set custom request headers with the custom_settings parameter, those headers apply to all crawled URLs. The meta parameter lets you set different headers per URL, so you can customize your crawling at a more granular level.
Some interesting use-cases for this feature:
- Re-crawl only pages that have changed since the last crawl. This requires the server to send the Etag response header, which you can match with the If-None-Match request header. With this approach you can keep a fresh, updated copy of the website’s pages without having to re-crawl every single page.
- Re-crawl only pages that have changed using Last-Modified. This is the same as the previous option, but with a different header pair: in this case you send the If-Modified-Since request header.
- Set a different user-agent per URL.
Some examples of how this can be set:
# generic example
meta={'custom_headers': {
'URL_A': {"Header 1": "Value 1"},
'URL_B': {"Header 1": "Value 2"},
'URL_C': {"Header 1": "Value 3"},
}}
# practical simple example, using one request header
meta={'custom_headers': {
'https://example.com/A': {"If-None-Match": "Etag 1"},
'https://example.com/B': {"If-None-Match": "Etag 2"},
'https://example.com/C': {"If-None-Match": "Etag 3"},
}}
# practical example showing multiple and different headers per URL:
meta={'custom_headers': {
'https://example.com/A': {"User-Agent": "iPhone UA", "If-Modified-Since": "Mon, 19 Aug 2024 07:28:00 GMT"},
'https://example.com/B': {"User-Agent": "Chrome Desktop UA", "If-None-Match": "Etag B"},
'https://example.com/C': {"If-None-Match": "Etag C"},
}}
Custom headers examples
adv.crawl('https://blog.adver.tools', 'blog.jsonl')
crawldf = pd.read_json('blog.jsonl', lines=True)
crawldf[['url', 'resp_headers_Etag']]
 | url | resp_headers_Etag |
---|---|---|
0 | https://blog.adver.tools | W/"66be08d3-12726" |
Let’s crawl again, using this Etag in the If-None-Match request header.
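The crawl call for this step isn’t shown in the original; a minimal sketch, reusing the Etag we just received:

# advertools appends to an existing output file,
# so blog.jsonl will hold both responses
adv.crawl('https://blog.adver.tools', 'blog.jsonl',
          meta={'custom_headers': {
              'https://blog.adver.tools': {'If-None-Match': 'W/"66be08d3-12726"'}}})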
crawldf = pd.read_json('blog.jsonl', lines=True)
crawldf[['url', 'status', 'resp_headers_Etag', 'request_headers_If-None-Match', 'body_text']]
 | url | status | resp_headers_Etag | request_headers_If-None-Match | body_text |
---|---|---|---|---|---|
0 | https://blog.adver.tools | 200 | W/"66be08d3-12726" | NaN | \n \n \n \n \n Blog \n ... |
1 | https://blog.adver.tools | 304 | "66be08d3-12726" | W/"66be08d3-12726" | |
We can also use the If-Modified-Since request header, and only crawl the page if it was modified after the given date.
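The crawl cell is again missing from the original; a sketch using the Last-Modified date from the first response:

adv.crawl('https://blog.adver.tools', 'blog.jsonl',
          meta={'custom_headers': {
              'https://blog.adver.tools': {'If-Modified-Since': 'Thu, 15 Aug 2024 13:55:31 GMT'}}})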
crawldf = pd.read_json('blog.jsonl', lines=True)
crawldf[['url', 'status', 'resp_headers_Etag', 'resp_headers_Last-Modified', 'request_headers_If-None-Match', 'request_headers_If-Modified-Since']]
 | url | status | resp_headers_Etag | resp_headers_Last-Modified | request_headers_If-None-Match | request_headers_If-Modified-Since |
---|---|---|---|---|---|---|
0 | https://blog.adver.tools | 200 | W/"66be08d3-12726" | Thu, 15 Aug 2024 13:55:31 GMT | NaN | NaN |
1 | https://blog.adver.tools | 304 | "66be08d3-12726" | Thu, 15 Aug 2024 13:55:31 GMT | W/"66be08d3-12726" | NaN |
2 | https://blog.adver.tools | 304 | "66be08d3-12726" | Thu, 15 Aug 2024 13:55:31 GMT | NaN | Thu, 15 Aug 2024 13:55:31 GMT |
Let’s crawl the whole blog, and see how this can be used in bulk.
The initial crawl doesn’t need any special settings; it is in the subsequent crawls that we set them, based on what we received from previous crawls.
I’ll add arbitrary metadata as well, just to demonstrate.
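The initial crawl cell isn’t shown in the original; a minimal sketch, consistent with the crawl_type values in the output below:

# discovery crawl of the whole blog, tagged with arbitrary metadata
adv.crawl('https://blog.adver.tools', 'blog_full.jsonl',
          follow_links=True,
          meta={'crawl_type': 'initial/discovery'})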
crawldf = pd.read_json('blog_full.jsonl', lines=True)
crawldf[['url', 'status', 'crawl_type', 'resp_headers_Etag', 'resp_headers_Last-Modified']].head()
 | url | status | crawl_type | resp_headers_Etag | resp_headers_Last-Modified |
---|---|---|---|---|---|
0 | https://blog.adver.tools | 200 | initial/discovery | W/"66be08d3-12726" | Thu, 15 Aug 2024 13:55:31 GMT |
1 | https://blog.adver.tools/posts/logfile-analysi... | 200 | initial/discovery | W/"66be08c1-9f3f" | Thu, 15 Aug 2024 13:55:13 GMT |
2 | https://blog.adver.tools/posts/sitemap-request... | 200 | initial/discovery | W/"66be08a1-b243" | Thu, 15 Aug 2024 13:54:41 GMT |
3 | https://blog.adver.tools/posts/evergreen-crawl... | 200 | initial/discovery | W/"66be08a3-c222" | Thu, 15 Aug 2024 13:54:43 GMT |
4 | https://blog.adver.tools/posts/work-using-code... | 200 | initial/discovery | W/"66be08bd-9506" | Thu, 15 Aug 2024 13:55:09 GMT |
Since this server sends both an Etag and a Last-Modified header, we can use either option for subsequent crawls. Let’s crawl it again using the Etag.
# map each crawled URL to an If-None-Match request header carrying its Etag
custom_headers = {}
for url, etag in crawldf[['url', 'resp_headers_Etag']].values:
    custom_headers[url] = {'If-None-Match': etag}
custom_headers
{'https://blog.adver.tools': {'If-None-Match': 'W/"66be08d3-12726"'},
'https://blog.adver.tools/posts/logfile-analysis-cli/index.html': {'If-None-Match': 'W/"66be08c1-9f3f"'},
'https://blog.adver.tools/posts/sitemap-request-headers/index.html': {'If-None-Match': 'W/"66be08a1-b243"'},
'https://blog.adver.tools/posts/evergreen-crawling-xml-sitemaps/index.html': {'If-None-Match': 'W/"66be08a3-c222"'},
'https://blog.adver.tools/posts/work-using-code/index.html': {'If-None-Match': 'W/"66be08bd-9506"'},
'https://blog.adver.tools/about.html': {'If-None-Match': 'W/"66be08c6-58b8"'},
'https://blog.adver.tools/index.html': {'If-None-Match': 'W/"66be08d3-12726"'},
'https://blog.adver.tools/posts/prompt-engineering/index.html': {'If-None-Match': 'W/"66be08b0-12ae2"'},
'https://blog.adver.tools/posts/internal-link-analysis/index.html': {'If-None-Match': 'W/"66be08ba-521583"'},
'https://blog.adver.tools/posts/bulk-prompting/index.html': {'If-None-Match': 'W/"66be08bc-a12f"'},
'https://blog.adver.tools/posts/word-co-occurrence-matrix/index.html': {'If-None-Match': 'W/"66be08e3-388c3b"'},
'https://blog.adver.tools/posts/llm-content-evaluation/index.html': {'If-None-Match': 'W/"66be08bb-11899"'},
'https://blog.adver.tools/posts/auditing-vs-analysis/index.html': {'If-None-Match': 'W/"66be08b1-64e4"'},
'https://blog.adver.tools/posts/compare-crawls/index.html': {'If-None-Match': 'W/"66be08b7-38ad9f"'},
'https://blog.adver.tools/posts/gsc-audit-analysis/slides/index.html': {'If-None-Match': 'W/"66be08c3-3b7a93"'},
'https://blog.adver.tools/posts/programming-vs-software-dev/index.html': {'If-None-Match': 'W/"66be08b1-6cad"'},
'https://blog.adver.tools/posts/invert-mapping/index.html': {'If-None-Match': 'W/"66be08b7-b498"'},
'https://blog.adver.tools/posts/serp-analysis/index.html': {'If-None-Match': 'W/"66be08b5-399232"'},
'https://blog.adver.tools/posts/llm-app-guidelines/index.html': {'If-None-Match': 'W/"66be08c0-836d"'},
'https://blog.adver.tools/posts/automating-python-scripts/': {'If-None-Match': 'W/"66be08c2-af9c"'},
'https://blog.adver.tools/posts/chart-zoom-in/index.html': {'If-None-Match': 'W/"66be08a6-39330e"'},
'https://blog.adver.tools/posts/bodytext-xpath-selector/index.html': {'If-None-Match': 'W/"66be08a2-bef4"'},
'https://blog.adver.tools/posts/automating-python-scripts/index.html': {'If-None-Match': 'W/"66be08c2-af9c"'},
'https://blog.adver.tools/posts/sitemap-dashboard/index.html': {'If-None-Match': 'W/"66be08a8-49c989"'},
'https://blog.adver.tools/posts/gsc-audit-analysis/index.html': {'If-None-Match': 'W/"66be08c5-3f3288"'},
'https://blog.adver.tools/posts/risk-of-people-trusting-ai/index.html': {'If-None-Match': 'W/"66be08c5-676b"'},
'https://blog.adver.tools/posts/crawl-file-structure/index.html': {'If-None-Match': 'W/"66be08a4-aedf"'},
'https://blog.adver.tools/posts/bulk-prompting/': {'If-None-Match': 'W/"66be08bc-a12f"'},
'https://blog.adver.tools/posts/external-links/index.html': {'If-None-Match': 'W/"66be08af-3cb9f7"'},
'https://blog.adver.tools/_files/content_guidelines.csv': {'If-None-Match': '"66771410-15ba"'},
'https://blog.adver.tools/posts/gsc-audit-analysis/': {'If-None-Match': 'W/"66be08c5-3f3288"'}}
This time, we will use both list mode and spider mode.
If we only supplied the home page and set follow_links=True, and that page had not changed, the crawler would not crawl it, and crawling would stop; if some pages had been modified or added, we wouldn’t know about them. This is why we supply all available URLs, together with their respective Etags, and also set follow_links=True.
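The crawl cell for this step isn’t in the original; a minimal sketch, appending to the same output file so both crawls can be compared (the etag_recrawl label is illustrative):

# re-crawl every known URL with its Etag, and follow links to catch new pages
adv.crawl(list(custom_headers), 'blog_full.jsonl',
          follow_links=True,
          meta={'custom_headers': custom_headers,
                'crawl_type': 'etag_recrawl'})
crawldf = pd.read_json('blog_full.jsonl', lines=True)
crawldf['status'].value_counts()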
status
200 31
304 31
Name: count, dtype: int64
We ran a second crawl that took a negligible amount of time and consumed almost no resources, and we now have a better idea of which pages were updated, and how many. Since we ran the two crawls within minutes of each other, obviously no pages had changed, which is why we have the same number of 200 and 304 (Not Modified) status codes.
Setting custom headers for a very long list of URLs
The dictionary we used was fairly small, and we encountered no issues. With several thousand custom headers, however, we might get an error. The solution is simple: instead of a dictionary, we supply the path of a Python script that contains the dictionary. The only condition is that the variable (the dictionary) be named custom_headers. This can also help when we want the custom headers to be generated dynamically, as part of a larger workflow, and so on.
my_custom_headers.py
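The script’s contents aren’t shown in the original; it would contain a dictionary named custom_headers, something like this (URLs and Etags are placeholders):

# my_custom_headers.py
custom_headers = {
    'https://example.com/A': {'If-None-Match': 'Etag A'},
    'https://example.com/B': {'If-None-Match': 'Etag B'},
    # ...thousands more entries, possibly generated dynamically
}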
The code would then be run as follows (a sketch; the path and URL list are placeholders):
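# url_list stands in for the URLs you want to crawl
adv.crawl(url_list, 'output.jsonl',
          meta={'custom_headers': '/path/to/my_custom_headers.py'})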
Basic JavaScript rendering using 3rd party plugins like playwright
There is limited support for rendering JavaScript in some cases; the setup can be done as follows.
Let’s first try normal crawling and extracting the quote text from this page (the original cell isn’t shown; the following is a reconstruction using the same XPath selector as the playwright example further below):
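# non-rendered crawl; JS-generated content won't be in the HTML
# (writing to the same file the playwright crawl will append to)
adv.crawl('https://quotes.toscrape.com/js/', 'quotes_js.jsonl',
          xpath_selectors={'quote_text': "//span[@class='text']/text()"})
crawldf = pd.read_json('quotes_js.jsonl', lines=True)
'quote_text' in crawldf.columns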
No data was extracted, and the desired column was not created.
Trying again with scrapy-playwright, but first we need to install it.
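The install commands aren’t shown in the original; typically this would be:

pip install scrapy-playwright
playwright install chromium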
adv.crawl(
    "https://quotes.toscrape.com/js/",
    "quotes_js.jsonl",
    meta={'playwright': True},  # render this request with playwright
    xpath_selectors={'quote_text': "//span[@class='text']/text()"},
    custom_settings={
        # route HTTP(S) requests through scrapy-playwright's download handler
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        # scrapy-playwright requires the asyncio reactor
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
    }
)
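Reading the output file (this cell isn’t shown in the original); since both crawls wrote to the same file, row 0 is the non-rendered attempt:

crawldf = pd.read_json('quotes_js.jsonl', lines=True)
crawldf['quote_text']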
0 NaN
1 “The world as we have created it is a process ...
Name: quote_text, dtype: object
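advertools joins multiple matches of a selector with “@@”; splitting on it gives the individual quotes as a flat list:

crawldf['quote_text'].str.split('@@').explode().tolist()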
[nan,
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
'“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
'“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
'“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
"“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
'“Try not to become a man of success. Rather become a man of value.”',
'“It is better to be hated for what you are than to be loved for what you are not.”',
"“I have not failed. I've just found 10,000 ways that won't work.”",
"“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
'“A day without sunshine is like, you know, night.”']
Using a proxy while crawling
You can also supply the URL of a proxy server to the meta parameter, through the proxy key, which is as simple as this:
adv.crawl(
    "https://example.com",
    "example_proxy.jsonl",
    meta={"proxy": "https://username:password@someproxy.com:10000"})
crawldf = pd.read_json('example_proxy.jsonl', lines=True)
crawldf[['proxy', 'download_slot', '_auth_proxy', 'request_headers_Proxy-Authorization']]
 | proxy | download_slot | _auth_proxy | request_headers_Proxy-Authorization |
---|---|---|---|---|
0 | https://someproxy.com:10000 | example.com | https://someproxy.com:10000 | Basic b4N3ZnJrxUI1Zjo3ADY3Z3qsb2k9cU5QTkE3Y4U= |
Using rotating proxies
The above solution uses a single proxy, but sometimes you might want to use a random proxy from a list. You can check out the recipe for rotating proxies, which depends on the separate library scrapy-rotating-proxies.
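A minimal sketch of wiring that library in through custom_settings (the proxy URLs are placeholders; consult the scrapy-rotating-proxies documentation for the full options):

adv.crawl(
    'https://example.com', 'output.jsonl',
    custom_settings={
        # enable the rotating-proxies middlewares
        'DOWNLOADER_MIDDLEWARES': {
            'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
            'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
        },
        # the proxy pool to rotate through (placeholders)
        'ROTATING_PROXY_LIST': [
            'proxy1.example.com:8031',
            'proxy2.example.com:8032',
        ],
    })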