Using the meta parameter while crawling - new in v0.16.0
This guide was written with advertools version 0.16.1.
What does this parameter do?
This parameter takes a simple dictionary and can be used for the following purposes:
- Arbitrary metadata: tracking/context information about the crawl (device used, timing, country, etc.)
- Custom request headers per URL: set different headers per URL using the special key custom_headers, with a dictionary as its value: meta={'custom_headers': {'URL_1': {'header': 'value'}, 'URL_2': {'header': 'value'}}}
- Integration with some 3rd party packages like playwright: currently this provides some limited support for crawling JavaScript websites.
- Using a proxy while crawling
Arbitrary metadata
You can create your own custom columns with any data you want, for tracking, context, reporting, and so on. The keys become column names, and their values populate those columns.
For example, running code along these lines (the original cell isn’t shown here, so this is a reconstruction; the output filename is illustrative):
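import advertools as adv
import pandas as pd

# crawl one URL and attach two arbitrary metadata fields
adv.crawl('https://example.com', 'output.jsonl',
          meta={'color': 'blue', 'day': 'Tuesday'})
crawldf = pd.read_json('output.jsonl', lines=True)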
will end up like this in the crawl DataFrame:
url | color | day |
---|---|---|
https://example.com | blue | Tuesday |
If you crawl multiple URLs, every value under “color” will be “blue”, and every value under “day” will be “Tuesday”.
Some examples for using this feature:
- User-agent human name (write “iPhone” instead of the full user agent string)
- Crawl country when using proxies
- Crawl timing (before/after launch/redirect/migration/new content etc.)
- Custom settings you used like stopping the spider after a certain number of pages, or crawling only a certain section of the website, and so on.
Custom request headers per URL
Although you can easily set custom request headers with the custom_settings parameter, those headers apply to all crawled URLs. The meta parameter lets you set different headers per URL, so you can customize your crawling at a more granular level.
Some interesting use-cases for this feature:
- Re-crawl only pages that have changed since the last crawl. This requires the server to send the Etag response header, which you can match with the If-None-Match request header. With this approach you can keep a fresh, updated copy of the website’s pages without having to re-crawl every single page.
- Re-crawl only pages that have changed using Last-Modified. This is the same as the previous option, but with a different header pair: in this case you send the If-Modified-Since request header.
- Set a different user-agent per URL.
Some examples of how this can be set:
# generic example
meta={'custom_headers': {
'URL_A': {"Header 1": "Value 1"},
'URL_B': {"Header 1": "Value 2"},
'URL_C': {"Header 1": "Value 3"},
}}
# practical simple example, using one request header
meta={'custom_headers': {
'https://example.com/A': {"If-None-Match": "Etag 1"},
'https://example.com/B': {"If-None-Match": "Etag 2"},
'https://example.com/C': {"If-None-Match": "Etag 3"},
}}
# practical example showing multiple and different headers per URL:
meta={'custom_headers': {
'https://example.com/A': {"User-Agent": "iPhone UA", "If-Modified-Since": "Mon, 19 Aug 2024 07:28:00 GMT"},
'https://example.com/B': {"User-Agent": "Chrome Desktop UA", "If-None-Match": "Etag B"},
'https://example.com/C': {"If-None-Match": "Etag C"},
}}
Custom headers examples
adv.crawl('https://blog.adver.tools', 'blog.jsonl')
crawldf = pd.read_json('blog.jsonl', lines=True)
crawldf[['url', 'resp_headers_Etag']]
 | url | resp_headers_Etag |
---|---|---|
0 | https://blog.adver.tools | W/"66be08d3-12726" |
Let’s crawl again, using this Etag in the If-None-Match request header.
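The crawl call for this step isn’t shown in the original; a minimal sketch, reusing the Etag we just received:

# advertools appends to an existing output file,
# so blog.jsonl will hold both responses
adv.crawl('https://blog.adver.tools', 'blog.jsonl',
          meta={'custom_headers': {
              'https://blog.adver.tools': {'If-None-Match': 'W/"66be08d3-12726"'}}})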
crawldf = pd.read_json('blog.jsonl', lines=True)
crawldf[['url', 'status', 'resp_headers_Etag', 'request_headers_If-None-Match', 'body_text']]
 | url | status | resp_headers_Etag | request_headers_If-None-Match | body_text |
---|---|---|---|---|---|
0 | https://blog.adver.tools | 200 | W/"66be08d3-12726" | NaN | \n \n \n \n \n Blog \n ... |
1 | https://blog.adver.tools | 304 | "66be08d3-12726" | W/"66be08d3-12726" | |
We can also use the If-Modified-Since request header, and only crawl the page if it was modified after the given date.
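The crawl cell is again missing from the original; a sketch using the Last-Modified date from the first response:

adv.crawl('https://blog.adver.tools', 'blog.jsonl',
          meta={'custom_headers': {
              'https://blog.adver.tools': {'If-Modified-Since': 'Thu, 15 Aug 2024 13:55:31 GMT'}}})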
crawldf = pd.read_json('blog.jsonl', lines=True)
crawldf[['url', 'status', 'resp_headers_Etag', 'resp_headers_Last-Modified', 'request_headers_If-None-Match', 'request_headers_If-Modified-Since']]
 | url | status | resp_headers_Etag | resp_headers_Last-Modified | request_headers_If-None-Match | request_headers_If-Modified-Since |
---|---|---|---|---|---|---|
0 | https://blog.adver.tools | 200 | W/"66be08d3-12726" | Thu, 15 Aug 2024 13:55:31 GMT | NaN | NaN |
1 | https://blog.adver.tools | 304 | "66be08d3-12726" | Thu, 15 Aug 2024 13:55:31 GMT | W/"66be08d3-12726" | NaN |
2 | https://blog.adver.tools | 304 | "66be08d3-12726" | Thu, 15 Aug 2024 13:55:31 GMT | NaN | Thu, 15 Aug 2024 13:55:31 GMT |
Let’s crawl the whole blog, and see how this can be used in bulk.
The initial crawl doesn’t need any special settings; it is in the subsequent crawls that we set them, based on what we received from previous crawls.
I’ll add arbitrary metadata as well, just to demonstrate.
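The initial crawl cell isn’t shown in the original; a minimal sketch, consistent with the crawl_type values in the output below:

# discovery crawl of the whole blog, tagged with arbitrary metadata
adv.crawl('https://blog.adver.tools', 'blog_full.jsonl',
          follow_links=True,
          meta={'crawl_type': 'initial/discovery'})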
crawldf = pd.read_json('blog_full.jsonl', lines=True)
crawldf[['url', 'status', 'crawl_type', 'resp_headers_Etag', 'resp_headers_Last-Modified']].head()
 | url | status | crawl_type | resp_headers_Etag | resp_headers_Last-Modified |
---|---|---|---|---|---|
0 | https://blog.adver.tools | 200 | initial/discovery | W/"66be08d3-12726" | Thu, 15 Aug 2024 13:55:31 GMT |
1 | https://blog.adver.tools/posts/logfile-analysi... | 200 | initial/discovery | W/"66be08c1-9f3f" | Thu, 15 Aug 2024 13:55:13 GMT |
2 | https://blog.adver.tools/posts/sitemap-request... | 200 | initial/discovery | W/"66be08a1-b243" | Thu, 15 Aug 2024 13:54:41 GMT |
3 | https://blog.adver.tools/posts/evergreen-crawl... | 200 | initial/discovery | W/"66be08a3-c222" | Thu, 15 Aug 2024 13:54:43 GMT |
4 | https://blog.adver.tools/posts/work-using-code... | 200 | initial/discovery | W/"66be08bd-9506" | Thu, 15 Aug 2024 13:55:09 GMT |
Since this server sends both an Etag and a Last-Modified header, we can use either option for subsequent crawls. Let’s crawl it again using the Etag.
# map each crawled URL to an If-None-Match request header carrying its Etag
custom_headers = {}
for url, etag in crawldf[['url', 'resp_headers_Etag']].values:
    custom_headers[url] = {'If-None-Match': etag}
custom_headers
{'https://blog.adver.tools': {'If-None-Match': 'W/"66be08d3-12726"'},
'https://blog.adver.tools/posts/logfile-analysis-cli/index.html': {'If-None-Match': 'W/"66be08c1-9f3f"'},
'https://blog.adver.tools/posts/sitemap-request-headers/index.html': {'If-None-Match': 'W/"66be08a1-b243"'},
'https://blog.adver.tools/posts/evergreen-crawling-xml-sitemaps/index.html': {'If-None-Match': 'W/"66be08a3-c222"'},
'https://blog.adver.tools/posts/work-using-code/index.html': {'If-None-Match': 'W/"66be08bd-9506"'},
'https://blog.adver.tools/about.html': {'If-None-Match': 'W/"66be08c6-58b8"'},
'https://blog.adver.tools/index.html': {'If-None-Match': 'W/"66be08d3-12726"'},
'https://blog.adver.tools/posts/prompt-engineering/index.html': {'If-None-Match': 'W/"66be08b0-12ae2"'},
'https://blog.adver.tools/posts/internal-link-analysis/index.html': {'If-None-Match': 'W/"66be08ba-521583"'},
'https://blog.adver.tools/posts/bulk-prompting/index.html': {'If-None-Match': 'W/"66be08bc-a12f"'},
'https://blog.adver.tools/posts/word-co-occurrence-matrix/index.html': {'If-None-Match': 'W/"66be08e3-388c3b"'},
'https://blog.adver.tools/posts/llm-content-evaluation/index.html': {'If-None-Match': 'W/"66be08bb-11899"'},
'https://blog.adver.tools/posts/auditing-vs-analysis/index.html': {'If-None-Match': 'W/"66be08b1-64e4"'},
'https://blog.adver.tools/posts/compare-crawls/index.html': {'If-None-Match': 'W/"66be08b7-38ad9f"'},
'https://blog.adver.tools/posts/gsc-audit-analysis/slides/index.html': {'If-None-Match': 'W/"66be08c3-3b7a93"'},
'https://blog.adver.tools/posts/programming-vs-software-dev/index.html': {'If-None-Match': 'W/"66be08b1-6cad"'},
'https://blog.adver.tools/posts/invert-mapping/index.html': {'If-None-Match': 'W/"66be08b7-b498"'},
'https://blog.adver.tools/posts/serp-analysis/index.html': {'If-None-Match': 'W/"66be08b5-399232"'},
'https://blog.adver.tools/posts/llm-app-guidelines/index.html': {'If-None-Match': 'W/"66be08c0-836d"'},
'https://blog.adver.tools/posts/automating-python-scripts/': {'If-None-Match': 'W/"66be08c2-af9c"'},
'https://blog.adver.tools/posts/chart-zoom-in/index.html': {'If-None-Match': 'W/"66be08a6-39330e"'},
'https://blog.adver.tools/posts/bodytext-xpath-selector/index.html': {'If-None-Match': 'W/"66be08a2-bef4"'},
'https://blog.adver.tools/posts/automating-python-scripts/index.html': {'If-None-Match': 'W/"66be08c2-af9c"'},
'https://blog.adver.tools/posts/sitemap-dashboard/index.html': {'If-None-Match': 'W/"66be08a8-49c989"'},
'https://blog.adver.tools/posts/gsc-audit-analysis/index.html': {'If-None-Match': 'W/"66be08c5-3f3288"'},
'https://blog.adver.tools/posts/risk-of-people-trusting-ai/index.html': {'If-None-Match': 'W/"66be08c5-676b"'},
'https://blog.adver.tools/posts/crawl-file-structure/index.html': {'If-None-Match': 'W/"66be08a4-aedf"'},
'https://blog.adver.tools/posts/bulk-prompting/': {'If-None-Match': 'W/"66be08bc-a12f"'},
'https://blog.adver.tools/posts/external-links/index.html': {'If-None-Match': 'W/"66be08af-3cb9f7"'},
'https://blog.adver.tools/_files/content_guidelines.csv': {'If-None-Match': '"66771410-15ba"'},
'https://blog.adver.tools/posts/gsc-audit-analysis/': {'If-None-Match': 'W/"66be08c5-3f3288"'}}
This time, we will use both list mode and spider mode.
If we only supplied the home page and set follow_links=True, and that page had not changed, the crawler would not crawl it, and crawling would stop; if some pages had been modified or added, we wouldn’t know about them. This is why we supply all available URLs, together with their respective Etags, and also set follow_links=True.
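The crawl cell for this step isn’t in the original; a minimal sketch, appending to the same output file so both crawls can be compared (the etag_recrawl label is illustrative):

# re-crawl every known URL with its Etag, and follow links to catch new pages
adv.crawl(list(custom_headers), 'blog_full.jsonl',
          follow_links=True,
          meta={'custom_headers': custom_headers,
                'crawl_type': 'etag_recrawl'})
crawldf = pd.read_json('blog_full.jsonl', lines=True)
crawldf['status'].value_counts()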
status
200 31
304 31
Name: count, dtype: int64
We ran a second crawl that took a negligible amount of time and consumed almost no resources, and we now have a better idea of which pages were updated, and how many. Since we ran the two crawls within minutes of each other, obviously no pages had changed, which is why we have the same number of 200 and 304 (Not Modified) status codes.
Setting custom headers for a very long list of URLs
The dictionary we used was fairly small, and we encountered no issues. With several thousand custom headers, however, we might get an error. The solution is simple: instead of a dictionary, we supply the path of a Python script that contains the dictionary. The only condition is that the variable (the dictionary) be named custom_headers. This can also help when we want the custom headers to be generated dynamically, as part of a larger workflow, and so on.
my_custom_headers.py
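The script’s contents aren’t shown in the original; it would contain a dictionary named custom_headers, something like this (URLs and Etags are placeholders):

# my_custom_headers.py
custom_headers = {
    'https://example.com/A': {'If-None-Match': 'Etag A'},
    'https://example.com/B': {'If-None-Match': 'Etag B'},
    # ...thousands more entries, possibly generated dynamically
}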
The code would then be run as follows (a sketch; the path and URL list are placeholders):
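# url_list stands in for the URLs you want to crawl
adv.crawl(url_list, 'output.jsonl',
          meta={'custom_headers': '/path/to/my_custom_headers.py'})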
Basic JavaScript rendering using 3rd party plugins like playwright
There is limited support for rendering JavaScript in some cases; the setup can be done as follows.
Let’s first try normal crawling and extracting the quote text from this page (the original cell isn’t shown; the following is a reconstruction using the same XPath selector as the playwright example further below):
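# non-rendered crawl; JS-generated content won't be in the HTML
# (writing to the same file the playwright crawl will append to)
adv.crawl('https://quotes.toscrape.com/js/', 'quotes_js.jsonl',
          xpath_selectors={'quote_text': "//span[@class='text']/text()"})
crawldf = pd.read_json('quotes_js.jsonl', lines=True)
'quote_text' in crawldf.columns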
No data was extracted, and the desired column was not created.
Trying again with scrapy-playwright, but first we need to install it.
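The install commands aren’t shown in the original; typically this would be:

pip install scrapy-playwright
playwright install chromium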
adv.crawl(
    "https://quotes.toscrape.com/js/",
    "quotes_js.jsonl",
    meta={'playwright': True},  # render this request with playwright
    xpath_selectors={'quote_text': "//span[@class='text']/text()"},
    custom_settings={
        # route HTTP(S) requests through scrapy-playwright's download handler
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        # scrapy-playwright requires the asyncio reactor
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
    }
)
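Reading the output file (this cell isn’t shown in the original); since both crawls wrote to the same file, row 0 is the non-rendered attempt:

crawldf = pd.read_json('quotes_js.jsonl', lines=True)
crawldf['quote_text']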
0 NaN
1 “The world as we have created it is a process ...
Name: quote_text, dtype: object
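advertools joins multiple matches of a selector with “@@”; splitting on it gives the individual quotes as a flat list:

crawldf['quote_text'].str.split('@@').explode().tolist()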
[nan,
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
'“It is our choices, Harry, that show what we truly are, far more than our abilities.”',
'“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”',
'“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”',
"“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”",
'“Try not to become a man of success. Rather become a man of value.”',
'“It is better to be hated for what you are than to be loved for what you are not.”',
"“I have not failed. I've just found 10,000 ways that won't work.”",
"“A woman is like a tea bag; you never know how strong it is until it's in hot water.”",
'“A day without sunshine is like, you know, night.”']
Using a proxy while crawling
You can also supply the URL of a proxy server to the meta parameter, through the proxy key, which is as simple as this:
adv.crawl(
    "https://example.com",
    "example_proxy.jsonl",
    meta={"proxy": "https://username:password@someproxy.com:10000"})
crawldf = pd.read_json('example_proxy.jsonl', lines=True)
crawldf[['proxy', 'download_slot', '_auth_proxy', 'request_headers_Proxy-Authorization']]
 | proxy | download_slot | _auth_proxy | request_headers_Proxy-Authorization |
---|---|---|---|---|
0 | https://someproxy.com:10000 | example.com | https://someproxy.com:10000 | Basic b4N3ZnJrxUI1Zjo3ADY3Z3qsb2k9cU5QTkE3Y4U= |
Using rotating proxies
The above solution uses a single proxy, but sometimes you might want to use a random proxy from a list. You can check out the recipe for rotating proxies, which depends on the separate library scrapy-rotating-proxies.
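A minimal sketch of wiring that library in through custom_settings (the proxy URLs are placeholders; consult the scrapy-rotating-proxies documentation for the full options):

adv.crawl(
    'https://example.com', 'output.jsonl',
    custom_settings={
        # enable the rotating-proxies middlewares
        'DOWNLOADER_MIDDLEWARES': {
            'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
            'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
        },
        # the proxy pool to rotate through (placeholders)
        'ROTATING_PROXY_LIST': [
            'proxy1.example.com:8031',
            'proxy2.example.com:8032',
        ],
    })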