Comparing Crawls

Various ways to compare two crawls of the same website based on the common URLs and your choice of page element (title, h1, meta description, etc.).
crawling
scraping
advertools
v0.15
Author

Elias Dabbas

Published

July 18, 2024

Crawling the website

We crawl the same website twice and obtain two output files.

import advertools as adv
import pandas as pd

adv.crawl(
    url_list='https://supermetrics.com/',
    output_file='supermetrics_crawl1.jl', # replace "1" with "2" in the second crawl
    follow_links=True,
    custom_settings={
        'LOG_FILE': 'supermetrics_crawl1.log' # replace "1" with "2" in the second crawl
    })

df1 = pd.read_json('supermetrics_crawl1.jl', lines=True)
df2 = pd.read_json('supermetrics_crawl2.jl', lines=True)

Check the time difference

Get the maximum value of crawl_time in each file, which shows the moment each crawl ended.

df1['crawl_time'].max(), df2['crawl_time'].max()
(Timestamp('2024-07-05 10:47:28'), Timestamp('2024-07-15 23:05:01'))

Optionally, we can subtract those values to get the time difference in days, hours, minutes, etc.

df2['crawl_time'].max() - df1['crawl_time'].max()
Timedelta('10 days 12:17:33')
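If a single number is easier to work with, the Timedelta can be converted with standard pandas methods (a small aside, not part of the original code):

delta = df2['crawl_time'].max() - df1['crawl_time'].max()
delta.days                    # 10
delta.total_seconds() / 3600  # total hours, roughly 252.3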

Compare the sizes of the resulting files (DataFrames)

df1.shape, df2.shape
((2622, 172), (2524, 165))
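The row counts differ because the two crawls discovered slightly different sets of URLs, and the column counts differ because advertools creates columns dynamically based on what it finds on the pages (response headers, JSON-LD, etc.). To see which columns appear in only one of the crawls (a small aside using plain set operations):

set(df1.columns).difference(df2.columns)
set(df2.columns).difference(df1.columns)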

Compare crawled URLs

Using Python sets, we can see URLs in crawl1 and not in crawl2, and vice versa. We can also get the common URLs that were crawled both times.
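The md function used in the following snippets isn't defined above; it's presumably a small helper for rendering Markdown in the notebook. A minimal sketch, assuming IPython is available:

from IPython.display import Markdown, display

def md(text):
    # render a string as Markdown in the notebook output
    display(Markdown(text))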

df1_not_df2 = set(df1['url']).difference(df2['url'])
df2_not_df1 = set(df2['url']).difference(df1['url'])
common_urls = set(df1['url']).intersection(df2['url'])
md(f'#### URLs in `crawl1` not in `crawl2`: {len(df1_not_df2):,}')
md(f'#### URLs in `crawl2` not in `crawl1`: {len(df2_not_df1):,}')
md(f'#### Common URLs: {len(common_urls):,}')

URLs in crawl1 not in crawl2: 213

URLs in crawl2 not in crawl1: 115

Common URLs: 2,409

Comparing text elements

The adv.crawlytics.compare function takes two DataFrames and a column name to compare.

Selecting title, we get the URLs where the title has changed between the two crawls. There is typically no need to get the ones that haven’t changed, but you can include them by setting keep_equal=True.

compare_title = adv.crawlytics.compare(df1, df2, 'title')
compare_title
url title_x title_y
0 https://supermetrics.com/blog/google-ads-metrics The 5 Google Ads metrics you really should be tracking - Supermetrics Google Ads metrics: The KPIs that matter most and how to improve them - Supermetrics
1 https://supermetrics.com/case-studies/supermetrics-planetart Supermetrics & PlanetArt | Customer Case Study - Supermetrics 404 - Page not found - Supermetrics
2 https://community.supermetrics.com/events/product-updates-july-2024-16 Join the conversation | Supermetrics Community Product Updates: July 2024, Tue, 2 Jul. 2024 at 16:00, Europe/Helsinki | Supermetrics Community

Similarity ratio

Checking which elements have changed is easy, but a single comma or space difference is enough for two elements to count as “different”. One way to handle this is to measure how big the difference is.

Using Python’s difflib, we can measure the similarity ratio between two strings. This ratio varies between zero (nothing in common) and one (exactly the same). This way we know in which cases we have very big (or small) changes.
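For instance, two hypothetical titles that differ by a single character (shown only to illustrate the ratio):

import difflib
difflib.SequenceMatcher(a='Pricing - Supermetrics', b='Pricing | Supermetrics').ratio()  # ~0.95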

import difflib

def similarity_ratio(a, b):
    """Return the similarity ratio (0 to 1) for each pair of strings in a and b."""
    ratios = []
    for item_a, item_b in zip(a, b):
        matcher = difflib.SequenceMatcher(a=item_a, b=item_b)
        ratios.append(matcher.ratio())
    return ratios

compare_title['similarity'] = similarity_ratio(compare_title['title_x'], compare_title['title_y'])
compare_title.sort_values('similarity')
url title_x title_y similarity
2 https://community.supermetrics.com/events/product-updates-july-2024-16 Join the conversation | Supermetrics Community Product Updates: July 2024, Tue, 2 Jul. 2024 at 16:00, Europe/Helsinki | Supermetrics Community 0.425532
1 https://supermetrics.com/case-studies/supermetrics-planetart Supermetrics & PlanetArt | Customer Case Study - Supermetrics 404 - Page not found - Supermetrics 0.500000
0 https://supermetrics.com/blog/google-ads-metrics The 5 Google Ads metrics you really should be tracking - Supermetrics Google Ads metrics: The KPIs that matter most and how to improve them - Supermetrics 0.575163

Sorting by similarity, we can set priorities and work through the changes accordingly.

Note

Similarity is calculated by string matching; no semantic checks are made.

To get even better insights while comparing, we can also get the matching blocks across the compared documents. The matching blocks are lists of substrings that exist in both strings and match one another. A helper for this is sketched below, followed by an example:
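The get_matching_blocks function isn't shown in the original snippet; a minimal sketch consistent with the output below, built on difflib.SequenceMatcher.get_matching_blocks, could look like this:

import difflib

def get_matching_blocks(a, b):
    """For each pair of strings, return the list of substrings that match between them."""
    blocks_list = []
    for item_a, item_b in zip(a, b):
        matcher = difflib.SequenceMatcher(a=item_a, b=item_b)
        blocks = matcher.get_matching_blocks()
        blocks_list.append(
            [item_a[block.a:block.a + block.size] for block in blocks if block.size])
    return blocks_list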

string_a = 'one two three four five'
string_b = 'one three four six two'
get_matching_blocks([string_a], [string_b])
[['one', ' three four ', 'i']]

As you can see, the matching blocks are “one”, “ three four ”, and “i”. This helps show what hasn’t changed and, together with the similarity ratio, makes it easier to assess the changes. It isn’t perfect: it didn’t show “two”, for example, even though it is a common substring. This is how the algorithm is designed, so keep that in mind.

Combining both similarity ratio and matching blocks:

compare_h2 = adv.crawlytics.compare(df1, df2, 'h2')
compare_h2['similarity'] = similarity_ratio(compare_h2['h2_x'], compare_h2['h2_y'])
h2_matching_blocks = get_matching_blocks(compare_h2['h2_x'], compare_h2['h2_y'])
compare_h2['matching_blocks'] = h2_matching_blocks
compare_h2.sort_values('similarity').head()
url h2_x h2_y similarity matching_blocks
189 https://supermetrics.com/solutions/b2b-saas Supermetrics for B2B SaaS@@Spend time pulling insights, not data@@Spot and react to sudden changes@@Flexible solutio... Fueling insights for 200K+ companies in 120 countries@@Why B2B SaaS companies choose Supermetrics@@Spend time pullin... 0.230872 [Fueling insights for 200K+ companies in 120 countries@@, B2B SaaS companies , ow, @@Create ]
194 https://supermetrics.com/solutions/smb Supermetrics for SMB@@Get data without chaos@@Make real-time decisions@@Solution that grows with you@@Which destinat... Fueling insights for 200K+ companies in 120 countries@@Supermetrics for SMB@@Get data without chaos@@Make real-time ... 0.282989 [Fueling insights for 200K+ companies in 120 countries@@, SMB, @@Create one source of truth@@]
20 https://supermetrics.com/blog/google-ads-metrics 1. Impressions@@2. Clicks @@3. Cost@@4. Conversions@@5. Click-through rate (CTR)@@Conclusion@@About the author@@Turn... 1. Impressions@@2. Clicks @@3. Cost@@Cost per click (CPC)@@4. Click-through rate (CTR)@@5. Conversions @@6. Search i... 0.309735 [1. Impressions@@2. Clicks @@3. Cost@@, 4, . Click-through rate (CTR)@@, A, @@T]
190 https://supermetrics.com/solutions/data-teams Supermetrics for data teams@@No maintenance, no worries@@Go after the truth@@Secure data transfers@@Which destinatio... Fueling insights for 200K+ companies in 120 countries@@Why data teams choose Supermetrics?@@No maintenance, no worri... 0.314935 [Fueling insights for 200K+ companies in 120 countries@@, How data teams use Supermetrics to grow@@C]
186 https://supermetrics.com/solutions/seo Supermetrics for SEO@@Get a complete view on SEO performance@@No more manual data collection@@Identify organic growt... Fueling insights for 200K+ companies in 120 countries@@Why SEO teams choose Supermetrics?@@Get a complete view on SE... 0.347354 [Fueling insights for 200K+ companies in 120 countries@@, Here’s how you can use Supermetrics for SEO@@, Create one ...

Here are two examples of h2 tags that have changed: the first has a low similarity ratio, and the second a high one.

for url, h2_x, h2_y, similarity, blocks in compare_h2.iloc[189:190, :].values:
    md('### URL:')
    md(url)
    md('### Similarity ratio:')
    md(f'{similarity:.1%}')
    md(f'### h2 (crawl 1):')
    md(h2_x)
    md(f'### h2 (crawl 2):')
    md(h2_y)
    md('### Matching blocks:')
    md(blocks)

URL:

https://supermetrics.com/solutions/b2b-saas

Similarity ratio:

23.1%

h2 (crawl 1):

Supermetrics for B2B SaaS@@Spend time pulling insights, not data@@Spot and react to sudden changes@@Flexible solution for any data needs@@Which destination is right for you?@@Fueling insights for 200K+ companies in 120 countries@@Here’s how B2B SaaS companies use Supermetrics to grow@@Create one source of truth@@

h2 (crawl 2):

Fueling insights for 200K+ companies in 120 countries@@Why B2B SaaS companies choose Supermetrics@@Spend time pulling insights, not data@@Spot and react to sudden changes@@Flexible solution for any data needs@@Here’s how B2B SaaS companies use Supermetrics to grow@@Build a better marketing strategy@@Enable better testing experiences@@Create full-funnel reporting@@Which destination is right for you?@@Create one source of truth@@

Matching blocks:

Fueling insights for 200K+ companies in 120 countries@@ B2B SaaS companies ow@@Create

for url, h2_x, h2_y, similarity, blocks in compare_h2.iloc[177:178, :].values:
    md('### URL:')
    md(url)
    md('### Similarity ratio:')
    md(f'{similarity:.1%}')
    md(f'### h2 (crawl 1):')
    md(h2_x)
    md(f'### h2 (crawl 2):')
    md(h2_y)
    md('### Matching blocks:')
    md(blocks)

URL:

https://supermetrics.com/podcasts/the-power-of-competitive-intelligence-in-b2b-marketing-with-vollrath

Similarity ratio:

94.4%

h2 (crawl 1):

You’ll learn@@Subscribe to the Marketing Intelligence Show@@Transcript@@Turn your marketing data into opportunity

h2 (crawl 2):

You’ll learn@@Subscribe to the Marketing Intelligence Show@@Turn your marketing data into opportunity

Matching blocks:

You’ll learn@@Subscribe to the Marketing Intelligence Show@@Turn your marketing data into opportunity

Comparing numeric values

df1.select_dtypes('number').head(3)
size download_timeout download_latency depth status redirect_times redirect_ttl resp_headers_X-Ratelimit-Limit resp_headers_X-Ratelimit-Reset resp_headers_X-Ratelimit-Total resp_headers_X-Ratelimit-Used-Currentrequest resp_headers_X-Envoy-Upstream-Service-Time resp_headers_X-Kong-Proxy-Latency resp_headers_X-Kong-Upstream-Latency resp_headers_X-Runtime jsonld_interactionStatistic.userInteractionCount jsonld_mainEntity.upvoteCount jsonld_mainEntity.answerCount jsonld_mainEntity.acceptedAnswer resp_headers_X-Edge-Cache-Ttl resp_headers_Content-Length og:image:width og:image:height jsonld_mainEntity.acceptedAnswer.upvoteCount
0 263394 180 0.747089 0 200 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 228214 180 0.541921 1 200 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 232773 180 0.911505 1 200 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

If the selected column is numeric, like size for example, you get diff and diff_perc columns showing the absolute and relative (percentage) changes, respectively.

Typically almost all page sizes will have changed, even if by a few bytes, so it is more informative to check for changes above a certain threshold.

Here we check for pages where size changed by more than 20%.

compare_size = adv.crawlytics.compare(df1, df2, 'size')
compare_size[compare_size['diff_perc'].abs().gt(0.2)].sample(5)
url size_x size_y diff diff_perc
1792 https://supermetrics.com/blog?page=25 633430 972266 338836 0.534923
1907 https://community.supermetrics.com/data-visualization-32/channel-mix-dashboard-template-92 148151 93603 -54548 -0.368192
1882 https://community.supermetrics.com/tips-and-tricks-42/chatterblast-uncovers-the-shift-to-brand-awareness-building-cl... 84616 157621 73005 0.862780
1900 https://supermetrics.com/case-studies?category=259 3141775 3817884 676109 0.215200
2160 https://community.supermetrics.com/ask-the-community-43/no-link-for-product-updates-july-2024-174 174938 122253 -52685 -0.301164
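As a sanity check on how these columns relate (an assumption based on the values above, not documented behavior), diff appears to be size_y - size_x and diff_perc the change relative to the first crawl:

# first row above: https://supermetrics.com/blog?page=25
972266 - 633430             # 338836, matches diff
(972266 - 633430) / 633430  # ~0.5349, matches diff_perc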

Check pages where download_latency changed by more than 50%.

compare_latency = adv.crawlytics.compare(df1, df2, 'download_latency')
compare_latency[compare_latency['diff_perc'].abs().gt(0.5)].sample(5)
url download_latency_x download_latency_y diff diff_perc
2311 https://supermetrics.com/careers/join-supermetrics 1.212018 1.820851 0.608833 0.502330
407 https://support.supermetrics.com/support/solutions/articles/19000098352-how-to-enroll-a-data-source-in-bigquery 3.787312 0.291624 -3.495688 -0.923000
1019 https://support.supermetrics.com/support/solutions/articles/19000111395-optimizely-authentication-and-reauthenticati... 4.027495 0.951625 -3.075870 -0.763718
1907 https://supermetrics.com/blog/category/marketing-analytics?page=5 1.137533 2.014322 0.876789 0.770781
1464 https://support.supermetrics.com/support/solutions/articles/19000159890-google-analytics-universal-analytics-sunset-... 0.522869 1.654083 1.131214 2.163476

Locating pages with big changes

Count the top /dir_1/ values to see whether the pages that changed are concentrated in a certain section of the website or evenly spread across it.

import adviz  # used for the value_counts chart below

target_difference = 0.5
compare_latency_urldf = adv.url_to_df(
    compare_latency[compare_latency['diff_perc'].abs().gt(target_difference)]['url'])
fig = adviz.value_counts(
    compare_latency_urldf['dir_1'],
    title='Top /dir_1/ where download latency changed by more than 50%')
for data in fig.data:
    data.hoverinfo = 'skip'
fig