Comparing Crawls

Various ways to compare two crawls of the same website based on the common URLs and your choice of page element (title, h1, meta description, etc.).
crawling
scraping
advertools
v0.15
Author

Elias Dabbas

Published

July 18, 2024

Crawling the website

We crawl the same website twice and obtain two output files.

import advertools as adv
import pandas as pd

adv.crawl(
    url_list='https://supermetrics.com/',
    output_file='supermetrics_crawl1.jl', # replace "1" with "2" in the second crawl
    follow_links=True,
    custom_settings={
        'LOG_FILE': 'supermetrics_crawl1.log' # replace "1" with "2" in the second crawl
    })

df1 = pd.read_json('supermetrics_crawl1.jl', lines=True)
df2 = pd.read_json('supermetrics_crawl2.jl', lines=True)

Check the time difference

Get the maximum value of crawl_time in each file, which shows the moment each crawl ended.

df1['crawl_time'].max(), df2['crawl_time'].max()
(Timestamp('2024-07-05 10:47:28'), Timestamp('2024-07-15 23:05:01'))

Optionally, we can subtract those values to get the time difference in days, hours, minutes, etc.

df2['crawl_time'].max() - df1['crawl_time'].max()
Timedelta('10 days 12:17:33')
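If a single number is easier to work with, the Timedelta can be converted with standard pandas methods (a small aside, not part of the original code):

delta = df2['crawl_time'].max() - df1['crawl_time'].max()
delta.days                    # 10
delta.total_seconds() / 3600  # total hours, roughly 252.3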

Compare the sizes of the resulting files (DataFrames)

df1.shape, df2.shape
((2622, 172), (2524, 165))
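The row counts differ because the two crawls discovered slightly different sets of URLs, and the column counts differ because advertools creates columns dynamically based on what it finds on the pages (response headers, JSON-LD, etc.). To see which columns appear in only one of the crawls (a small aside using plain set operations):

set(df1.columns).difference(df2.columns)
set(df2.columns).difference(df1.columns)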

Compare crawled URLs

Using Python sets, we can see URLs in crawl1 and not in crawl2, and vice versa. We can also get the common URLs that were crawled both times.
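The md function used in the following snippets isn't defined above; it's presumably a small helper for rendering Markdown in the notebook. A minimal sketch, assuming IPython is available:

from IPython.display import Markdown, display

def md(text):
    # render a string as Markdown in the notebook output
    display(Markdown(text))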

df1_not_df2 = set(df1['url']).difference(df2['url'])
df2_not_df1 = set(df2['url']).difference(df1['url'])
common_urls = set(df1['url']).intersection(df2['url'])
md(f'#### URLs in `crawl1` not in `crawl2`: {len(df1_not_df2):,}')
md(f'#### URLs in `crawl2` not in `crawl1`: {len(df2_not_df1):,}')
md(f'#### Common URLs: {len(common_urls):,}')

URLs in crawl1 not in crawl2: 213

URLs in crawl2 not in crawl1: 115

Common URLs: 2,409

Comparing text elements

The adv.crawlytics.compare function takes two DataFrames and a column name to compare.

Selecting title, we get the URLs where the title has changed between the two crawls. There is typically no need to get the ones that haven’t changed, but you can include them by setting keep_equal=True.

compare_title = adv.crawlytics.compare(df1, df2, 'title')
compare_title
url title_x title_y
0 https://supermetrics.com/blog/google-ads-metrics The 5 Google Ads metrics you really should be tracking - Supermetrics Google Ads metrics: The KPIs that matter most and how to improve them - Supermetrics
1 https://supermetrics.com/case-studies/supermetrics-planetart Supermetrics & PlanetArt | Customer Case Study - Supermetrics 404 - Page not found - Supermetrics
2 https://community.supermetrics.com/events/product-updates-july-2024-16 Join the conversation | Supermetrics Community Product Updates: July 2024, Tue, 2 Jul. 2024 at 16:00, Europe/Helsinki | Supermetrics Community

Similarity ratio

Checking which elements have changed is easy, but a single comma or space difference is enough for two elements to count as “different”. One way to handle this is to measure how big the difference is.

Using Python’s difflib, we can measure the similarity ratio between two strings. This ratio varies between zero (nothing in common) and one (exactly the same). This way we know in which cases we have very big (or small) changes.
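For instance, two hypothetical titles that differ by a single character (shown only to illustrate the ratio):

import difflib
difflib.SequenceMatcher(a='Pricing - Supermetrics', b='Pricing | Supermetrics').ratio()  # ~0.95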

import difflib

def similarity_ratio(a, b):
    """Return the similarity ratio (0 to 1) for each pair of strings in a and b."""
    ratios = []
    for item_a, item_b in zip(a, b):
        matcher = difflib.SequenceMatcher(a=item_a, b=item_b)
        ratios.append(matcher.ratio())
    return ratios

compare_title['similarity'] = similarity_ratio(compare_title['title_x'], compare_title['title_y'])
compare_title.sort_values('similarity')
url title_x title_y similarity
2 https://community.supermetrics.com/events/product-updates-july-2024-16 Join the conversation | Supermetrics Community Product Updates: July 2024, Tue, 2 Jul. 2024 at 16:00, Europe/Helsinki | Supermetrics Community 0.425532
1 https://supermetrics.com/case-studies/supermetrics-planetart Supermetrics & PlanetArt | Customer Case Study - Supermetrics 404 - Page not found - Supermetrics 0.500000
0 https://supermetrics.com/blog/google-ads-metrics The 5 Google Ads metrics you really should be tracking - Supermetrics Google Ads metrics: The KPIs that matter most and how to improve them - Supermetrics 0.575163

Sorting by similarity, we can set priorities and work through the changes accordingly.

Note

Similarity is calculated by string matching; no semantic checks are made.

To get even better insights while comparing, we can also get the matching blocks across the compared documents. The matching blocks are lists of substrings that exist in both strings and match one another. A helper for this is sketched below, followed by an example:
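The get_matching_blocks function isn't shown in the original snippet; a minimal sketch consistent with the output below, built on difflib.SequenceMatcher.get_matching_blocks, could look like this:

import difflib

def get_matching_blocks(a, b):
    """For each pair of strings, return the list of substrings that match between them."""
    blocks_list = []
    for item_a, item_b in zip(a, b):
        matcher = difflib.SequenceMatcher(a=item_a, b=item_b)
        blocks = matcher.get_matching_blocks()
        blocks_list.append(
            [item_a[block.a:block.a + block.size] for block in blocks if block.size])
    return blocks_list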

string_a = 'one two three four five'
string_b = 'one three four six two'
get_matching_blocks([string_a], [string_b])
[['one', ' three four ', 'i']]

As you can see, the matching blocks are “one”, “ three four ”, and “i”. This helps show what hasn’t changed and, together with the similarity ratio, makes it easier to assess the changes. It isn’t perfect: it didn’t show “two”, for example, even though it is a common substring. This is how the algorithm is designed, so keep that in mind.

Combining both similarity ratio and matching blocks:

compare_h2 = adv.crawlytics.compare(df1, df2, 'h2')
compare_h2['similarity'] = similarity_ratio(compare_h2['h2_x'], compare_h2['h2_y'])
h2_matching_blocks = get_matching_blocks(compare_h2['h2_x'], compare_h2['h2_y'])
compare_h2['matching_blocks'] = h2_matching_blocks
compare_h2.sort_values('similarity').head()
url h2_x h2_y similarity matching_blocks
189 https://supermetrics.com/solutions/b2b-saas Supermetrics for B2B SaaS@@Spend time pulling insights, not data@@Spot and react to sudden changes@@Flexible solutio... Fueling insights for 200K+ companies in 120 countries@@Why B2B SaaS companies choose Supermetrics@@Spend time pullin... 0.230872 [Fueling insights for 200K+ companies in 120 countries@@, B2B SaaS companies , ow, @@Create ]
194 https://supermetrics.com/solutions/smb Supermetrics for SMB@@Get data without chaos@@Make real-time decisions@@Solution that grows with you@@Which destinat... Fueling insights for 200K+ companies in 120 countries@@Supermetrics for SMB@@Get data without chaos@@Make real-time ... 0.282989 [Fueling insights for 200K+ companies in 120 countries@@, SMB, @@Create one source of truth@@]
20 https://supermetrics.com/blog/google-ads-metrics 1. Impressions@@2. Clicks @@3. Cost@@4. Conversions@@5. Click-through rate (CTR)@@Conclusion@@About the author@@Turn... 1. Impressions@@2. Clicks @@3. Cost@@Cost per click (CPC)@@4. Click-through rate (CTR)@@5. Conversions @@6. Search i... 0.309735 [1. Impressions@@2. Clicks @@3. Cost@@, 4, . Click-through rate (CTR)@@, A, @@T]
190 https://supermetrics.com/solutions/data-teams Supermetrics for data teams@@No maintenance, no worries@@Go after the truth@@Secure data transfers@@Which destinatio... Fueling insights for 200K+ companies in 120 countries@@Why data teams choose Supermetrics?@@No maintenance, no worri... 0.314935 [Fueling insights for 200K+ companies in 120 countries@@, How data teams use Supermetrics to grow@@C]
186 https://supermetrics.com/solutions/seo Supermetrics for SEO@@Get a complete view on SEO performance@@No more manual data collection@@Identify organic growt... Fueling insights for 200K+ companies in 120 countries@@Why SEO teams choose Supermetrics?@@Get a complete view on SE... 0.347354 [Fueling insights for 200K+ companies in 120 countries@@, Here’s how you can use Supermetrics for SEO@@, Create one ...

Here are two examples of h2 tags that have changed: the first has a low similarity ratio, and the second a high one.

for url, h2_x, h2_y, similarity, blocks in compare_h2.iloc[189:190, :].values:
    md('### URL:')
    md(url)
    md('### Similarity ratio:')
    md(f'{similarity:.1%}')
    md(f'### h2 (crawl 1):')
    md(h2_x)
    md(f'### h2 (crawl 2):')
    md(h2_y)
    md('### Matching blocks:')
    md(blocks)

URL:

https://supermetrics.com/solutions/b2b-saas

Similarity ratio:

23.1%

h2 (crawl 1):

Supermetrics for B2B SaaS@@Spend time pulling insights, not data@@Spot and react to sudden changes@@Flexible solution for any data needs@@Which destination is right for you?@@Fueling insights for 200K+ companies in 120 countries@@Here’s how B2B SaaS companies use Supermetrics to grow@@Create one source of truth@@

h2 (crawl 2):

Fueling insights for 200K+ companies in 120 countries@@Why B2B SaaS companies choose Supermetrics@@Spend time pulling insights, not data@@Spot and react to sudden changes@@Flexible solution for any data needs@@Here’s how B2B SaaS companies use Supermetrics to grow@@Build a better marketing strategy@@Enable better testing experiences@@Create full-funnel reporting@@Which destination is right for you?@@Create one source of truth@@

Matching blocks:

Fueling insights for 200K+ companies in 120 countries@@ B2B SaaS companies ow@@Create

for url, h2_x, h2_y, similarity, blocks in compare_h2.iloc[177:178, :].values:
    md('### URL:')
    md(url)
    md('### Similarity ratio:')
    md(f'{similarity:.1%}')
    md(f'### h2 (crawl 1):')
    md(h2_x)
    md(f'### h2 (crawl 2):')
    md(h2_y)
    md('### Matching blocks:')
    md(blocks)

URL:

https://supermetrics.com/podcasts/the-power-of-competitive-intelligence-in-b2b-marketing-with-vollrath

Similarity ratio:

94.4%

h2 (crawl 1):

You’ll learn@@Subscribe to the Marketing Intelligence Show@@Transcript@@Turn your marketing data into opportunity

h2 (crawl 2):

You’ll learn@@Subscribe to the Marketing Intelligence Show@@Turn your marketing data into opportunity

Matching blocks:

You’ll learn@@Subscribe to the Marketing Intelligence Show@@Turn your marketing data into opportunity

Comparing numeric values

df1.select_dtypes('number').head(3)
size download_timeout download_latency depth status redirect_times redirect_ttl resp_headers_X-Ratelimit-Limit resp_headers_X-Ratelimit-Reset resp_headers_X-Ratelimit-Total resp_headers_X-Ratelimit-Used-Currentrequest resp_headers_X-Envoy-Upstream-Service-Time resp_headers_X-Kong-Proxy-Latency resp_headers_X-Kong-Upstream-Latency resp_headers_X-Runtime jsonld_interactionStatistic.userInteractionCount jsonld_mainEntity.upvoteCount jsonld_mainEntity.answerCount jsonld_mainEntity.acceptedAnswer resp_headers_X-Edge-Cache-Ttl resp_headers_Content-Length og:image:width og:image:height jsonld_mainEntity.acceptedAnswer.upvoteCount
0 263394 180 0.747089 0 200 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 228214 180 0.541921 1 200 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 232773 180 0.911505 1 200 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

If the selected column is numeric, like size for example, you get diff and diff_perc columns showing the absolute and relative (percentage) changes, respectively.

Typically almost all page sizes will have changed, even if by a few bytes, so it is more informative to check for changes above a certain threshold.

Here we check for pages where size changed by more than 20%.

compare_size = adv.crawlytics.compare(df1, df2, 'size')
compare_size[compare_size['diff_perc'].abs().gt(0.2)].sample(5)
url size_x size_y diff diff_perc
1792 https://supermetrics.com/blog?page=25 633430 972266 338836 0.534923
1907 https://community.supermetrics.com/data-visualization-32/channel-mix-dashboard-template-92 148151 93603 -54548 -0.368192
1882 https://community.supermetrics.com/tips-and-tricks-42/chatterblast-uncovers-the-shift-to-brand-awareness-building-cl... 84616 157621 73005 0.862780
1900 https://supermetrics.com/case-studies?category=259 3141775 3817884 676109 0.215200
2160 https://community.supermetrics.com/ask-the-community-43/no-link-for-product-updates-july-2024-174 174938 122253 -52685 -0.301164
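As a sanity check on how these columns relate (an assumption based on the values above, not documented behavior), diff appears to be size_y - size_x and diff_perc the change relative to the first crawl:

# first row above: https://supermetrics.com/blog?page=25
972266 - 633430             # 338836, matches diff
(972266 - 633430) / 633430  # ~0.5349, matches diff_perc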

Check pages where download_latency changed by more than 50%.

compare_latency = adv.crawlytics.compare(df1, df2, 'download_latency')
compare_latency[compare_latency['diff_perc'].abs().gt(0.5)].sample(5)
url download_latency_x download_latency_y diff diff_perc
2311 https://supermetrics.com/careers/join-supermetrics 1.212018 1.820851 0.608833 0.502330
407 https://support.supermetrics.com/support/solutions/articles/19000098352-how-to-enroll-a-data-source-in-bigquery 3.787312 0.291624 -3.495688 -0.923000
1019 https://support.supermetrics.com/support/solutions/articles/19000111395-optimizely-authentication-and-reauthenticati... 4.027495 0.951625 -3.075870 -0.763718
1907 https://supermetrics.com/blog/category/marketing-analytics?page=5 1.137533 2.014322 0.876789 0.770781
1464 https://support.supermetrics.com/support/solutions/articles/19000159890-google-analytics-universal-analytics-sunset-... 0.522869 1.654083 1.131214 2.163476

Locating pages with big changes

Count the top /dir_1/ values to see whether the pages that changed are concentrated in a certain section of the website or evenly spread across it.

import adviz  # used for the value_counts chart below

target_difference = 0.5
compare_latency_urldf = adv.url_to_df(
    compare_latency[compare_latency['diff_perc'].abs().gt(target_difference)]['url'])
fig = adviz.value_counts(
    compare_latency_urldf['dir_1'],
    title='Top /dir_1/ where download latency changed by more than 50%')
for data in fig.data:
    data.hoverinfo = 'skip'
fig