(Timestamp('2024-07-05 10:47:28'), Timestamp('2024-07-15 23:05:01'))
Crawling the website
We crawl the same website twice, and obtain the two output files.
import advertools as adv
import pandas as pd
adv.crawl(
    url_list='https://supermetrics.com/',
    output_file='supermetrics_crawl1.jl',  # replace "1" with "2" in the second crawl
    follow_links=True,
    custom_settings={
        'LOG_FILE': 'supermetrics_crawl1.log'  # replace "1" with "2" in the second crawl
    })
df1 = pd.read_json('supermetrics_crawl1.jl', lines=True)
df2 = pd.read_json('supermetrics_crawl2.jl', lines=True)
Check the time difference
Get the maximum value of crawl_time in each file, which shows the moment each crawl ended.
Optionally, we can subtract those values to get the time difference in days, hours, minutes, etc.
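As a rough sketch of that check, assuming crawl_time was parsed as datetimes (pandas' read_json normally does this for column names ending in "_time"), the snippet below uses two tiny stand-in DataFrames instead of the real crawl files. The end timestamps are the ones reported at the top of this page; the earlier rows are made up for illustration.

```python
import pandas as pd

# Stand-in data: only the last timestamp in each frame comes from the real crawls,
# the earlier ones are invented for illustration.
df1 = pd.DataFrame({'crawl_time': pd.to_datetime(
    ['2024-07-05 10:30:00', '2024-07-05 10:47:28'])})
df2 = pd.DataFrame({'crawl_time': pd.to_datetime(
    ['2024-07-15 22:50:00', '2024-07-15 23:05:01'])})

# The latest crawl_time in each file marks the end of that crawl
crawl1_end = df1['crawl_time'].max()
crawl2_end = df2['crawl_time'].max()

# Subtracting two Timestamps gives a Timedelta
time_difference = crawl2_end - crawl1_end
print(crawl1_end, crawl2_end)
print(time_difference)
```

The resulting Timedelta exposes attributes such as .days and .total_seconds() if a single unit is needed.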
Compare the sizes of the resulting files (DataFrames)
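A minimal sketch of this comparison, using DataFrame.shape on two small made-up frames rather than the real crawl files. Assuming each row is one crawled page and each column one extracted field, a change in either number hints at pages or page elements being added or removed.

```python
import pandas as pd

# Made-up stand-ins for the two crawl DataFrames
df1 = pd.DataFrame({
    'url': ['https://example.com/', 'https://example.com/a'],
    'title': ['Home', 'A'],
})
df2 = pd.DataFrame({
    'url': ['https://example.com/', 'https://example.com/a', 'https://example.com/b'],
    'title': ['Home', 'A', 'B'],
    'h2': ['Welcome', 'About A', 'About B'],
})

# shape is a (rows, columns) tuple
rows_diff = df2.shape[0] - df1.shape[0]
cols_diff = df2.shape[1] - df1.shape[1]
print(df1.shape, df2.shape)
print(f'rows: {rows_diff:+}, columns: {cols_diff:+}')
```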
Compare crawled URLs
Using Python sets, we can see URLs in crawl1 and not in crawl2, and vice versa. We can also get the common URLs that were crawled both times.
df1_not_df2 = set(df1['url']).difference(df2['url'])
df2_not_df1 = set(df2['url']).difference(df1['url'])
common_urls = set(df1['url']).intersection(df2['url'])
md(f'#### URLs in `crawl1` not in `crawl2`: {len(df1_not_df2):,}')
md(f'#### URLs in `crawl2` not in `crawl1`: {len(df2_not_df1):,}')
md(f'#### Common URLs: {len(common_urls):,}')
URLs in crawl1 not in crawl2: 213
URLs in crawl2 not in crawl1: 115
Common URLs: 2,409
Comparing text elements
The adv.crawlytics.compare function takes two DataFrames and a column name to compare.
Selecting title, we get the URLs whose titles changed between the two crawls. There is typically no need to keep the ones that haven’t changed, but you can change this behavior by setting keep_equal=True.
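The call that produces the table below is not shown in this section; judging by the h2 example further down, it is presumably compare_title = adv.crawlytics.compare(df1, df2, 'title'). To illustrate the idea without running a crawl, here is a rough pandas sketch of the same kind of comparison. The compare_column function and the example.com data are hypothetical stand-ins, not advertools' actual implementation.

```python
import pandas as pd

def compare_column(df1, df2, column, keep_equal=False):
    """Hypothetical stand-in: pair the values of `column` by URL across two crawls."""
    # An inner merge keeps only URLs present in both crawls
    merged = pd.merge(
        df1[['url', column]],
        df2[['url', column]],
        on='url',
        suffixes=('_x', '_y'),
    )
    if keep_equal:
        return merged
    # Keep only rows where the value differs between crawls
    # (note: this simple check treats two missing values as "changed")
    changed = merged[f'{column}_x'] != merged[f'{column}_y']
    return merged[changed].reset_index(drop=True)

crawl1 = pd.DataFrame({
    'url': ['https://example.com/a', 'https://example.com/b', 'https://example.com/c'],
    'title': ['A', 'B', 'C'],
})
crawl2 = pd.DataFrame({
    'url': ['https://example.com/a', 'https://example.com/b', 'https://example.com/d'],
    'title': ['A', 'B (new)', 'D'],
})

changed_titles = compare_column(crawl1, crawl2, 'title')
all_common = compare_column(crawl1, crawl2, 'title', keep_equal=True)
print(changed_titles)
```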
| | url | title_x | title_y |
---|---|---|---|
0 | https://supermetrics.com/blog/google-ads-metrics | The 5 Google Ads metrics you really should be tracking - Supermetrics | Google Ads metrics: The KPIs that matter most and how to improve them - Supermetrics |
1 | https://supermetrics.com/case-studies/supermetrics-planetart | Supermetrics & PlanetArt | Customer Case Study - Supermetrics | 404 - Page not found - Supermetrics |
2 | https://community.supermetrics.com/events/product-updates-july-2024-16 | Join the conversation | Supermetrics Community | Product Updates: July 2024, Tue, 2 Jul. 2024 at 16:00, Europe/Helsinki | Supermetrics Community |
Similarity ratio
It’s not difficult to check for the elements that have changed. But a single comma or space difference would result in the elements being seen as “different”. One way to handle this is to measure how much of a difference there is.
Using Python’s difflib, we can measure the similarity ratio between two strings. This ratio varies between zero (nothing in common) and one (the exact same). This way we know in which cases we have very big (or small) changes.
import difflib
def similarity_ratio(a, b):
    # Return the difflib similarity ratio for each pair of strings in a and b
    ratios = []
    for item_a, item_b in zip(a, b):
        matcher = difflib.SequenceMatcher(a=item_a, b=item_b)
        ratios.append(matcher.ratio())
    return ratios
compare_title['similarity'] = similarity_ratio(compare_title['title_x'], compare_title['title_y'])
compare_title.sort_values('similarity')
| | url | title_x | title_y | similarity |
---|---|---|---|---|
2 | https://community.supermetrics.com/events/product-updates-july-2024-16 | Join the conversation | Supermetrics Community | Product Updates: July 2024, Tue, 2 Jul. 2024 at 16:00, Europe/Helsinki | Supermetrics Community | 0.425532 |
1 | https://supermetrics.com/case-studies/supermetrics-planetart | Supermetrics & PlanetArt | Customer Case Study - Supermetrics | 404 - Page not found - Supermetrics | 0.500000 |
0 | https://supermetrics.com/blog/google-ads-metrics | The 5 Google Ads metrics you really should be tracking - Supermetrics | Google Ads metrics: The KPIs that matter most and how to improve them - Supermetrics | 0.575163 |
Sorting by similarity we can start to set our priorities and work accordingly.
Note that similarity is calculated by string matching, and makes no semantic checks.
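A small illustration of both points, using made-up strings: a one-character punctuation change keeps the ratio close to one, while two phrases with roughly the same meaning but different wording score low, because only characters are compared.

```python
import difflib

def ratio(a, b):
    # Similarity ratio between two strings, from 0 (nothing in common) to 1 (identical)
    return difflib.SequenceMatcher(a=a, b=b).ratio()

# A single missing comma: technically "different", but almost identical
punctuation_change = ratio('Hello, world', 'Hello world')

# Similar meaning, different wording: string matching sees little in common
reworded = ratio('Buy now', 'Purchase today')

print(f'{punctuation_change:.3f}')
print(f'{reworded:.3f}')
```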
To get even better insights while comparing, we can also get the matching blocks across the compared documents. The matching blocks are lists of substrings that exist in both strings and match one another. For example:
string_a = 'one two three four five'
string_b = 'one three four six two'
get_matching_blocks([string_a], [string_b])
[['one', ' three four ', 'i']]
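As you can see, the matching blocks are “one”, “ three four ”, and “i”. This shows what hasn’t changed and, together with the similarity ratio, makes the comparison easier to interpret. It’s not perfect: it didn’t show “two”, for example, even though it is a common substring. This is how the algorithm is designed: it finds the longest common block first and then looks for further matches only to its left and right, so a substring that moved to a different position can be missed. Keep this in mind.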
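The get_matching_blocks helper called above is not defined in this section. One possible implementation, mirroring the similarity_ratio function and built on difflib.SequenceMatcher, could look like the sketch below; treat it as an assumption about how the helper works rather than the original code.

```python
import difflib

def get_matching_blocks(a, b):
    """For each pair of strings in a and b, return the list of substrings they share."""
    blocks = []
    for item_a, item_b in zip(a, b):
        matcher = difflib.SequenceMatcher(a=item_a, b=item_b)
        # Each match has a start position in item_a (.a) and a length (.size);
        # the final match is always a zero-length sentinel, so skip empty ones
        blocks.append([
            item_a[match.a:match.a + match.size]
            for match in matcher.get_matching_blocks()
            if match.size > 0
        ])
    return blocks

string_a = 'one two three four five'
string_b = 'one three four six two'
result = get_matching_blocks([string_a], [string_b])
print(result)
```

With the two example strings this reproduces the three blocks discussed above.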
Combining both similarity ratio and matching blocks:
compare_h2 = adv.crawlytics.compare(df1, df2, 'h2')
compare_h2['similarity'] = similarity_ratio(compare_h2['h2_x'], compare_h2['h2_y'])
h2_matching_blocks = get_matching_blocks(compare_h2['h2_x'], compare_h2['h2_y'])
compare_h2['matching_blocks'] = h2_matching_blocks
compare_h2.sort_values('similarity').head()
| | url | h2_x | h2_y | similarity | matching_blocks |
---|---|---|---|---|---|
189 | https://supermetrics.com/solutions/b2b-saas | Supermetrics for B2B SaaS@@Spend time pulling insights, not data@@Spot and react to sudden changes@@Flexible solutio... | Fueling insights for 200K+ companies in 120 countries@@Why B2B SaaS companies choose Supermetrics@@Spend time pullin... | 0.230872 | [Fueling insights for 200K+ companies in 120 countries@@, B2B SaaS companies , ow, @@Create ] |
194 | https://supermetrics.com/solutions/smb | Supermetrics for SMB@@Get data without chaos@@Make real-time decisions@@Solution that grows with you@@Which destinat... | Fueling insights for 200K+ companies in 120 countries@@Supermetrics for SMB@@Get data without chaos@@Make real-time ... | 0.282989 | [Fueling insights for 200K+ companies in 120 countries@@, SMB, @@Create one source of truth@@] |
20 | https://supermetrics.com/blog/google-ads-metrics | 1. Impressions@@2. Clicks @@3. Cost@@4. Conversions@@5. Click-through rate (CTR)@@Conclusion@@About the author@@Turn... | 1. Impressions@@2. Clicks @@3. Cost@@Cost per click (CPC)@@4. Click-through rate (CTR)@@5. Conversions @@6. Search i... | 0.309735 | [1. Impressions@@2. Clicks @@3. Cost@@, 4, . Click-through rate (CTR)@@, A, @@T] |
190 | https://supermetrics.com/solutions/data-teams | Supermetrics for data teams@@No maintenance, no worries@@Go after the truth@@Secure data transfers@@Which destinatio... | Fueling insights for 200K+ companies in 120 countries@@Why data teams choose Supermetrics?@@No maintenance, no worri... | 0.314935 | [Fueling insights for 200K+ companies in 120 countries@@, How data teams use Supermetrics to grow@@C] |
186 | https://supermetrics.com/solutions/seo | Supermetrics for SEO@@Get a complete view on SEO performance@@No more manual data collection@@Identify organic growt... | Fueling insights for 200K+ companies in 120 countries@@Why SEO teams choose Supermetrics?@@Get a complete view on SE... | 0.347354 | [Fueling insights for 200K+ companies in 120 countries@@, Here’s how you can use Supermetrics for SEO@@, Create one ... |
Here are two examples of h2 tags that have changed: the first has a low similarity ratio, and the second a high one.
for url, h2_x, h2_y, similarity, blocks in compare_h2.iloc[189:190, :].values:
    md('### URL:')
    md(url)
    md('### Similarity ratio:')
    md(f'{similarity:.1%}')
    md('### h2 (crawl 1):')
    md(h2_x)
    md('### h2 (crawl 2):')
    md(h2_y)
    md('### Matching blocks:')
    md(blocks)
URL:
https://supermetrics.com/solutions/b2b-saas
Similarity ratio:
23.1%
h2 (crawl 1):
Supermetrics for B2B SaaS@@Spend time pulling insights, not data@@Spot and react to sudden changes@@Flexible solution for any data needs@@Which destination is right for you?@@Fueling insights for 200K+ companies in 120 countries@@Here’s how B2B SaaS companies use Supermetrics to grow@@Create one source of truth@@
h2 (crawl 2):
Fueling insights for 200K+ companies in 120 countries@@Why B2B SaaS companies choose Supermetrics@@Spend time pulling insights, not data@@Spot and react to sudden changes@@Flexible solution for any data needs@@Here’s how B2B SaaS companies use Supermetrics to grow@@Build a better marketing strategy@@Enable better testing experiences@@Create full-funnel reporting@@Which destination is right for you?@@Create one source of truth@@
Matching blocks:
Fueling insights for 200K+ companies in 120 countries@@ B2B SaaS companies ow@@Create
for url, h2_x, h2_y, similarity, blocks in compare_h2.iloc[177:178, :].values:
    md('### URL:')
    md(url)
    md('### Similarity ratio:')
    md(f'{similarity:.1%}')
    md('### h2 (crawl 1):')
    md(h2_x)
    md('### h2 (crawl 2):')
    md(h2_y)
    md('### Matching blocks:')
    md(blocks)
URL:
https://supermetrics.com/podcasts/the-power-of-competitive-intelligence-in-b2b-marketing-with-vollrath
Similarity ratio:
94.4%
h2 (crawl 1):
You’ll learn@@Subscribe to the Marketing Intelligence Show@@Transcript@@Turn your marketing data into opportunity
h2 (crawl 2):
You’ll learn@@Subscribe to the Marketing Intelligence Show@@Turn your marketing data into opportunity
Matching blocks:
You’ll learn@@Subscribe to the Marketing Intelligence Show@@Turn your marketing data into opportunity
Comparing numeric values
| | size | download_timeout | download_latency | depth | status | redirect_times | redirect_ttl | resp_headers_X-Ratelimit-Limit | resp_headers_X-Ratelimit-Reset | resp_headers_X-Ratelimit-Total | resp_headers_X-Ratelimit-Used-Currentrequest | resp_headers_X-Envoy-Upstream-Service-Time | resp_headers_X-Kong-Proxy-Latency | resp_headers_X-Kong-Upstream-Latency | resp_headers_X-Runtime | jsonld_interactionStatistic.userInteractionCount | jsonld_mainEntity.upvoteCount | jsonld_mainEntity.answerCount | jsonld_mainEntity.acceptedAnswer | resp_headers_X-Edge-Cache-Ttl | resp_headers_Content-Length | og:image:width | og:image:height | jsonld_mainEntity.acceptedAnswer.upvoteCount |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 263394 | 180 | 0.747089 | 0 | 200 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 228214 | 180 | 0.541921 | 1 | 200 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 232773 | 180 | 0.911505 | 1 | 200 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
If the selected column is numeric, like size for example, you also get diff and diff_perc columns showing the absolute and relative (percentage) changes respectively.
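Judging by the values in the tables on this page, diff is the second crawl's value minus the first's, and diff_perc is that difference divided by the first crawl's value (so 0.5 means a 50% increase). The sketch below checks this reading against two size pairs taken from the size comparison table; the change function is a hypothetical helper, not part of advertools.

```python
def change(old, new):
    """Return (absolute change, relative change) from old to new."""
    diff = new - old
    diff_perc = diff / old
    return diff, diff_perc

# (size_x, size_y) pairs taken from the size comparison table on this page
blog_page = change(633430, 972266)
dashboard_page = change(148151, 93603)

print(blog_page)
print(dashboard_page)
```

A negative diff_perc means the page got smaller, which is why the filters on this page take the absolute value before applying a threshold.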
Typically almost all page sizes will have changed, even if by a few bytes, so it is more informative to check for changes above a certain threshold.
Here we check for the pages where size changed by more than 20% in either direction.
compare_size = adv.crawlytics.compare(df1, df2, 'size')
compare_size[compare_size['diff_perc'].abs().gt(0.2)].sample(5)
| | url | size_x | size_y | diff | diff_perc |
---|---|---|---|---|---|
1792 | https://supermetrics.com/blog?page=25 | 633430 | 972266 | 338836 | 0.534923 |
1907 | https://community.supermetrics.com/data-visualization-32/channel-mix-dashboard-template-92 | 148151 | 93603 | -54548 | -0.368192 |
1882 | https://community.supermetrics.com/tips-and-tricks-42/chatterblast-uncovers-the-shift-to-brand-awareness-building-cl... | 84616 | 157621 | 73005 | 0.862780 |
1900 | https://supermetrics.com/case-studies?category=259 | 3141775 | 3817884 | 676109 | 0.215200 |
2160 | https://community.supermetrics.com/ask-the-community-43/no-link-for-product-updates-july-2024-174 | 174938 | 122253 | -52685 | -0.301164 |
Check pages where download_latency changed by more than 50%.
compare_latency = adv.crawlytics.compare(df1, df2, 'download_latency')
compare_latency[compare_latency['diff_perc'].abs().gt(0.5)].sample(5)
| | url | download_latency_x | download_latency_y | diff | diff_perc |
---|---|---|---|---|---|
2311 | https://supermetrics.com/careers/join-supermetrics | 1.212018 | 1.820851 | 0.608833 | 0.502330 |
407 | https://support.supermetrics.com/support/solutions/articles/19000098352-how-to-enroll-a-data-source-in-bigquery | 3.787312 | 0.291624 | -3.495688 | -0.923000 |
1019 | https://support.supermetrics.com/support/solutions/articles/19000111395-optimizely-authentication-and-reauthenticati... | 4.027495 | 0.951625 | -3.075870 | -0.763718 |
1907 | https://supermetrics.com/blog/category/marketing-analytics?page=5 | 1.137533 | 2.014322 | 0.876789 | 0.770781 |
1464 | https://support.supermetrics.com/support/solutions/articles/19000159890-google-analytics-universal-analytics-sunset-... | 0.522869 | 1.654083 | 1.131214 | 2.163476 |
Locating pages with big changes
Count the top /dir_1/ values to see whether the pages that changed are concentrated in a certain section of the website or spread evenly across it.
import adviz

target_difference = 0.5
compare_latency_urldf = adv.url_to_df(
    compare_latency[compare_latency['diff_perc'].abs().gt(target_difference)]['url'])
fig = adviz.value_counts(
    compare_latency_urldf['dir_1'],
    title='Top /dir_1/ where download latency changed by more than 50%')
# Disable hover tooltips on every trace
for data in fig.data:
    data.hoverinfo = "skip"
fig
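To make the counting step concrete without the charting library, here is a standard-library sketch of the same idea: extract the first path segment of each URL and count occurrences. The first_directory helper is hypothetical, and the URL list is a handful of examples taken from this page rather than the actual filtered result.

```python
from collections import Counter
from urllib.parse import urlsplit

def first_directory(url):
    """Return the first path segment of a URL, or None for the home page."""
    segments = [part for part in urlsplit(url).path.split('/') if part]
    return segments[0] if segments else None

# Illustrative URLs only, not the real list of pages above the threshold
urls = [
    'https://supermetrics.com/careers/join-supermetrics',
    'https://support.supermetrics.com/support/solutions/articles/19000098352-how-to-enroll-a-data-source-in-bigquery',
    'https://supermetrics.com/blog/category/marketing-analytics?page=5',
    'https://supermetrics.com/blog/google-ads-metrics',
]

dir_counts = Counter(first_directory(url) for url in urls)
for directory, count in dir_counts.most_common():
    print(directory, count)
```

If one directory dominates the counts, the change is likely tied to that section's templates or infrastructure rather than to individual pages.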