How to map the internal links of a website, count in/out/all links, calculate PageRank, and cluster and visualize pages by PageRank.
links
analytics
python
advertools
adviz
crawling
Author
Elias Dabbas
Published
July 6, 2024
Internal links on a website form the skeleton that holds the pages together. It is through links that we navigate and discover content. The number and quality of links pointing to a page are an important indicator of that page's quality.
This guide can be used as a general starting point for auditing, understanding, and analyzing how a website's pages are interlinked. We will crawl a website, map its pages and links, and use graph theory to convert them into a graph data structure. We will then compute the important graph metrics for each page, and finally cluster and visualize the pages based on those measurements.
We start by crawling a website.
Crawl a website
import advertools as adv
import adviz
import pandas as pd
import networkx as nx
import plotly.express as px
from sklearn.cluster import KMeans
from dash_bootstrap_templates import load_figure_template

load_figure_template('flatly')

adv.crawl(
    url_list='https://supermetrics.com/',
    output_file='supermetrics_crawl.jl',
    follow_links=True,
    custom_settings={'LOG_FILE': 'supermetrics_crawl.log'})
Since we don't need the whole website's data and are only interested in the links, we will load just the columns of interest.
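Here is a minimal sketch of that step, assuming the crawl file is named supermetrics_crawl.jl as above and that the link-related columns are the ones listed below; recent advertools versions also provide adv.crawlytics.jl_subset, which can read only selected columns directly from the .jl file.

# Columns needed for the link analysis: the page URL plus its outgoing links.
link_columns = ['url', 'links_url', 'links_text', 'links_nofollow']

# Read the crawl output (one JSON object per line) and keep only those columns.
crawldf = pd.read_json('supermetrics_crawl.jl', lines=True)[link_columns]
crawldf.head()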
We can now use the advertools.crawlytics.links function to create the mapping of links showing each URL and which other URLs it links to, “from” and “to” respectively.
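A sketch of that call; the internal_url_regex value here is an assumption, used to mark links that point to the same domain.

# One row per link found on each crawled page, showing the linking URL and the linked URL.
link_df = adv.crawlytics.links(crawldf, internal_url_regex='supermetrics')
link_df.head()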
You already know that example.com is not the same as example.com/, so we need to normalize the data. Keep in mind that this might not be what search engines do, but since we are interested in understanding and quantifying the relationships, we will remove the trailing slashes here and treat the actual normalization on the website itself as a separate task.
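A minimal sketch of that normalization, assuming the link table uses 'url' for the linking page and 'link' for the linked page:

# Strip trailing slashes so example.com and example.com/ become the same node.
link_df['url'] = link_df['url'].str.rstrip('/')
link_df['link'] = link_df['link'].str.rstrip('/')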
For this case, we will create a directed graph, because you can link from page A to page B without doing the opposite. The edges will not have weights because these are simple links (not distances between cities, or prices of going from a point to another).
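Building the directed, unweighted graph with networkx might look like this (same assumed column names as above):

# Each link becomes a directed edge from the linking page to the linked page.
# If link_df still contains external links, you may want to filter it to
# internal links first, so the graph represents only the site itself.
links_graph = nx.DiGraph()
links_graph.add_edges_from(zip(link_df['url'], link_df['link']))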
Now we want to have a table that lists all URLs in our website (network), and for each URL we want to have a set of metrics that we calculate based on its position in the website.
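One way to start that table is simply to list the graph's nodes; the column is named 'nodes' here because the metrics code below expects that name.

# One row per URL (node) in the network; the metrics get added as columns next.
network_df = pd.DataFrame({'nodes': list(links_graph.nodes)})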
We now want to calculate the following metrics for each URL:
degree: The total number of links a page has (incoming + outgoing), or in_degree + out_degree.
in_degree: The number of other pages pointing to this URL.
out_degree: The number of pages that this page links to.
deg_centrality: The proportion of the other nodes (URLs) that this URL is connected to. It is a simple calculation: degree ÷ (number of nodes in the network − 1), excluding the node itself from the denominator.
pagerank: The PageRank algorithm created by Larry Page. It gives us a numeric value that is more informative than a simple link count, because links from important pages count for more than links from unimportant ones.
# Raw link counts for each URL (node) in the graph.
network_df['degree'] = [links_graph.degree[node] for node in network_df['nodes']]
network_df['in_degree'] = [links_graph.in_degree[node] for node in network_df['nodes']]
network_df['out_degree'] = [links_graph.out_degree[node] for node in network_df['nodes']]

# Degree centrality: the fraction of other nodes each URL is connected to.
deg_centrality = nx.degree_centrality(links_graph)
network_df['deg_centrality'] = [deg_centrality[node] for node in network_df['nodes']]

# PageRank: an importance score per URL; all values in the network sum to one.
pr = nx.pagerank(links_graph)
network_df['page_rank'] = [pr[node] for node in network_df['nodes']]

number_of_nodes = links_graph.number_of_nodes()
number_of_edges = links_graph.number_of_edges()

# Rename to friendlier column names.
network_df = network_df.rename(columns={
    'nodes': 'url',
    'degree': 'links',
    'in_degree': 'inlinks',
    'out_degree': 'outlinks',
    'deg_centrality': 'deg_centrality',
    'page_rank': 'pagerank',
})
network_df
Now we have computed the various metrics for each URL. We also have them in a DataFrame, so visualization and analysis become straightforward.
Link distribution
How are my links distributed? How many pages have X links each? What is the shape of the distribution of this metric?
# `primary` is assumed to be a theme color defined earlier (e.g. the flatly
# template's primary color); it simply colors all bars consistently.
px.histogram(
    network_df,
    x='links',
    title='Distribution of number of (in + out) links per page',
    labels={'links': 'links per page'},
    color_discrete_sequence=[primary for i in range(len(network_df))])
We can see three clusters of pages: the small ones (fewer than 200 links per page), the medium (roughly 660-1,000), and the large (1,600-1,650).
There are also the extremely large ones, with more than two thousand links each. These are clearly being treated as the most important URLs. We can now check whether it is right to give them that much weight, and whether we should give more importance to other pages as well.
Pagerank distribution
Interpreting this metric is difficult because all PageRank values in a network sum to one. So, depending on the size of the network, even the top values can be small numbers that are hard to read on their own. The top value in this case is 0.003864, for example. Is that a good figure? One useful reference point is the average value, 1 ÷ number of URLs, so a page's PageRank is easier to judge as a multiple of that average.
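As a small sketch of that comparison, building on the links_graph and network_df objects created above:

# PageRank values sum to one, so the average value is 1 / number of nodes.
avg_pagerank = 1 / links_graph.number_of_nodes()

# Express each URL's PageRank as a multiple of the network average.
network_df['pagerank_vs_avg'] = network_df['pagerank'] / avg_pagerank
network_df.sort_values('pagerank', ascending=False).head()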
Let's visualize the distribution:
px.histogram(
    network_df,
    x='pagerank',
    title='Distribution of pagerank values',
    color_discrete_sequence=[primary for i in range(len(network_df))])
Clustering and visualizing pagerank
To further categorize our URLs based on their PageRank values, we can cluster them into groups. Where would you draw the line if you were asked to split the PageRank values into "small" and "large"? Note that this is not the same as taking the top ten or the top five percent. We want a clustering into small vs. large, regardless of how many URLs end up in each cluster.
What if we want three clusters (small, medium, large)? What if we wanted four?
We will use the KMeans clustering algorithm from scikit-learn for this. We simply set the n_clusters parameter to get the number of clusters we want.
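The clustering step itself isn't shown above, so here is a minimal sketch that produces the pagerank_cluster column used in the plot below; the choice of four clusters is an assumption you can change.

# Fit KMeans on the one-dimensional pagerank values; n_clusters controls
# how many groups (small/medium/large/...) we end up with.
kmeans = KMeans(n_clusters=4, random_state=0, n_init=10)

# Store each URL's cluster label as a string so the chart treats it as categorical.
network_df['pagerank_cluster'] = kmeans.fit_predict(network_df[['pagerank']]).astype(str)
network_df['pagerank_cluster'].value_counts()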
# Cumulative proportions used to convert the y-axis to percentages.
cdf = [x / len(network_df) for x in range(1, len(network_df) + 1)]

fig = adviz.ecdf(
    network_df,
    x='pagerank',
    hover_name='url',
    color='pagerank_cluster',
    template='flatly',
    height=600,
    title='Internal link Pagerank clusters<br><b>SuperMetrics.com</b>',
)

# Remove the default facet annotations and trace names.
for annotation in fig.layout.annotations:
    annotation.text = ''
for data in fig.data:
    data.name = ''

# Replace each ECDF trace's y values with the cumulative proportions and
# format the hover percentages; the histogram trace is left as-is.
cumsum = 0
for data in fig.data:
    if data.type == 'histogram':
        data.opacity = 1
        continue
    data.y = cdf[cumsum:cumsum + len(data.y)]
    cumsum += len(data.y)
    data.hovertemplate = data.hovertemplate.replace('percent: %{y:.1f}', 'percent: %{y:.1%}')

fig.layout.legend.title = 'Pagerank<br>cluster'
fig
We now have all the URLs in our network plotted on the chart and colored by the cluster to which they belong. You can hover over any of the circles to get the relevant details. Had we chosen five clusters, we would have had five colors.
You can see the circles clustering around certain centers, with a few circles slightly further from their cluster’s center. Feel free to play around with that parameter and see what works for you.