Audit and Analyze Internal Links

How to map a website’s internal links, count inlinks, outlinks, and total links, calculate Pagerank, and cluster and visualize pages by Pagerank.
links
analytics
python
advertools
adviz
crawling
Author

Elias Dabbas

Published

July 6, 2024

Internal links on a website form the skeleton that holds the pages together. It is through links that we navigate and discover content. The number and quality of a page’s links are an important indicator of the quality of that page.

This guide can serve as a general starting point for auditing, understanding, and analyzing how a website’s pages are interlinked. We will crawl a website, map pages and links, and use graph theory to convert this mapping into a graph data structure. We will then compute the important graph metrics for each page, and finally cluster and visualize the pages based on those measurements.

We start by crawling a website.

Crawl a website

import advertools as adv
import adviz
from dash_bootstrap_templates import load_figure_template
load_figure_template('flatly')
import pandas as pd
import networkx as nx
import plotly.express as px
from sklearn.cluster import KMeans

adv.crawl(
    url_list='https://supermetrics.com/',
    output_file='supermetrics_crawl.jl',
    follow_links=True,
    custom_settings={
        'LOG_FILE': 'supermetrics_crawl.log'
    })

Since we don’t need the whole website’s data and are only interested in the links, we will load just the columns of interest.

links_url = adv.crawlytics.jl_subset(
    'supermetrics_crawl.jl',
    columns=['url', 'links_url', 'links_nofollow'])

We can now use the advertools.crawlytics.links function to create the link mapping, showing each URL and the URLs it links to (“from” and “to” respectively).
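A minimal sketch of that call, assuming the subset DataFrame loaded above; the internal_url_regex pattern used here is an assumption for this site, and it flags links pointing to the same domain as internal:

# Create the "from" / "to" link mapping from the crawl subset;
# internal_url_regex marks links matching the pattern as internal
link_df = adv.crawlytics.links(
    links_url,
    internal_url_regex='supermetrics.com')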

Data cleaning: URL normalization

You already know that example.com is not the same as example.com/, so we need to normalize the data. Keep in mind that this might not be what search engines do, but since we are interested in understanding and quantifying the relationships, we will remove the trailing slashes here and keep the actual normalization on the website itself as a separate task.

links_df_norm = (link_df
                 [link_df['internal']]                      # filter for internal links
                 [['url', 'link']]                          # keep the "from" and "to" columns
                 .dropna()
                 .apply(lambda col: col.str.rstrip('/')))   # strip trailing slashes from URLs

Creating a graph with networkx

For this case, we will create a directed graph, because page A can link to page B without page B linking back. The edges will not have weights, because these are simple links (not distances between cities, or prices of going from one point to another). Note that a DiGraph stores at most one edge per (from, to) pair, so duplicate links between the same two pages are counted once.

links_graph = nx.DiGraph()
links_graph.add_edges_from(links_df_norm.values)

Creating a DataFrame from the graph

Now we want to have a table that lists all URLs in our website (network), and for each URL we want to have a set of metrics that we calculate based on its position in the website.

network_df = pd.DataFrame({'nodes': links_graph.nodes})
network_df.head()
nodes
0 https://supermetrics.com
1 https://supermetrics.com/marketing-intelligence-cloud
2 https://supermetrics.com/marketing-intelligence-cloud/connect
3 https://supermetrics.com/marketing-intelligence-cloud/transform
4 https://supermetrics.com/marketing-intelligence-cloud/store

Graph metrics

We now want to calculate the following metrics for each URL:

  • degree: The total number of links a page has (incoming + outgoing), or in_degree + out_degree.
  • in_degree: The number of other pages pointing to this URL.
  • out_degree: The number of pages that this page links to.
  • deg_centrality: The proportion of other nodes (URLs) that this URL is connected to. It is a simple calculation: degree ÷ (number of nodes in the network − 1).
  • pagerank: The metric created by Larry Page. It gives us a numeric value that is more accurate than a simple link count, because links from important pages count more than links from unimportant ones; see the formula right after this list.
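For reference, the simplified PageRank recurrence with damping factor d (networkx’s pagerank uses alpha=0.85 by default) is:

PR(u) = (1 − d) / N + d × Σ_{v → u} PR(v) / out_degree(v)

where N is the number of pages and the sum runs over the pages v that link to u. Since the values form a probability distribution, they sum to one across all pages, which matters when interpreting them below.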
network_df['degree'] = [links_graph.degree[node] for node in network_df['nodes']]
network_df['in_degree'] = [links_graph.in_degree[node] for node in network_df['nodes']]
network_df['out_degree'] = [links_graph.out_degree[node] for node in network_df['nodes']]
deg_centrality = nx.degree_centrality(links_graph)
network_df['deg_centrality'] = [deg_centrality[node] for node in network_df['nodes']]
pr = nx.pagerank(links_graph)
network_df['page_rank'] = [pr[node] for node in network_df['nodes']]
number_of_nodes = links_graph.number_of_nodes()
number_of_edges = links_graph.number_of_edges()
network_df = network_df.rename(columns={
    'nodes': 'url',
    'degree': 'links',
    'in_degree': 'inlinks',
    'out_degree': 'outlinks',
    'deg_centrality': 'deg_centrality',
    'page_rank': 'pagerank',
})
network_df
url links inlinks outlinks deg_centrality pagerank
0 https://supermetrics.com 1338 1264 74 0.277191 0.003864
1 https://supermetrics.com/marketing-intelligence-cloud 944 868 76 0.195567 0.003136
2 https://supermetrics.com/marketing-intelligence-cloud/connect 940 865 75 0.194738 0.003131
3 https://supermetrics.com/marketing-intelligence-cloud/transform 941 865 76 0.194945 0.003131
4 https://supermetrics.com/marketing-intelligence-cloud/store 941 865 76 0.194945 0.003131
... ... ... ... ... ... ...
4823 https://supermetrics.com/connectors/iqm 95 12 83 0.019681 0.000081
4824 https://supermetrics.com/connectors/line-ads 88 12 76 0.018231 0.000081
4825 https://supermetrics.com/connectors/omnisend 95 12 83 0.019681 0.000081
4826 https://supermetrics.com/connectors/teads 95 12 83 0.019681 0.000081
4827 https://supermetrics.com/connectors/youtube-public-data 95 12 83 0.019681 0.000081

4828 rows × 6 columns
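As a quick sanity check (a sketch using the objects defined above), we can verify how these metrics relate to one another:

import numpy as np

# links should equal inlinks + outlinks
assert (network_df['inlinks'] + network_df['outlinks']).equals(network_df['links'])

# networkx normalizes degree centrality by (n - 1), the maximum possible degree
n = links_graph.number_of_nodes()
assert np.allclose(network_df['deg_centrality'], network_df['links'] / (n - 1))

# pagerank is a probability distribution over all pages, so it sums to one
assert abs(network_df['pagerank'].sum() - 1) < 1e-6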

Analyzing the network

Now that we have computed the metrics and assigned a numeric value to each URL, and have everything in a DataFrame, visualization and analysis become straightforward.

Pagerank distribution

Interpreting this metric is difficult because all pagerank values sum to one. So, depending on the size of the network, even the top values can be very small numbers that are hard to interpret on their own. The home page’s value in this case is 0.003864, for example. Is that a good figure?
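One rough yardstick (a heuristic for this analysis, not anything search engines publish) is the uniform baseline 1/N, the value every page would get if pagerank were spread evenly:

# Compare the home page's pagerank to the uniform baseline 1/N
baseline = 1 / links_graph.number_of_nodes()   # 1/4,828 ≈ 0.000207
home_pr = network_df['pagerank'].iloc[0]       # 0.003864 for the home page
print(f'{home_pr / baseline:.1f}x an average page')  # ≈ 18.7x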

Let’s first visualize the distribution:

primary = '#2C3E50'  # assumed: the Flatly template's primary color, not defined above
px.histogram(
    network_df,
    x='pagerank',
    title='Distribution of pagerank values',
    color_discrete_sequence=[primary])

Clustering and visualizing pagerank

To further categorize our URLs based on their pagerank values, we can cluster them into groups. Where would you draw the line if you were asked to split the pagerank values into “small” and “large”? Note that this is not the same as the top ten or the top five percent. We want a clustering of small vs. large, regardless of how many URLs end up in each cluster.

What if we want three clusters (small, medium, large)? What if we wanted four?

We will use the KMeans clustering algorithm from scikit-learn for this. We can simply set the n_clusters parameter to the number of clusters we want.

kmeans = KMeans(n_clusters=4)
network_df['pagerank_cluster'] = kmeans.fit_predict(network_df[['pagerank']])
cdf = [x / len(network_df) for x in range(1, len(network_df)+1)]
fig = adviz.ecdf(
    network_df,
    x='pagerank',
    hover_name='url',
    color='pagerank_cluster',
    template='flatly',
    height=600,
    title='Internal link Pagerank clusters<br><b>Supermetrics.com</b>',
)
for annotation in fig.layout.annotations:
    annotation.text = ''
for data in fig.data:
    data.name = ''
cumsum = 0

for data in fig.data:
    if data.type == 'histogram':
        data.opacity = 1
        continue
    data.y = cdf[cumsum:cumsum+len(data.y)]
    cumsum += len(data.y)
    data.hovertemplate = data.hovertemplate.replace('percent: %{y:.1f}', 'percent: %{y:.1%}')
fig.layout.legend.title = 'Pagerank<br>cluster'
fig

We now have all the URLs in our network plotted on the chart and colored by the cluster to which they belong. You can hover over any of the circles to get the relevant details. Had we chosen five clusters, we would have had five colors.

You can see the circles clustering around certain centers, with a few circles slightly further from their cluster’s center. Feel free to play around with that parameter and see what works for you.
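If you want to compare several cluster counts side by side before settling on one, here is a minimal sketch (the extra columns are hypothetical, added just for comparison):

# Fit KMeans for a few values of n_clusters and store each labeling
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k).fit_predict(network_df[['pagerank']])
    network_df[f'pagerank_cluster_{k}'] = labels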

Pagerank cluster summary statistics table

(network_df
 .groupby('pagerank_cluster')
 [['links', 'pagerank']]
 .describe()
 .sort_values(('pagerank', 'mean'))
 .style
 .background_gradient(cmap='cividis')
 .format('{:.3f}')
 .format('{:,.0f}', subset=[('links', 'count'), ('pagerank', 'count')]))
links
pagerank_cluster    count   mean      std      min      25%       50%       75%       max
0                   3,978   100.183   187.771  1.000    1.000     2.000     93.000    827.000
2                   602     1550.703  283.666  23.000   1602.000  1604.000  1613.000  1626.000
3                   174     597.598   194.964  41.000   566.000   575.000   575.000   1610.000
1                   74      1223.311  499.324  843.000  939.000   945.500   1335.250  2568.000

pagerank
pagerank_cluster    count   mean   std    min    25%    50%    75%    max
0                   3,978   0.000  0.000  0.000  0.000  0.000  0.000  0.000
2                   602     0.001  0.000  0.000  0.001  0.001  0.001  0.001
3                   174     0.001  0.000  0.001  0.001  0.001  0.001  0.002
1                   74      0.003  0.000  0.003  0.003  0.003  0.004  0.005