How to map the internal links of a website, count in/out/all links, calculate PageRank, and cluster and visualize pages by PageRank.
links
analytics
python
advertools
adviz
crawling
Author
Elias Dabbas
Published
July 6, 2024
Internal links on a website form the skeleton that holds the pages together. It is through links that we navigate and discover content. The number and quality of links pointing to a page are an important indicator of that page's quality.
This guide can be used as a general starting point for auditing, understanding, and analyzing how a website's pages are interlinked. We will crawl a website, map its pages and links, and use graph theory to convert them into a graph data structure. We will then compute the important graph metrics for each page, and finally cluster and visualize the pages based on those measurements.
We start by crawling a website.
Crawl a website
import advertools as adv
import adviz
import pandas as pd
import networkx as nx
import plotly.express as px
from sklearn.cluster import KMeans
from dash_bootstrap_templates import load_figure_template

load_figure_template('flatly')

adv.crawl(
    url_list='https://supermetrics.com/',
    output_file='supermetrics_crawl.jl',
    follow_links=True,
    custom_settings={'LOG_FILE': 'supermetrics_crawl.log'})
Since we don't need the whole website's data and are only interested in the links, we will load just the columns of interest.
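Here is a minimal sketch of that step, assuming the crawl file is named supermetrics_crawl.jl as above and that the link-related columns are the ones listed below; recent advertools versions also provide adv.crawlytics.jl_subset, which can read only selected columns directly from the .jl file.

# Columns needed for the link analysis: the page URL plus its outgoing links.
link_columns = ['url', 'links_url', 'links_text', 'links_nofollow']

# Read the crawl output (one JSON object per line) and keep only those columns.
crawldf = pd.read_json('supermetrics_crawl.jl', lines=True)[link_columns]
crawldf.head()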
We can now use the advertools.crawlytics.links function to create the mapping of links showing each URL and which other URLs it links to, “from” and “to” respectively.
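A sketch of that call; the internal_url_regex value here is an assumption, used to mark links that point to the same domain.

# One row per link found on each crawled page, showing the linking URL and the linked URL.
link_df = adv.crawlytics.links(crawldf, internal_url_regex='supermetrics')
link_df.head()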
You already know that example.com is not the same as example.com/, so we need to normalize the data. Keep in mind that this might not be what search engines do, but since we are interested in understanding and quantifying the relationships, we will remove the trailing slashes here and treat the actual normalization on the website itself as a separate task.
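A minimal sketch of that normalization, assuming the link table uses 'url' for the linking page and 'link' for the linked page:

# Strip trailing slashes so example.com and example.com/ become the same node.
link_df['url'] = link_df['url'].str.rstrip('/')
link_df['link'] = link_df['link'].str.rstrip('/')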
For this case, we will create a directed graph, because you can link from page A to page B without doing the opposite. The edges will not have weights because these are simple links (not distances between cities, or prices of going from a point to another).
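Building the directed, unweighted graph with networkx might look like this (same assumed column names as above):

# Each link becomes a directed edge from the linking page to the linked page.
# If link_df still contains external links, you may want to filter it to
# internal links first, so the graph represents only the site itself.
links_graph = nx.DiGraph()
links_graph.add_edges_from(zip(link_df['url'], link_df['link']))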
Now we want to have a table that lists all URLs in our website (network), and for each URL we want to have a set of metrics that we calculate based on its position in the website.
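One way to start that table is simply to list the graph's nodes; the column is named 'nodes' here because the metrics code below expects that name.

# One row per URL (node) in the network; the metrics get added as columns next.
network_df = pd.DataFrame({'nodes': list(links_graph.nodes)})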
We now want to calculate the following metrics for each URL:
degree: The total number of links a page has (incoming + outgoing), or in_degree + out_degree.
in_degree: The number of other pages pointing to this URL.
out_degree: The number of pages that this page links to.
deg_centrality: The proportion of the other nodes (URLs) that this URL is connected to. It is a simple calculation: degree ÷ (number of nodes in the network − 1), excluding the node itself from the denominator.
pagerank: The PageRank algorithm created by Larry Page. It gives us a numeric value that is more informative than a simple link count, because links from important pages count for more than links from unimportant ones.
# Raw link counts for each URL (node) in the graph.
network_df['degree'] = [links_graph.degree[node] for node in network_df['nodes']]
network_df['in_degree'] = [links_graph.in_degree[node] for node in network_df['nodes']]
network_df['out_degree'] = [links_graph.out_degree[node] for node in network_df['nodes']]

# Degree centrality: the fraction of other nodes each URL is connected to.
deg_centrality = nx.degree_centrality(links_graph)
network_df['deg_centrality'] = [deg_centrality[node] for node in network_df['nodes']]

# PageRank: an importance score per URL; all values in the network sum to one.
pr = nx.pagerank(links_graph)
network_df['page_rank'] = [pr[node] for node in network_df['nodes']]

number_of_nodes = links_graph.number_of_nodes()
number_of_edges = links_graph.number_of_edges()

# Rename to friendlier column names.
network_df = network_df.rename(columns={
    'nodes': 'url',
    'degree': 'links',
    'in_degree': 'inlinks',
    'out_degree': 'outlinks',
    'deg_centrality': 'deg_centrality',
    'page_rank': 'pagerank',
})
network_df
Now we have computed the various metrics for each URL. We also have them in a DataFrame, so visualization and analysis become straightforward.
Link distribution
How are my links distributed? How many pages have X links each? What is the shape of the distribution of this metric?
# `primary` is assumed to be a theme color defined earlier (e.g. the flatly
# template's primary color); it simply colors all bars consistently.
px.histogram(
    network_df,
    x='links',
    title='Distribution of number of (in + out) links per page',
    labels={'links': 'links per page'},
    color_discrete_sequence=[primary for i in range(len(network_df))])
We can see three clusters of pages: the small ones (fewer than 200 links per page), the medium (roughly 660-1,000), and the large (1,600-1,650).
There are also the extremely large ones, with more than two thousand links each. These are clearly being treated as the most important URLs. We can now check whether it is right to give them that much weight, and whether we should give more importance to other pages as well.
Pagerank distribution
Interpreting this metric is difficult because all PageRank values in a network sum to one. So, depending on the size of the network, even the top values can be small numbers that are hard to read on their own. The top value in this case is 0.003864, for example. Is that a good figure? One useful reference point is the average value, 1 ÷ number of URLs, so a page's PageRank is easier to judge as a multiple of that average.
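As a small sketch of that comparison, building on the links_graph and network_df objects created above:

# PageRank values sum to one, so the average value is 1 / number of nodes.
avg_pagerank = 1 / links_graph.number_of_nodes()

# Express each URL's PageRank as a multiple of the network average.
network_df['pagerank_vs_avg'] = network_df['pagerank'] / avg_pagerank
network_df.sort_values('pagerank', ascending=False).head()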
Let's visualize the distribution:
px.histogram(
    network_df,
    x='pagerank',
    title='Distribution of pagerank values',
    color_discrete_sequence=[primary for i in range(len(network_df))])
Clustering and visualizing pagerank
To further categorize our URLs based on their PageRank values, we can cluster them into groups. Where would you draw the line if you were asked to split the PageRank values into "small" and "large"? Note that this is not the same as taking the top ten or the top five percent. We want a clustering into small vs. large, regardless of how many URLs end up in each cluster.
What if we want three clusters (small, medium, large)? What if we wanted four?
We will use the KMeans clustering algorithm from scikit-learn for this. We simply set the n_clusters parameter to get the number of clusters we want.
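The clustering step itself isn't shown above, so here is a minimal sketch that produces the pagerank_cluster column used in the plot below; the choice of four clusters is an assumption you can change.

# Fit KMeans on the one-dimensional pagerank values; n_clusters controls
# how many groups (small/medium/large/...) we end up with.
kmeans = KMeans(n_clusters=4, random_state=0, n_init=10)

# Store each URL's cluster label as a string so the chart treats it as categorical.
network_df['pagerank_cluster'] = kmeans.fit_predict(network_df[['pagerank']]).astype(str)
network_df['pagerank_cluster'].value_counts()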
# Cumulative proportions used to convert the y-axis to percentages.
cdf = [x / len(network_df) for x in range(1, len(network_df) + 1)]

fig = adviz.ecdf(
    network_df,
    x='pagerank',
    hover_name='url',
    color='pagerank_cluster',
    template='flatly',
    height=600,
    title='Internal link Pagerank clusters<br><b>SuperMetrics.com</b>',
)

# Remove the default facet annotations and trace names.
for annotation in fig.layout.annotations:
    annotation.text = ''
for data in fig.data:
    data.name = ''

# Replace each ECDF trace's y values with the cumulative proportions and
# format the hover percentages; the histogram trace is left as-is.
cumsum = 0
for data in fig.data:
    if data.type == 'histogram':
        data.opacity = 1
        continue
    data.y = cdf[cumsum:cumsum + len(data.y)]
    cumsum += len(data.y)
    data.hovertemplate = data.hovertemplate.replace('percent: %{y:.1f}', 'percent: %{y:.1%}')

fig.layout.legend.title = 'Pagerank<br>cluster'
fig
We now have all the URLs in our network plotted on the chart and colored by the cluster to which they belong. You can hover over any of the circles to get the relevant details. Had we chosen five clusters, we would have had five colors.
You can see the circles clustering around certain centers, with a few circles slightly further from their cluster’s center. Feel free to play around with that parameter and see what works for you.