Word Co-Occurrence Matrix

How to create a co-occurrence matrix of words/tokens in a list of documents (keywords, page titles, product descriptions, etc.)
data visualization
plotly
analytics
text analysis
keywords
Author

Elias Dabbas

Published

August 15, 2024

The plan:

  • Get a list of documents (crawl file title tags, h1 tags, keywords, product descriptions, etc.)
  • Find the top N most frequently used words/terms
  • Locate the documents where these terms are used
  • Compute the correlation of these terms
  • Visualize with a heatmap

Getting a list of documents

In this example I will use title tags from a crawl file. You can easily use any other text column and run the same code again.

import advertools as adv
import pandas as pd
import plotly.express as px
df = pd.read_json('../compare-crawls/supermetrics_crawl2.jl', lines=True)
df[['url', 'title']].head()
url title
0 https://supermetrics.com/ Supermetrics: Turn your marketing data into op...
1 https://hub.supermetrics.com Log In Or Create An Account - Supermetrics
2 https://supermetrics.com/book-demo Book a free demo of the Supermetrics marketing...
3 https://supermetrics.com/contact Contact Supermetrics - Supermetrics
4 https://supermetrics.com/start-trial Start Free Trial - Supermetrics

Finding the top N most frequent terms

We have the flexibility of getting words, bigrams, trigrams, or any pattern we want to use, as well as remove any stopwords that we don’t think are relevant.

In this case, the name of the website is removed, because it is present in every title.

N = 25
rm_words = [
    '-', '|', 'to', 'and', 'in', 'for', 'a', 'the', 'with', 'your', 'of',
    '&', 'supermetrics'
]
top_words = adv.word_frequency(
    df['title'].dropna(),
    rm_words=rm_words).head(N)
top_words.head(10)
word abs_freq
0 support 1031
1 data 664
2 how 549
3 documentation 392
4 guide 354
5 google 352
6 marketing 232
7 ads 207
8 connection 194
9 studio 183

Locating the documents where the top terms are used

This is done by creating new boolean columns that check whether or not the document (page title in this case) matches the term.

Keep in mind that there are regex pitfalls over here. The pattern “love” would match the word “glove” but that’s not what we want.

One way to handle this is to add the word boundary character \b before and after the words.

r'\b' + "marketing" + r'\b'
'\\bmarketing\\b'
for word in top_words['word'][:N]:
    df[f'term_{word}'] = df['title'].str.lower().str.contains(r'\b' + word + r'\b', regex=True)
df.filter(regex='^title$|term_').iloc[:5, :5]
title term_support term_data term_how term_documentation
0 Supermetrics: Turn your marketing data into op... False True False False
1 Log In Or Create An Account - Supermetrics False False False False
2 Book a free demo of the Supermetrics marketing... False True False False
3 Contact Supermetrics - Supermetrics False False False False
4 Start Free Trial - Supermetrics False False False False

We now remove the beginning string “term_” from column names.

term_df = df.filter(regex='term_')
term_df.columns = [col.replace('term_', '') for col in term_df.columns]

Compute the correlation between the terms

corr_matrix = term_df.corr()
corr_matrix.iloc[:5, :5]
support data how documentation guide
support 1.000000 -0.042314 0.279659 -0.355278 0.412882
data -0.042314 1.000000 0.104377 0.149224 -0.105533
how 0.279659 0.104377 1.000000 -0.227517 -0.197683
documentation -0.355278 0.149224 -0.227517 1.000000 -0.174243
guide 0.412882 -0.105533 -0.197683 -0.174243 1.000000

Visualize with a heatmap

If we immediately create the heatmap of the correlations, we will get a diagonal showing a perfect 1.0 correlation between each term and itself, which would be misleading.

So, before visualizing, we can convert those values on the diagonal of the matrix to Nones, then visualize.

corr = [[corr_matrix.values[i][j] if i != j else None
         for j in range(len(corr_matrix))]
        for i in range(len(corr_matrix))]

corr_df = pd.DataFrame(
    corr,
    columns=corr_matrix.columns,
    index=corr_matrix.columns)

fig = px.imshow(
    corr_df,
    height=900,
    width=900,
    color_continuous_scale='RdBu',
    color_continuous_midpoint=0,
    template='none',
    labels={'color': 'correlation'},
    title='Term correlation in page titles')
fig.layout.xaxis.side = 'top'
fig.layout.xaxis.showgrid = False
fig.layout.yaxis.showgrid = False
fig.layout.xaxis.tickangle = -38
fig.layout.coloraxis.colorbar.outlinecolor = 'white'
fig.layout.coloraxis.colorbar.title = 'Correlation<br>'
fig.layout.margin.t = 150
fig

Possible improvements and next steps

  • Filter for a subset of URLs (community pages, shop, etc.)
  • Filter for a subset of words under a certain category
  • Add counts and correlations as text to the chart
  • Use another crterion for selecting words (other than most used)
df[['title', 'term_schema', 'term_standard', 'term_warehouse']].sort_values('term_schema', ascending=False).head(15)['title'].tolist()
['Supermetrics Documentation | Apple Public Data Standard Data Warehouse Schema',
 'Supermetrics Documentation | Snapchat Marketing Standard Data Warehouse Schema',
 'Supermetrics Documentation | Rakuten Advertising Standard Data Warehouse Schema',
 'Supermetrics Documentation | Bing Webmaster Tools Standard Data Warehouse Schema',
 'Supermetrics Documentation | Prisjakt Standard Data Warehouse Schema',
 'Supermetrics Documentation | Piwik PRO Standard Data Warehouse Schema',
 'Supermetrics Documentation | Partnerize Standard Data Warehouse Schema',
 'Supermetrics Documentation | Pinterest Organic (Legacy) Standard Data Warehouse Schema',
 'Supermetrics Documentation | Quora Ads Standard Data Warehouse Schema',
 'Supermetrics Documentation | Nielsen Digital Ad Ratings Standard Data Warehouse Schema',
 'Supermetrics Documentation | Nexxen DSP Standard Data Warehouse Schema',
 'Supermetrics Documentation | YouTube Standard Data Warehouse Schema',
 'Supermetrics Documentation | Matomo Standard Data Warehouse Schema',
 'Supermetrics Documentation | MediaMath Standard Data Warehouse Schema',
 'Supermetrics Documentation | X Public Data (Twitter) Standard Data Warehouse Schema']