The plan:
- Get a list of documents (crawl file title tags, h1 tags, keywords, product descriptions, etc.)
- Find the top N most frequently used words/terms
- Locate the documents where these terms are used
- Compute the correlation of these terms
- Visualize with a heatmap
Getting a list of documents
In this example I will use title tags from a crawl file. You can easily use any other text column and run the same code again.
| | url | title |
|---|---|---|
| 0 | https://supermetrics.com/ | Supermetrics: Turn your marketing data into op... |
| 1 | https://hub.supermetrics.com | Log In Or Create An Account - Supermetrics |
| 2 | https://supermetrics.com/book-demo | Book a free demo of the Supermetrics marketing... |
| 3 | https://supermetrics.com/contact | Contact Supermetrics - Supermetrics |
| 4 | https://supermetrics.com/start-trial | Start Free Trial - Supermetrics |
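If the crawl was done with advertools, the output is a JSON-lines file (one JSON object per line) that pandas reads directly. A minimal sketch, using an in-memory buffer with two sample lines in place of a real crawl file:

```python
import pandas as pd
from io import StringIO

# Two sample JSON lines stand in for a real crawl file here;
# for an actual crawl, pass the path to the .jl file instead.
sample_jl = StringIO(
    '{"url": "https://supermetrics.com/", "title": "Supermetrics: Turn your marketing data into opportunity"}\n'
    '{"url": "https://supermetrics.com/contact", "title": "Contact Supermetrics - Supermetrics"}\n'
)

# lines=True tells pandas each line is a separate JSON record
df = pd.read_json(sample_jl, lines=True)
print(df[['url', 'title']])
```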
Finding the top N most frequent terms
We have the flexibility of getting words, bigrams, trigrams, or any pattern we want to use, as well as remove any stopwords that we don’t think are relevant.
In this case, the name of the website is removed, because it is present in every title.
```python
import advertools as adv

N = 25
rm_words = [
    '-', '|', 'to', 'and', 'in', 'for', 'a', 'the', 'with', 'your', 'of',
    '&', 'supermetrics'
]

top_words = adv.word_frequency(
    df['title'].dropna(),
    rm_words=rm_words).head(N)
top_words.head(10)
```
| | word | abs_freq |
|---|---|---|
| 0 | support | 1031 |
| 1 | data | 664 |
| 2 | how | 549 |
| 3 | documentation | 392 |
| 4 | guide | 354 |
| 5 | | 352 |
| 6 | marketing | 232 |
| 7 | ads | 207 |
| 8 | connection | 194 |
| 9 | studio | 183 |
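For intuition, the counting step amounts to splitting the titles into words, dropping the stopwords, and tallying what remains. A minimal pure-Python sketch of that idea (not advertools' actual implementation):

```python
from collections import Counter

# Toy titles for illustration
titles = [
    'Start Free Trial - Supermetrics',
    'Contact Supermetrics - Supermetrics',
]
rm_words = {'-', 'supermetrics'}

# Split, lowercase, drop stopwords, then count
counts = Counter(
    word
    for title in titles
    for word in title.lower().split()
    if word not in rm_words
)
print(counts.most_common(3))
```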
Locating the documents where the top terms are used
This is done by creating a new boolean column per term, indicating whether or not the document (the page title in this case) contains that term.
Keep in mind that there are regex pitfalls here: the pattern “love” would also match the word “glove”, which is not what we want.
One way to handle this is to add the word-boundary metacharacter \b before and after each word.
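A quick check of the difference the boundaries make:

```python
import re

# Without boundaries, 'love' matches inside 'glove'
no_boundary = bool(re.search(r'love', 'glove'))
# With \b on both sides, only the whole word matches
with_boundary = bool(re.search(r'\blove\b', 'glove'))
whole_word = bool(re.search(r'\blove\b', 'love it'))

print(no_boundary, with_boundary, whole_word)  # True False True
```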
```python
import re

for word in top_words['word'][:N]:
    # re.escape guards against words containing regex metacharacters
    pattern = r'\b' + re.escape(word) + r'\b'
    df[f'term_{word}'] = df['title'].str.lower().str.contains(pattern, regex=True)

df.filter(regex='^title$|term_').iloc[:5, :5]
```
| | title | term_support | term_data | term_how | term_documentation |
|---|---|---|---|---|---|
| 0 | Supermetrics: Turn your marketing data into op... | False | True | False | False |
| 1 | Log In Or Create An Account - Supermetrics | False | False | False | False |
| 2 | Book a free demo of the Supermetrics marketing... | False | True | False | False |
| 3 | Contact Supermetrics - Supermetrics | False | False | False | False |
| 4 | Start Free Trial - Supermetrics | False | False | False | False |
We now strip the “term_” prefix from the column names.
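One way to do that with a `rename`, shown here on a toy DataFrame (in the article this runs on the real `df` with its term columns):

```python
import pandas as pd

# Toy frame standing in for the real one
df = pd.DataFrame({
    'title': ['Start Free Trial - Supermetrics'],
    'term_support': [False],
    'term_data': [False],
})

# Strip "term_" only where it is actually a prefix
df = df.rename(
    columns=lambda c: c[len('term_'):] if c.startswith('term_') else c)
print(df.columns.tolist())  # ['title', 'support', 'data']
```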
Compute the correlation between the terms
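The matrix below can be produced with pandas' `DataFrame.corr`, which treats the boolean indicator columns as 0/1. A sketch with toy data (in the article the input is the term columns built above):

```python
import pandas as pd

# Toy indicator columns standing in for the real term columns
terms = pd.DataFrame({
    'support': [True, False, True, False],
    'data':    [False, True, True, False],
})

# Pearson correlation between the 0/1 indicators
corr_matrix = terms.astype(int).corr()
print(corr_matrix.round(2))
```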
| | support | data | how | documentation | guide |
|---|---|---|---|---|---|
| support | 1.000000 | -0.042314 | 0.279659 | -0.355278 | 0.412882 |
| data | -0.042314 | 1.000000 | 0.104377 | 0.149224 | -0.105533 |
| how | 0.279659 | 0.104377 | 1.000000 | -0.227517 | -0.197683 |
| documentation | -0.355278 | 0.149224 | -0.227517 | 1.000000 | -0.174243 |
| guide | 0.412882 | -0.105533 | -0.197683 | -0.174243 | 1.000000 |
Visualize with a heatmap
If we immediately create the heatmap of the correlations, we will get a diagonal showing a perfect 1.0 correlation between each term and itself, which would be misleading.
So, before visualizing, we convert the values on the diagonal of the matrix to None.
```python
import pandas as pd
import plotly.express as px

# Blank out the diagonal (each term vs. itself)
corr = [[corr_matrix.values[i][j] if i != j else None
         for j in range(len(corr_matrix))]
        for i in range(len(corr_matrix))]

corr_df = pd.DataFrame(
    corr,
    columns=corr_matrix.columns,
    index=corr_matrix.columns)

fig = px.imshow(
    corr_df,
    height=900,
    width=900,
    color_continuous_scale='RdBu',
    color_continuous_midpoint=0,
    template='none',
    labels={'color': 'correlation'},
    title='Term correlation in page titles')
fig.layout.xaxis.side = 'top'
fig.layout.xaxis.showgrid = False
fig.layout.yaxis.showgrid = False
fig.layout.xaxis.tickangle = -38
fig.layout.coloraxis.colorbar.outlinecolor = 'white'
fig.layout.coloraxis.colorbar.title = 'Correlation<br>'
fig.layout.margin.t = 150
fig
```
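As an aside, the diagonal can also be blanked more concisely with a numpy mask; this is an alternative to the list comprehension, not what the article uses:

```python
import numpy as np
import pandas as pd

# Toy correlation matrix standing in for the real one
corr_matrix = pd.DataFrame(
    [[1.0, 0.4], [0.4, 1.0]],
    index=['support', 'data'],
    columns=['support', 'data'])

# mask replaces cells where the condition is True with NaN;
# np.eye marks exactly the diagonal
corr_df = corr_matrix.mask(np.eye(len(corr_matrix), dtype=bool))
print(corr_df)
```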
Possible improvements and next steps
- Filter for a subset of URLs (community pages, shop, etc.)
- Filter for a subset of words under a certain category
- Add counts and correlations as text to the chart
- Use another criterion for selecting words (other than most used)
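For the first idea, filtering for a subset of URLs is a one-liner with `str.contains`; the '/docs/' path below is a made-up example:

```python
import pandas as pd

# Toy crawl frame for illustration
df = pd.DataFrame({
    'url': ['https://example.com/docs/setup', 'https://example.com/shop/item'],
    'title': ['Setup guide', 'Item'],
})

# Keep only documentation pages before re-running the analysis
docs = df[df['url'].str.contains('/docs/', regex=False)]
print(docs['title'].tolist())  # ['Setup guide']
```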
```python
(df[['title', 'term_schema', 'term_standard', 'term_warehouse']]
 .sort_values('term_schema', ascending=False)
 .head(15)['title']
 .tolist())
```
['Supermetrics Documentation | Apple Public Data Standard Data Warehouse Schema',
'Supermetrics Documentation | Snapchat Marketing Standard Data Warehouse Schema',
'Supermetrics Documentation | Rakuten Advertising Standard Data Warehouse Schema',
'Supermetrics Documentation | Bing Webmaster Tools Standard Data Warehouse Schema',
'Supermetrics Documentation | Prisjakt Standard Data Warehouse Schema',
'Supermetrics Documentation | Piwik PRO Standard Data Warehouse Schema',
'Supermetrics Documentation | Partnerize Standard Data Warehouse Schema',
'Supermetrics Documentation | Pinterest Organic (Legacy) Standard Data Warehouse Schema',
'Supermetrics Documentation | Quora Ads Standard Data Warehouse Schema',
'Supermetrics Documentation | Nielsen Digital Ad Ratings Standard Data Warehouse Schema',
'Supermetrics Documentation | Nexxen DSP Standard Data Warehouse Schema',
'Supermetrics Documentation | YouTube Standard Data Warehouse Schema',
'Supermetrics Documentation | Matomo Standard Data Warehouse Schema',
'Supermetrics Documentation | MediaMath Standard Data Warehouse Schema',
'Supermetrics Documentation | X Public Data (Twitter) Standard Data Warehouse Schema']