The plan:
- Get a list of documents (crawl file title tags, h1 tags, keywords, product descriptions, etc.)
- Find the top N most frequently used words/terms
- Locate the documents where these terms are used
- Compute the correlation of these terms
- Visualize with a heatmap
Getting a list of documents
In this example I will use title tags from a crawl file. You can easily use any other text column and run the same code again.
| | url | title |
|---|---|---|
| 0 | https://supermetrics.com/ | Supermetrics: Turn your marketing data into op... |
| 1 | https://hub.supermetrics.com | Log In Or Create An Account - Supermetrics |
| 2 | https://supermetrics.com/book-demo | Book a free demo of the Supermetrics marketing... |
| 3 | https://supermetrics.com/contact | Contact Supermetrics - Supermetrics |
| 4 | https://supermetrics.com/start-trial | Start Free Trial - Supermetrics |
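If the crawl was done with advertools, the output is a JSON-lines file (one JSON object per line) that pandas reads directly. A minimal sketch, using an in-memory buffer with two sample lines in place of a real crawl file:

```python
import pandas as pd
from io import StringIO

# Two sample JSON lines stand in for a real crawl file here;
# for an actual crawl, pass the path to the .jl file instead.
sample_jl = StringIO(
    '{"url": "https://supermetrics.com/", "title": "Supermetrics: Turn your marketing data into opportunity"}\n'
    '{"url": "https://supermetrics.com/contact", "title": "Contact Supermetrics - Supermetrics"}\n'
)

# lines=True tells pandas each line is a separate JSON record
df = pd.read_json(sample_jl, lines=True)
print(df[['url', 'title']])
```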
Finding the top N most frequent terms
We have the flexibility of getting words, bigrams, trigrams, or any pattern we want to use, as well as remove any stopwords that we don’t think are relevant.
In this case, the name of the website is removed, because it is present in every title.
```python
import advertools as adv

N = 25
rm_words = [
    '-', '|', 'to', 'and', 'in', 'for', 'a', 'the', 'with', 'your', 'of',
    '&', 'supermetrics'
]

top_words = adv.word_frequency(
    df['title'].dropna(),
    rm_words=rm_words).head(N)
top_words.head(10)
```
| | word | abs_freq |
|---|---|---|
| 0 | support | 1031 |
| 1 | data | 664 |
| 2 | how | 549 |
| 3 | documentation | 392 |
| 4 | guide | 354 |
| 5 | | 352 |
| 6 | marketing | 232 |
| 7 | ads | 207 |
| 8 | connection | 194 |
| 9 | studio | 183 |
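For intuition, the counting step amounts to splitting the titles into words, dropping the stopwords, and tallying what remains. A minimal pure-Python sketch of that idea (not advertools' actual implementation):

```python
from collections import Counter

# Toy titles for illustration
titles = [
    'Start Free Trial - Supermetrics',
    'Contact Supermetrics - Supermetrics',
]
rm_words = {'-', 'supermetrics'}

# Split, lowercase, drop stopwords, then count
counts = Counter(
    word
    for title in titles
    for word in title.lower().split()
    if word not in rm_words
)
print(counts.most_common(3))
```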
Locating the documents where the top terms are used
This is done by creating a new boolean column per term, indicating whether or not the document (the page title in this case) contains that term.
Keep in mind that there are regex pitfalls here: the pattern “love” would also match the word “glove”, which is not what we want.
One way to handle this is to add the word-boundary metacharacter \b before and after each word.
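A quick check of the difference the boundaries make:

```python
import re

# Without boundaries, 'love' matches inside 'glove'
no_boundary = bool(re.search(r'love', 'glove'))
# With \b on both sides, only the whole word matches
with_boundary = bool(re.search(r'\blove\b', 'glove'))
whole_word = bool(re.search(r'\blove\b', 'love it'))

print(no_boundary, with_boundary, whole_word)  # True False True
```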
```python
import re

for word in top_words['word'][:N]:
    # re.escape guards against words containing regex metacharacters
    pattern = r'\b' + re.escape(word) + r'\b'
    df[f'term_{word}'] = df['title'].str.lower().str.contains(pattern, regex=True)

df.filter(regex='^title$|term_').iloc[:5, :5]
```
| | title | term_support | term_data | term_how | term_documentation |
|---|---|---|---|---|---|
| 0 | Supermetrics: Turn your marketing data into op... | False | True | False | False |
| 1 | Log In Or Create An Account - Supermetrics | False | False | False | False |
| 2 | Book a free demo of the Supermetrics marketing... | False | True | False | False |
| 3 | Contact Supermetrics - Supermetrics | False | False | False | False |
| 4 | Start Free Trial - Supermetrics | False | False | False | False |
We now strip the “term_” prefix from the column names.
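One way to do that with a `rename`, shown here on a toy DataFrame (in the article this runs on the real `df` with its term columns):

```python
import pandas as pd

# Toy frame standing in for the real one
df = pd.DataFrame({
    'title': ['Start Free Trial - Supermetrics'],
    'term_support': [False],
    'term_data': [False],
})

# Strip "term_" only where it is actually a prefix
df = df.rename(
    columns=lambda c: c[len('term_'):] if c.startswith('term_') else c)
print(df.columns.tolist())  # ['title', 'support', 'data']
```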
Compute the correlation between the terms
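The matrix below can be produced with pandas' `DataFrame.corr`, which treats the boolean indicator columns as 0/1. A sketch with toy data (in the article the input is the term columns built above):

```python
import pandas as pd

# Toy indicator columns standing in for the real term columns
terms = pd.DataFrame({
    'support': [True, False, True, False],
    'data':    [False, True, True, False],
})

# Pearson correlation between the 0/1 indicators
corr_matrix = terms.astype(int).corr()
print(corr_matrix.round(2))
```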
| | support | data | how | documentation | guide |
|---|---|---|---|---|---|
| support | 1.000000 | -0.042314 | 0.279659 | -0.355278 | 0.412882 |
| data | -0.042314 | 1.000000 | 0.104377 | 0.149224 | -0.105533 |
| how | 0.279659 | 0.104377 | 1.000000 | -0.227517 | -0.197683 |
| documentation | -0.355278 | 0.149224 | -0.227517 | 1.000000 | -0.174243 |
| guide | 0.412882 | -0.105533 | -0.197683 | -0.174243 | 1.000000 |
Visualize with a heatmap
If we immediately create the heatmap of the correlations, we will get a diagonal showing a perfect 1.0 correlation between each term and itself, which would be misleading.
So, before visualizing, we convert the values on the diagonal of the matrix to None.
```python
import pandas as pd
import plotly.express as px

# Blank out the diagonal (each term vs. itself)
corr = [[corr_matrix.values[i][j] if i != j else None
         for j in range(len(corr_matrix))]
        for i in range(len(corr_matrix))]

corr_df = pd.DataFrame(
    corr,
    columns=corr_matrix.columns,
    index=corr_matrix.columns)

fig = px.imshow(
    corr_df,
    height=900,
    width=900,
    color_continuous_scale='RdBu',
    color_continuous_midpoint=0,
    template='none',
    labels={'color': 'correlation'},
    title='Term correlation in page titles')
fig.layout.xaxis.side = 'top'
fig.layout.xaxis.showgrid = False
fig.layout.yaxis.showgrid = False
fig.layout.xaxis.tickangle = -38
fig.layout.coloraxis.colorbar.outlinecolor = 'white'
fig.layout.coloraxis.colorbar.title = 'Correlation<br>'
fig.layout.margin.t = 150
fig
```
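As an aside, the diagonal can also be blanked more concisely with a numpy mask; this is an alternative to the list comprehension, not what the article uses:

```python
import numpy as np
import pandas as pd

# Toy correlation matrix standing in for the real one
corr_matrix = pd.DataFrame(
    [[1.0, 0.4], [0.4, 1.0]],
    index=['support', 'data'],
    columns=['support', 'data'])

# mask replaces cells where the condition is True with NaN;
# np.eye marks exactly the diagonal
corr_df = corr_matrix.mask(np.eye(len(corr_matrix), dtype=bool))
print(corr_df)
```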
Possible improvements and next steps
- Filter for a subset of URLs (community pages, shop, etc.)
- Filter for a subset of words under a certain category
- Add counts and correlations as text to the chart
- Use another criterion for selecting words (other than most used)
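For the first idea, filtering for a subset of URLs is a one-liner with `str.contains`; the '/docs/' path below is a made-up example:

```python
import pandas as pd

# Toy crawl frame for illustration
df = pd.DataFrame({
    'url': ['https://example.com/docs/setup', 'https://example.com/shop/item'],
    'title': ['Setup guide', 'Item'],
})

# Keep only documentation pages before re-running the analysis
docs = df[df['url'].str.contains('/docs/', regex=False)]
print(docs['title'].tolist())  # ['Setup guide']
```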
```python
(df[['title', 'term_schema', 'term_standard', 'term_warehouse']]
 .sort_values('term_schema', ascending=False)
 .head(15)['title']
 .tolist())
```
['Supermetrics Documentation | Apple Public Data Standard Data Warehouse Schema',
'Supermetrics Documentation | Snapchat Marketing Standard Data Warehouse Schema',
'Supermetrics Documentation | Rakuten Advertising Standard Data Warehouse Schema',
'Supermetrics Documentation | Bing Webmaster Tools Standard Data Warehouse Schema',
'Supermetrics Documentation | Prisjakt Standard Data Warehouse Schema',
'Supermetrics Documentation | Piwik PRO Standard Data Warehouse Schema',
'Supermetrics Documentation | Partnerize Standard Data Warehouse Schema',
'Supermetrics Documentation | Pinterest Organic (Legacy) Standard Data Warehouse Schema',
'Supermetrics Documentation | Quora Ads Standard Data Warehouse Schema',
'Supermetrics Documentation | Nielsen Digital Ad Ratings Standard Data Warehouse Schema',
'Supermetrics Documentation | Nexxen DSP Standard Data Warehouse Schema',
'Supermetrics Documentation | YouTube Standard Data Warehouse Schema',
'Supermetrics Documentation | Matomo Standard Data Warehouse Schema',
'Supermetrics Documentation | MediaMath Standard Data Warehouse Schema',
'Supermetrics Documentation | X Public Data (Twitter) Standard Data Warehouse Schema']