Inverting a mapping

A simple and powerful script to invert a mapping in which each key is mapped to multiple values. In the result, each value is mapped to all the keys it appeared under.
python
programming
content
data manipulation
Author

Elias Dabbas

Published

June 15, 2024

What is a mapping, and what does it mean to invert it?

To make it immediately clear what we are trying to do, let’s assume we have the following mapping of movies to genres (each movie naturally belonging to multiple genres):

movie    genres
MOVIE_1  action, comedy
MOVIE_2  action, drama
MOVIE_3  comedy, crime, action

Inverting this mapping means that we want to look at things from the other side. We want the genre to be in the left column, and for each genre, a list of the movies that fall under it:

genre   movies
action  MOVIE_1, MOVIE_2, MOVIE_3
comedy  MOVIE_1, MOVIE_3
drama   MOVIE_2
crime   MOVIE_3

We can now clearly see which genres are the most popular, and we can count the movies under each genre.
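As a minimal sketch, the movie example above can be inverted in a few lines of Python. The dictionary name `movie_genres` and the variable `genre_movies` are made up for illustration; the data is the toy table shown above.

```python
from collections import defaultdict

# The toy movie -> genres mapping from the table above
movie_genres = {
    'MOVIE_1': ['action', 'comedy'],
    'MOVIE_2': ['action', 'drama'],
    'MOVIE_3': ['comedy', 'crime', 'action'],
}

# Inverted mapping: genre -> list of movies
genre_movies = defaultdict(list)
for movie, genres in movie_genres.items():
    for genre in genres:
        genre_movies[genre].append(movie)

# Print each genre with its movie count and movies
for genre, movies in genre_movies.items():
    print(genre, len(movies), movies)
```

Because dictionaries keep insertion order, the genres appear in the order they were first encountered, which matches the inverted table above.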

I have a more extensive example notebook on inverting mappings of movies and genres if you are interested.

Inverting the mappings of news articles and their topics/keywords

Let’s run another example where we have URLs. Each URL is classified with a set of keywords or tags, and we now want to do the same thing and analyze things from the keyword point of view.

import advertools as adv
import pandas as pd
from collections import defaultdict  # our main tool for inverting the mapping
sitemap = adv.sitemap_to_df('https://www.nytimes.com/sitemaps/new/news.xml.gz')
sitemap[['loc', 'news_keywords']].head()
loc news_keywords
0 https://www.nytimes.com/athletic/5565896/2024/... NaN
1 https://www.nytimes.com/2024/06/14/opinion/mis... Freedom of the Press, News Sources, Confidenti...
2 https://www.nytimes.com/athletic/5565934/2024/... NaN
3 https://www.nytimes.com/es/2024/06/14/espanol/... Royal Families, Parades, Cancer, Birthdays, Ke...
4 https://www.nytimes.com/athletic/live-blogs/ce... NaN

Split the keywords:

sitemap['keywords'] = sitemap['news_keywords'].str.split(', ')
sitemap[['loc', 'keywords']].head()
loc keywords
0 https://www.nytimes.com/athletic/5565896/2024/... NaN
1 https://www.nytimes.com/2024/06/14/opinion/mis... [Freedom of the Press, News Sources, Confident...
2 https://www.nytimes.com/athletic/5565934/2024/... NaN
3 https://www.nytimes.com/es/2024/06/14/espanol/... [Royal Families, Parades, Cancer, Birthdays, K...
4 https://www.nytimes.com/athletic/live-blogs/ce... NaN

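Or, to see the full keyword lists without truncation, look at the raw values of the first four rows that have keywords: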

sitemap[['loc', 'keywords']].dropna(subset=['keywords'])[:4].values
array([['https://www.nytimes.com/2024/06/14/opinion/mississippi-press-freedom-republicans.html',
        list(['Freedom of the Press', 'News Sources', 'Confidential Status of', 'News and News Media', 'Supreme Courts (State)', 'Ethics and Official Misconduct', 'Mississippi Today', 'Supreme Court (US)', 'Bryant', 'Phil', 'DeSantis', 'Ron', 'Reeves', 'Tate (1974- )', 'Mississippi'])],
       ['https://www.nytimes.com/es/2024/06/14/espanol/kate-middleton-cumpleanos-rey-carlos.html',
        list(['Royal Families', 'Parades', 'Cancer', 'Birthdays', 'Kensington Palace', 'Catherine', 'Princess of Wales', 'Charles III', 'King of the United Kingdom', 'Great Britain'])],
       ['https://www.nytimes.com/2024/06/14/us/politics/infowars-bankruptcy-alex-jones.html',
        list(['Compensation for Damages (Law)', 'Decisions and Verdicts', 'Suits and Litigation (Civil)', 'Rumors and Misinformation', 'Bankruptcies', 'Jones', 'Alex (1974- )', 'Infowars', 'Sandy Hook Elementary School (Newtown', 'Conn)', 'School Shootings and Armed Attacks'])],
       ['https://www.nytimes.com/live/2024/06/14/world/israel-gaza-war-hamas',
        list(['Israel-Gaza War (2023- )', 'Hamas', 'Palestinians', 'Gaza Strip', 'Israel', 'Rafah (Gaza Strip)', 'internal-open-access-search'])]],
      dtype=object)

Create a defaultdict that uses list as its default factory, so that any non-existing key starts out with an empty list.

This is similar to the normal Python dict, but it behaves slightly differently.

When you try to get a key from a dict and it doesn’t exist, Python raises a KeyError.

With a defaultdict, we provide a default factory (a function that is called with no arguments to produce the default value) for the cases where the key does not exist, and we get the normal dict behavior when the key exists.
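The difference can be sketched with a tiny example. The names `plain` and `dd_demo` are hypothetical and used only for this illustration.

```python
from collections import defaultdict

# A normal dict raises KeyError for a missing key
plain = {}
try:
    plain['action']
except KeyError:
    missing_in_plain = True
else:
    missing_in_plain = False

# A defaultdict calls its factory (here: list) for a missing key
dd_demo = defaultdict(list)
dd_demo['action'].append('MOVIE_1')  # key missing: list() creates [], then we append
dd_demo['action'].append('MOVIE_2')  # key exists: normal dict behavior

# Caveat: merely reading a missing key with [] also inserts it
_ = dd_demo['western']

# .get() and the `in` operator do not trigger the factory
horror = dd_demo.get('horror')
```

One design consequence worth keeping in mind: since a lookup with square brackets creates the key, it is safer to use `.get()` or `in` when you only want to check whether a key is present.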

In this example we will create one with list as the default. Now we will go through each URL:keywords mapping one by one:

For every URL and its keywords
    For every keyword in keywords
        Check if that keyword exists as a key in the defaultdict
        If it does, append to its list the current URL
        If it does not, create an empty list, and append the current URL to it.
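dd = defaultdict(list)  # missing keywords start out with an empty list
# Drop rows without keywords, then visit each (URL, keywords) pair
for url, keywords in sitemap[['loc', 'keywords']].dropna().values:
    for keyword in keywords:
        dd[keyword].append(url)  # file the URL under each of its keywords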
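For comparison, the same inversion can probably be expressed with pandas alone, using `explode` and `groupby` instead of a defaultdict. This is an alternative technique, not the approach used in this article, and the small DataFrame below (with example.com URLs and made-up keywords) only stands in for the real sitemap.

```python
import pandas as pd

# Toy stand-in for the sitemap: made-up URLs and keywords, one row without keywords
df = pd.DataFrame({
    'loc': [
        'https://example.com/a',
        'https://example.com/b',
        'https://example.com/c',
    ],
    'keywords': [['politics', 'economy'], None, ['economy', 'movies']],
})

inverted = (
    df.dropna(subset=['keywords'])           # skip rows that have no keywords
      .explode('keywords')                   # one row per (URL, keyword) pair
      .groupby('keywords', sort=False)['loc']
      .agg(list)                             # collect the URLs of each keyword
      .reset_index()
      .rename(columns={'keywords': 'keyword', 'loc': 'articles'})
)
inverted['count'] = inverted['articles'].str.len()
print(inverted)
```

The defaultdict loop is easier to follow step by step, while this version stays inside pandas from start to finish. Which one reads better is largely a matter of taste.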

Create a DataFrame with the inverted keys and values

keyword_url = pd.DataFrame(dd.items(), columns=['keyword', 'articles'])
keyword_url['count'] = keyword_url['articles'].str.len()
print(keyword_url.shape)
keyword_url.sort_values('count', ascending=False).head(10)
(1734, 3)
keyword articles count
44 United States Politics and Government [https://www.nytimes.com/live/2024/06/14/us/bi... 41
45 Trump [https://www.nytimes.com/live/2024/06/14/us/bi... 34
46 Donald J [https://www.nytimes.com/live/2024/06/14/us/bi... 34
47 Biden [https://www.nytimes.com/live/2024/06/14/us/bi... 32
43 Presidential Election of 2024 [https://www.nytimes.com/live/2024/06/14/us/bi... 30
48 Joseph R Jr [https://www.nytimes.com/live/2024/06/14/us/bi... 29
194 Politics and Government [https://www.nytimes.com/2024/06/14/world/afri... 22
197 International Relations [https://www.nytimes.com/2024/06/14/world/afri... 21
7 Supreme Court (US) [https://www.nytimes.com/2024/06/14/opinion/mi... 20
184 Movies [https://www.nytimes.com/2024/06/14/arts/telev... 19

We now have a new mapping, and we also have the counts of articles under each keyword. It is also straightforward to see the list of articles published under any desired keyword.

Going back to movies:

keyword_url['articles'][184]
['https://www.nytimes.com/2024/06/14/arts/television/martin-starger-dead.html',
 'https://www.nytimes.com/2024/06/14/well/mind/inside-out-2-anxiety.html',
 'https://www.nytimes.com/2024/06/14/movies/action-movies-streaming.html',
 'https://www.nytimes.com/2024/06/14/movies/new-movies-roundup.html',
 'https://www.nytimes.com/interactive/2024/06/14/arts/weekend-editors-picks-inside-out2.html',
 'https://www.nytimes.com/interactive/2024/06/14/books/new-paperbacks-page.html',
 'https://www.nytimes.com/2024/06/14/movies/inside-out-2-clip.html',
 'https://www.nytimes.com/2024/06/14/technology/remo-saraceni-dead.html',
 'https://www.nytimes.com/2024/06/13/special-series/robert-eggers-the-witch-fear.html',
 'https://www.nytimes.com/2024/06/13/opinion/return-travel-venice.html',
 'https://www.nytimes.com/2024/06/13/movies/brat-pack-documentary-takeaways.html',
 'https://www.nytimes.com/2024/06/13/arts/shoeshine-movie.html',
 'https://www.nytimes.com/2024/06/13/movies/treasure-review.html',
 'https://www.nytimes.com/2024/06/13/movies/summer-solstice-review.html',
 'https://www.nytimes.com/2024/06/13/movies/reverse-the-curse-review.html',
 'https://www.nytimes.com/2024/06/13/movies/firebrand-review.html',
 'https://www.nytimes.com/2024/06/13/movies/tiger-stripes-review.html',
 'https://www.nytimes.com/2024/06/13/movies/ride-review.html',
 'https://www.nytimes.com/2024/06/13/movies/ghostlight-review.html']
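The integer label 184 used above is specific to this particular download of the sitemap, so it will likely differ on another day. A sketch of looking up the articles by keyword name instead, using a small made-up DataFrame (`keyword_url_demo`, with example.com URLs) that has the same columns as `keyword_url`:

```python
import pandas as pd

# Toy stand-in with the same columns as keyword_url; the data is made up
keyword_url_demo = pd.DataFrame({
    'keyword': ['Movies', 'Books'],
    'articles': [
        ['https://example.com/m1', 'https://example.com/m2'],
        ['https://example.com/b1'],
    ],
})

# Index the articles by keyword so they can be looked up by name
by_keyword = keyword_url_demo.set_index('keyword')['articles']
movie_articles = by_keyword['Movies']
print(movie_articles)
```

This works because each keyword appears only once in the inverted DataFrame, so the lookup returns a single list of URLs.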