What is a mapping, and what does it mean to invert it?
To make it immediately clear what we are trying to do, let's assume we have the following mapping of movies to genres (each movie naturally belonging to multiple genres):
movie | genres |
---|---|
MOVIE_1 | action, comedy |
MOVIE_2 | action, drama |
MOVIE_3 | comedy, crime, action |
Inverting this mapping means that we want to look at things from the other side. We want the genre to be in the left column, and for each genre, a list of the movies that fall under it:
genre | movies |
---|---|
action | MOVIE_1, MOVIE_2, MOVIE_3 |
comedy | MOVIE_1, MOVIE_3 |
drama | MOVIE_2 |
crime | MOVIE_3 |
We can clearly see which are the most popular genres, and we can get the counts of movies under each genre.
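The inversion shown in the two tables above can be sketched in plain Python. This is a minimal illustration using the toy data from the tables; the variable names `movie_genres`, `genre_movies`, and `genre_counts` are assumptions made for this example.

```python
# Toy data copied from the movie table above.
movie_genres = {
    "MOVIE_1": ["action", "comedy"],
    "MOVIE_2": ["action", "drama"],
    "MOVIE_3": ["comedy", "crime", "action"],
}

genre_movies = {}
for movie, genres in movie_genres.items():
    for genre in genres:
        # Create the list the first time a genre is seen, then append to it.
        genre_movies.setdefault(genre, []).append(movie)

# Number of movies per genre, most popular first.
genre_counts = sorted(
    ((genre, len(movies)) for genre, movies in genre_movies.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

print(genre_movies)
print(genre_counts)
```

Each movie is visited once, and every one of its genres receives that movie in its list, so the result has one entry per genre.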
I have a more extensive example notebook on inverting mappings of movies and genres if you are interested.
Inverting the mappings of news articles and their topics/keywords
Let’s run another example where we have URLs. Each URL is classified with a set of keywords or tags, and we now want to do the same thing and analyze things from the keyword point of view.
 | loc | news_keywords |
---|---|---|
0 | https://www.nytimes.com/athletic/5565896/2024/... | NaN |
1 | https://www.nytimes.com/2024/06/14/opinion/mis... | Freedom of the Press, News Sources, Confidenti... |
2 | https://www.nytimes.com/athletic/5565934/2024/... | NaN |
3 | https://www.nytimes.com/es/2024/06/14/espanol/... | Royal Families, Parades, Cancer, Birthdays, Ke... |
4 | https://www.nytimes.com/athletic/live-blogs/ce... | NaN |
Split the keywords:
 | loc | keywords |
---|---|---|
0 | https://www.nytimes.com/athletic/5565896/2024/... | NaN |
1 | https://www.nytimes.com/2024/06/14/opinion/mis... | [Freedom of the Press, News Sources, Confident... |
2 | https://www.nytimes.com/athletic/5565934/2024/... | NaN |
3 | https://www.nytimes.com/es/2024/06/14/espanol/... | [Royal Families, Parades, Cancer, Birthdays, K... |
4 | https://www.nytimes.com/athletic/live-blogs/ce... | NaN |
Or, looking only at rows that have keywords, as an array of URL and keyword-list pairs:
array([['https://www.nytimes.com/2024/06/14/opinion/mississippi-press-freedom-republicans.html',
list(['Freedom of the Press', 'News Sources', 'Confidential Status of', 'News and News Media', 'Supreme Courts (State)', 'Ethics and Official Misconduct', 'Mississippi Today', 'Supreme Court (US)', 'Bryant', 'Phil', 'DeSantis', 'Ron', 'Reeves', 'Tate (1974- )', 'Mississippi'])],
['https://www.nytimes.com/es/2024/06/14/espanol/kate-middleton-cumpleanos-rey-carlos.html',
list(['Royal Families', 'Parades', 'Cancer', 'Birthdays', 'Kensington Palace', 'Catherine', 'Princess of Wales', 'Charles III', 'King of the United Kingdom', 'Great Britain'])],
['https://www.nytimes.com/2024/06/14/us/politics/infowars-bankruptcy-alex-jones.html',
list(['Compensation for Damages (Law)', 'Decisions and Verdicts', 'Suits and Litigation (Civil)', 'Rumors and Misinformation', 'Bankruptcies', 'Jones', 'Alex (1974- )', 'Infowars', 'Sandy Hook Elementary School (Newtown', 'Conn)', 'School Shootings and Armed Attacks'])],
['https://www.nytimes.com/live/2024/06/14/world/israel-gaza-war-hamas',
list(['Israel-Gaza War (2023- )', 'Hamas', 'Palestinians', 'Gaza Strip', 'Israel', 'Rafah (Gaza Strip)', 'internal-open-access-search'])]],
dtype=object)
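The splitting step above could look roughly like the following sketch. The sample rows and `example.com` URLs are hypothetical stand-ins for the real data, and splitting on `", "` is an assumption about the separator; only the column names `loc`, `news_keywords`, and `keywords` come from the tables above.

```python
import pandas as pd

# Hypothetical sample standing in for the real data.
df = pd.DataFrame({
    "loc": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
    ],
    "news_keywords": [None, "Royal Families, Parades, Cancer", "Movies, Parades"],
})

# Split each comma-separated string into a list; missing values stay missing.
df["keywords"] = df["news_keywords"].str.split(", ")

# Keep only the rows that have keywords and view them as a 2-D object array.
url_keywords = df[["loc", "keywords"]].dropna().values
print(url_keywords)
```

Splitting on commas also separates names that contain a comma, which is why `'Bryant'` and `'Phil'` appear as two separate keywords in the array above.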
Create a `defaultdict` which takes a list as its default value for non-existing keys
This is similar to the normal Python `dict`, but it behaves slightly differently. When you want to get a key from a `dict` and it doesn’t exist, Python raises a `KeyError`. With a `defaultdict`, we provide a function that produces a default value for the cases where the key does not exist, and we get the normal behavior when the key exists.
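The difference between the two can be seen in a few lines. This is a small sketch; the names `plain`, `raised`, and `dd_demo` are made up for the illustration.

```python
from collections import defaultdict

plain = {}
try:
    plain["missing"]
    raised = False
except KeyError:
    # A regular dict refuses to return a value for an unknown key.
    raised = True

dd_demo = defaultdict(list)
# Accessing an unknown key calls list() and stores the new empty list under that key.
dd_demo["missing"].append("first value")

print(raised)
print(dict(dd_demo))
```

Because the default is produced by calling `list`, every new key starts with its own empty list, ready to be appended to.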
In this example we will create one with a list. Now we will go through each URL:keywords mapping one by one, appending the URL to the list stored under each of its keywords.
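The loop over the URL:keywords pairs can be sketched as follows. The pairs below are hypothetical stand-ins for the real data, and `url_keywords` is a name assumed for this example; `dd` matches the name used when building the DataFrame in the next step.

```python
from collections import defaultdict

# Hypothetical (URL, keywords) pairs standing in for the real data.
url_keywords = [
    ("https://example.com/a", ["Movies", "Parades"]),
    ("https://example.com/b", ["Movies"]),
    ("https://example.com/c", ["Parades", "Cancer"]),
]

dd = defaultdict(list)
for url, keywords in url_keywords:
    for keyword in keywords:
        # The first time a keyword appears, dd creates an empty list for it.
        dd[keyword].append(url)

print(dict(dd))
```

No check for an existing key is needed inside the loop, which is the main convenience of `defaultdict` over a plain `dict` here.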
Create a DataFrame with the inverted keys and values
keyword_url = pd.DataFrame(dd.items(), columns=['keyword', 'articles'])
keyword_url['count'] = keyword_url['articles'].str.len()
print(keyword_url.shape)
keyword_url.sort_values('count', ascending=False).head(10)
(1734, 3)
 | keyword | articles | count |
---|---|---|---|
44 | United States Politics and Government | [https://www.nytimes.com/live/2024/06/14/us/bi... | 41 |
45 | Trump | [https://www.nytimes.com/live/2024/06/14/us/bi... | 34 |
46 | Donald J | [https://www.nytimes.com/live/2024/06/14/us/bi... | 34 |
47 | Biden | [https://www.nytimes.com/live/2024/06/14/us/bi... | 32 |
43 | Presidential Election of 2024 | [https://www.nytimes.com/live/2024/06/14/us/bi... | 30 |
48 | Joseph R Jr | [https://www.nytimes.com/live/2024/06/14/us/bi... | 29 |
194 | Politics and Government | [https://www.nytimes.com/2024/06/14/world/afri... | 22 |
197 | International Relations | [https://www.nytimes.com/2024/06/14/world/afri... | 21 |
7 | Supreme Court (US) | [https://www.nytimes.com/2024/06/14/opinion/mi... | 20 |
184 | Movies | [https://www.nytimes.com/2024/06/14/arts/telev... | 19 |
We now have a new mapping, and we also have the counts of articles under each keyword. It is also straightforward to see the list of articles published under the desired keyword.
Going back to movies, these are the articles listed under the `Movies` keyword:
['https://www.nytimes.com/2024/06/14/arts/television/martin-starger-dead.html',
'https://www.nytimes.com/2024/06/14/well/mind/inside-out-2-anxiety.html',
'https://www.nytimes.com/2024/06/14/movies/action-movies-streaming.html',
'https://www.nytimes.com/2024/06/14/movies/new-movies-roundup.html',
'https://www.nytimes.com/interactive/2024/06/14/arts/weekend-editors-picks-inside-out2.html',
'https://www.nytimes.com/interactive/2024/06/14/books/new-paperbacks-page.html',
'https://www.nytimes.com/2024/06/14/movies/inside-out-2-clip.html',
'https://www.nytimes.com/2024/06/14/technology/remo-saraceni-dead.html',
'https://www.nytimes.com/2024/06/13/special-series/robert-eggers-the-witch-fear.html',
'https://www.nytimes.com/2024/06/13/opinion/return-travel-venice.html',
'https://www.nytimes.com/2024/06/13/movies/brat-pack-documentary-takeaways.html',
'https://www.nytimes.com/2024/06/13/arts/shoeshine-movie.html',
'https://www.nytimes.com/2024/06/13/movies/treasure-review.html',
'https://www.nytimes.com/2024/06/13/movies/summer-solstice-review.html',
'https://www.nytimes.com/2024/06/13/movies/reverse-the-curse-review.html',
'https://www.nytimes.com/2024/06/13/movies/firebrand-review.html',
'https://www.nytimes.com/2024/06/13/movies/tiger-stripes-review.html',
'https://www.nytimes.com/2024/06/13/movies/ride-review.html',
'https://www.nytimes.com/2024/06/13/movies/ghostlight-review.html']
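A list like the one above can be read straight from the inverted mapping by looking up the keyword. The sketch below uses placeholder URLs and a small hand-built `dd`; `movies_articles` is a name assumed for this example.

```python
from collections import defaultdict

# A small hand-built inverted mapping with placeholder URLs.
dd = defaultdict(list)
dd["Movies"].extend(["https://example.com/a", "https://example.com/b"])
dd["Parades"].append("https://example.com/c")

# .get() returns the stored list without creating an entry for absent keywords.
movies_articles = dd.get("Movies", [])
unknown_articles = dd.get("No Such Keyword", [])

print(movies_articles)
print(len(movies_articles))
```

Using `.get()` for lookups is a deliberate choice: indexing a `defaultdict` with an unknown keyword would silently add that keyword with an empty list.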