Evaluating Content in Bulk with LLMs

Evaluating content

Evaluating content is not a straightforward process, and it is a subjective one as well. What you might rate as OK, I might rate as great, and so on. The other challenge is that the process does not scale, partly for these very reasons: people need to read each piece, judge it against a set of criteria, and somehow summarize the results.

We can make this easier by structuring the process: define clear criteria, and check whether or not a piece of content achieves each of them. This is the sweet spot where LLMs can help us: the problem is not structured enough to be solved by a regex or standard text analysis, yet it is not so unstructured that it requires experts to issue a full report on the quality of the content.

The plan:

Take an article
Create a set of questions about the article where the answers can either be True or False
Let the LLM evaluate the article and check which of the criteria it achieves
Repeat the same process with many articles with bulk prompting
Create a summary report

There are many ways to evaluate content, and an important one is Google’s helpful content guidelines. These are a set of subjective but relatively structured questions for evaluating content.

The report produced by scoring articles against these criteria is never going to be perfect, and small differences between articles won’t mean anything. If article A “achieves” 90% of the criteria, and article B achieves 89%, that’s not a meaningful comparison. We would be looking for large differences, and checking whether certain articles miss, let’s say, half the criteria. Or we could check how our articles are doing overall on a certain criterion, for example:

“Does the main heading or page title avoid exaggerating or being shocking in nature?”

and see what percentage of our content seems to be exaggerated or shocking.
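
To make the idea of looking only for large differences concrete, here is a minimal sketch. The article names, the answers, and the 50% cut-off are all made up for illustration; none of it comes from the actual evaluation.

```python
# Hypothetical results: article -> one boolean answer per criterion (made-up data).
results = {
    "article_a": [True, True, True, True, True, True, True, True, True, False],
    "article_b": [True, True, True, True, False, True, True, True, True, False],
    "article_c": [False, True, False, False, True, False, False, True, False, False],
}

THRESHOLD = 0.5  # assumed cut-off: flag articles meeting half the criteria or fewer


def share_met(answers):
    """Fraction of criteria answered True."""
    return sum(answers) / len(answers)


scores = {article: share_met(answers) for article, answers in results.items()}

# A and B differ by a single criterion, which is not worth acting on.
# Only articles at or below the threshold are flagged for a closer look.
flagged = [article for article, score in scores.items() if score <= THRESHOLD]

for article, score in scores.items():
    print(f"{article}: {score:.0%}")
print("flagged:", flagged)
```

The point of the threshold is to turn a noisy percentage into a coarse signal: the exact cut-off matters less than the fact that near-identical scores are treated the same.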

Here is a sample of the criteria

import os
import json
import advertools as adv
import pandas as pd
from openai import OpenAI
client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

Guidelines criteria and their categories

A sample of criteria/questions from each category. You can download the full helpful content criteria list if you are interested.

	category	question
0	Content and quality questions	Does the content provide original information, reporting, research, or analysis?
1	Content and quality questions	Does the content provide a substantial, complete, or comprehensive description of the topic?
2	Content and quality questions	Does the content provide insightful analysis or interesting information that is beyond the obvious?
3	Content and quality questions	If the content draws on other sources, does it avoid simply copying or rewriting those sources, and instead provide ...
4	Content and quality questions	Does the main heading or page title provide a descriptive, helpful summary of the content?
12	Expertise questions	Does the content present information in a way that makes you want to trust it, such as clear sourcing, evidence of t...
13	Expertise questions	If someone researched the site producing the content, would they come away with an impression that it is well-truste...
14	Expertise questions	Is this content written or reviewed by an expert or enthusiast who demonstrably knows the topic well?
15	Expertise questions	Does the content have any easily-verified factual errors?
16	Focus on people-first content	Do you have an existing or intended audience for your business or site that would find the content useful if they ca...
17	Focus on people-first content	Does your content clearly demonstrate first-hand expertise and a depth of knowledge (for example, expertise that com...
18	Focus on people-first content	Does your site have a primary purpose or focus?
19	Focus on people-first content	After reading your content, will someone leave feeling they've learned enough about a topic to help achieve their goal?
20	Focus on people-first content	Will someone reading your content leave feeling like they've had a satisfying experience?
21	Avoid creating search engine-first content	Is the content primarily made to attract visits from search engines?
22	Avoid creating search engine-first content	Are you producing lots of content on many different topics in hopes that some of it might perform well in search res...
23	Avoid creating search engine-first content	Are you using extensive automation to produce content on many topics?
24	Avoid creating search engine-first content	Are you mainly summarizing what others have to say without adding much value?
25	Avoid creating search engine-first content	Are you writing about things simply because they seem trending and not because you'd write about them otherwise for ...
32	Who (created the content)	Is it self-evident to your visitors who authored your content?
33	Who (created the content)	Do pages carry a byline, where one might be expected?
34	Who (created the content)	Do bylines lead to further information about the author or authors involved, giving background about them and the ar...
35	How (the content was created)	Is the use of automation, including AI-generation, self-evident to visitors through disclosures or in other ways?
36	How (the content was created)	Are you providing background about how automation or AI-generation was used to create content?
37	How (the content was created)	Are you explaining why automation or AI was seen as useful to produce content?
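
The code further down refers to a DataFrame named content_guidelines with these two columns, but how it is loaded is not shown. Here is a minimal sketch of building a small version by hand, using only a few of the questions from the table above. The full list would more likely be read from a file, so treat this construction as an assumption.

```python
import pandas as pd

# A hand-typed subset of the table above; the real list has more rows.
content_guidelines = pd.DataFrame({
    "category": [
        "Content and quality questions",
        "Content and quality questions",
        "Content and quality questions",
        "Expertise questions",
        "Expertise questions",
    ],
    "question": [
        "Does the content provide original information, reporting, research, or analysis?",
        "Does the content provide a substantial, complete, or comprehensive description of the topic?",
        "Does the main heading or page title provide a descriptive, helpful summary of the content?",
        "Is this content written or reviewed by an expert or enthusiast who demonstrably knows the topic well?",
        "Does the content have any easily-verified factual errors?",
    ],
})

# The list of questions that would be sent to the model for one category.
quality_questions = content_guidelines.loc[
    content_guidelines["category"] == "Content and quality questions", "question"
].tolist()
print(quality_questions)
```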

Get content to evaluate

We now need a set of articles to evaluate. We can easily get one by crawling a website and extracting the main content with custom extraction.


sitemap = adv.sitemap_to_df('https://nbastats.pro/robots.txt')  # (1)

url_list = sitemap['loc'].sample(50)  # (2)

adv.crawl(
    url_list=url_list,
    output_file='nbastats_crawl.jl',
    custom_settings={
        'LOG_FILE': 'nbastats_crawl.log',
    },
    xpath_selectors={
        # (3)
        'player_description': '//div[@class="col-lg-10 col-md-9"]/text() | //div//span[@id="more"]/text()'
    })

# Load the crawl output (JSON lines) into the DataFrame used in the next steps
crawldf = pd.read_json('nbastats_crawl.jl', lines=True)

1: Get the URLs from a sitemap
2: Random list of URLs to crawl
3: Special XPath selector to extract the article text

Sample rows and columns of the crawl dataset

	url	h1	player_description
0	https://nbastats.pro/player/John_Niemiera	John Niemiera Stats: NBA Career	Introducing John Niemiera: The Detroit Pistons' Sharpshooter@@When it comes to basketball, there are players who lea...
1	https://nbastats.pro/player/Maceo_Baston	Maceo Baston Stats: NBA Career	Maceo Baston: A Defensive Force on the Court@@When it comes to analyzing the impact of a basketball player, statisti...
2	https://nbastats.pro/player/Marko_Jaric	Marko Jaric Stats: NBA Career	Welcome to the profile page of Marko Jaric, a skilled and versatile NBA basketball player who has left his mark on t...
3	https://nbastats.pro/player/Elijah_Hughes	Elijah Hughes Stats: NBA Career	Elijah Hughes: Unveiling the Lesser-Known Talents of a Rising NBA Star@@In the world of professional basketball, the...
4	https://nbastats.pro/player/Lou_Tsioropoulos	Lou Tsioropoulos Stats: NBA Career	Meet Lou Tsioropoulos, a former NBA player who made his mark on the court during his time with the Boston Celtics. T...
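
The "@@" visible in player_description is how advertools joins multiple matches of the same selector into one string. Below is a small sketch of turning such a value back into separate text pieces and a rough word count. The string is a shortened, made-up example in the same shape as the column, not real crawl output, and split_matches is a hypothetical helper name.

```python
# Made-up example value in the same shape as the player_description column.
description = (
    "Maceo Baston: A Defensive Force on the Court"
    "@@When it comes to defense, numbers tell part of the story."
)


def split_matches(text, sep="@@"):
    """Split a multi-match string into its stripped, non-empty parts."""
    return [part.strip() for part in text.split(sep) if part.strip()]


parts = split_matches(description)

# Joining with a space gives the same result as description.replace('@@', ' ')
# for this example, because the parts carry no surrounding whitespace.
clean_text = " ".join(parts)
word_count = len(clean_text.split())

print(parts)
print(word_count)
```

A word count like this can help spot pages where the selector matched little or nothing before paying for an API call on them.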

Evaluate the content with OpenAI’s API

For each article and its main heading:
    Send the article, and its heading together with the set of questions to ask/check
Combine all responses in one DataFrame
Get averages and/or counts for each criterion

Send articles with questions

responses = []

# (1)
for url, h1, description in crawldf[['url', 'h1', 'player_description']].values:
    print(h1)
    completion = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # (2)
        seed=123,  # (3)
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"""
Please answer the following questions about this article and its title.
Respond in JSON where questions are keys and answers are values.
Answers should be boolean only.

article_title: {h1}

article_text: {description.replace('@@', ' ')}


questions: {content_guidelines.head(12)['question'].tolist()}

"""}
        ])
    # Keep the URL and heading next to each raw response for later parsing
    responses.append((url, h1, completion))

1: Evaluate the article together with the main heading
2: Set temperature at zero to minimize randomization (we want straightforward true/false answers)
3: Set a fixed seed so that the same input is more likely to reproduce the same output (the API treats this as best effort, not a guarantee)
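
One possible variation, which is an assumption and not part of the run described here: the Chat Completions API accepts response_format={"type": "json_object"}, which asks the model to return a bare JSON object rather than one wrapped in a Markdown fence. The sketch below only builds the request arguments, so it runs without an API key. build_request is a hypothetical helper name, and the heading and text passed to it are made up.

```python
def build_request(h1, article_text, questions, model="gpt-4o"):
    """Return keyword arguments for client.chat.completions.create (hypothetical helper)."""
    prompt = (
        "Please answer the following questions about this article and its title.\n"
        "Respond in JSON where questions are keys and answers are values.\n"
        "Answers should be boolean only.\n\n"
        f"article_title: {h1}\n\n"
        f"article_text: {article_text.replace('@@', ' ')}\n\n"
        f"questions: {questions}\n"
    )
    return {
        "model": model,
        "temperature": 0,
        "seed": 123,
        # JSON mode requires the prompt to mention JSON, which it does above.
        "response_format": {"type": "json_object"},
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    }


request = build_request(
    "Example Player Stats: NBA Career",          # made-up heading
    "A short intro@@followed by a second part",  # made-up text
    ["Does the content have any easily-verified factual errors?"],
)
# completion = client.chat.completions.create(**request)  # needs an API key
```

If this variation is used, the reply content should parse with json.loads directly, so any step that strips fence characters from the reply would need to be adjusted.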

Combine responses

response_dfs = []  # (1)
for url, h1, response in responses:
    content = response.choices[0].message.content
    # Keep only the JSON object, dropping the Markdown code fence around it
    content = content[content.find('{'):content.rfind('}') + 1]
    df = pd.DataFrame(json.loads(content).items())
    df['title'] = h1
    df['url'] = url
    response_dfs.append(df)

evaluation = pd.concat(response_dfs, ignore_index=True).rename(columns={0: 'question', 1: 'answer'})

1: Combine all responses into one DataFrame
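
Because the model's reply is free text, the parsing step is where things are most likely to break. Here is a defensive sketch, assuming the reply is a JSON object that may or may not be wrapped in a Markdown fence. parse_answers is a hypothetical helper, not part of the code above.

```python
import json


def parse_answers(content, questions):
    """Parse a model reply into {question: bool}, raising ValueError on surprises."""
    # Keep only the outermost JSON object, ignoring any fence or text around it.
    start, end = content.find("{"), content.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in reply")
    answers = json.loads(content[start:end + 1])

    missing = [q for q in questions if q not in answers]
    if missing:
        raise ValueError(f"missing answers for: {missing}")

    non_bool = [q for q in questions if not isinstance(answers[q], bool)]
    if non_bool:
        raise ValueError(f"non-boolean answers for: {non_bool}")

    # Return the answers in the order the questions were asked.
    return {q: answers[q] for q in questions}


questions = ["Does the content have any easily-verified factual errors?"]
fence = "`" * 3  # the three-backtick fence marker, built this way for readability
reply = f'{fence}json\n{{"{questions[0]}": false}}\n{fence}'

parsed = parse_answers(reply, questions)
print(parsed)
```

Failing loudly on a missing or non-boolean answer keeps bad rows out of the averages, where they would otherwise be hard to notice.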

Evaluation samples

	question	answer	title	url
0	Does the content provide original information, reporting, research, or analysis?	True	John Niemiera Stats: NBA Career	https://nbastats.pro/player/John_Niemiera
1	Does the content provide a substantial, complete, or comprehensive description of the topic?	True	John Niemiera Stats: NBA Career	https://nbastats.pro/player/John_Niemiera
2	Does the content provide insightful analysis or interesting information that is beyond the obvious?	True	John Niemiera Stats: NBA Career	https://nbastats.pro/player/John_Niemiera
12	Does the content provide original information, reporting, research, or analysis?	True	Maceo Baston Stats: NBA Career	https://nbastats.pro/player/Maceo_Baston
13	Does the content provide a substantial, complete, or comprehensive description of the topic?	True	Maceo Baston Stats: NBA Career	https://nbastats.pro/player/Maceo_Baston
14	Does the content provide insightful analysis or interesting information that is beyond the obvious?	True	Maceo Baston Stats: NBA Career	https://nbastats.pro/player/Maceo_Baston
24	Does the content provide original information, reporting, research, or analysis?	False	Marko Jaric Stats: NBA Career	https://nbastats.pro/player/Marko_Jaric
25	Does the content provide a substantial, complete, or comprehensive description of the topic?	True	Marko Jaric Stats: NBA Career	https://nbastats.pro/player/Marko_Jaric
26	Does the content provide insightful analysis or interesting information that is beyond the obvious?	False	Marko Jaric Stats: NBA Career	https://nbastats.pro/player/Marko_Jaric

Evaluation summary

Averages

(evaluation
 .rename(columns={'question': 'criterion', 'answer': 'evaluation'})
 .groupby(['criterion'])
 ['evaluation']
 .mean()
 .reset_index()
 .style
 .format({'evaluation': '{:.0%}'})
 .bar(subset=['evaluation'], color='lightgray'))

	criterion	evaluation
0	Does the content have any spelling or stylistic issues?	54%
1	Does the content provide a substantial, complete, or comprehensive description of the topic?	84%
2	Does the content provide insightful analysis or interesting information that is beyond the obvious?	54%
3	Does the content provide original information, reporting, research, or analysis?	52%
4	Does the content provide substantial value when compared to other pages in search results?	48%
5	Does the main heading or page title avoid exaggerating or being shocking in nature?	100%
6	Does the main heading or page title provide a descriptive, helpful summary of the content?	100%
7	If the content draws on other sources, does it avoid simply copying or rewriting those sources, and instead provide substantial additional value and originality?	52%
8	Is the content mass-produced by or outsourced to a large number of creators, or spread across a large network of sites, so that individual pages or sites don't get as much attention or care?	0%
9	Is the content produced well, or does it appear sloppy or hastily produced?	26%
10	Is this the sort of page you'd want to bookmark, share with a friend, or recommend?	48%
11	Would you expect to see this content in or referenced by a printed magazine, encyclopedia, or book?	48%

Counts

(evaluation
 .rename(columns={'question': 'criterion', 'answer': 'evaluation'})
 .groupby(['criterion'])
 ['evaluation']
 .sum()
 .reset_index()
 .style
 .format({'evaluation': '{:,}'})
 .bar(subset=['evaluation'], color='lightgray'))

	criterion	evaluation
0	Does the content have any spelling or stylistic issues?	27
1	Does the content provide a substantial, complete, or comprehensive description of the topic?	42
2	Does the content provide insightful analysis or interesting information that is beyond the obvious?	27
3	Does the content provide original information, reporting, research, or analysis?	26
4	Does the content provide substantial value when compared to other pages in search results?	24
5	Does the main heading or page title avoid exaggerating or being shocking in nature?	50
6	Does the main heading or page title provide a descriptive, helpful summary of the content?	50
7	If the content draws on other sources, does it avoid simply copying or rewriting those sources, and instead provide substantial additional value and originality?	26
8	Is the content mass-produced by or outsourced to a large number of creators, or spread across a large network of sites, so that individual pages or sites don't get as much attention or care?	0
9	Is the content produced well, or does it appear sloppy or hastily produced?	13
10	Is this the sort of page you'd want to bookmark, share with a friend, or recommend?	24
11	Would you expect to see this content in or referenced by a printed magazine, encyclopedia, or book?	24

It’s quite interesting that zero articles were evaluated as mass-produced. Those articles were actually produced by ChatGPT itself. ChatGPT is transcending itself!

And now we have a structured report evaluating fifty sample articles, each according to twelve structured questions. We can easily filter and check according to any criteria we want.
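
As a sketch of what filtering and checking could look like, the example below uses a tiny made-up DataFrame with the same columns as evaluation. The URLs, titles, and answers are invented for illustration only.

```python
import pandas as pd

q_original = "Does the content provide original information, reporting, research, or analysis?"
q_heading = "Does the main heading or page title provide a descriptive, helpful summary of the content?"

# Made-up rows in the same shape as the evaluation DataFrame.
evaluation = pd.DataFrame({
    "question": [q_original, q_heading] * 3,
    "answer": [True, True, False, True, False, False],
    "title": ["A", "A", "B", "B", "C", "C"],
    "url": ["https://example.com/a"] * 2
    + ["https://example.com/b"] * 2
    + ["https://example.com/c"] * 2,
})

# Articles that fail one specific criterion.
failing = evaluation.loc[
    (evaluation["question"] == q_original) & ~evaluation["answer"], "url"
].tolist()

# Share of criteria each article meets, lowest first.
per_article = evaluation.groupby("url")["answer"].mean().sort_values()

print(failing)
print(per_article)
```

The first filter answers "which pages have this problem?", while the per-article share gives a shortlist of pages to review first.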

There are actually thirty-eight guidelines to check, and we can repeat the process with the full set of questions and all the URLs for a more detailed evaluation.