How to extract body text content from a webpage
Usually when crawling, we are interested in extracting and analyzing the main content of a webpage. Whether this is a blog post, news article, product description, or an opinion piece, we want to get that text without the boilerplate (navigation, footers, etc.).
This is a flexible function that generates an XPath selector for that text, and it can be modified easily. The main idea is based on the fact that some HTML tags are primarily used to display text (h1, h2, span, b, etc.), while others aren’t (script, img, video).
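As a rough illustration of that idea, the sketch below builds the two predicates for a handful of tags. The short tag lists and the variable names are assumptions made for this example only. In XPath, `self::` tests an element's own tag and `ancestor::` tests the tags it is nested in.

```python
# Illustrative lists only; the full function uses much longer ones.
text_tags = ["h1", "p", "span"]
non_text_tags = ["nav", "script"]

# Keep an element if it is one of the text tags...
include_expr = " or ".join(f"self::{tag}" for tag in text_tags)
# ...and if it is not nested inside any of the non-text tags.
exclude_expr = " and ".join(f"not(ancestor::{tag})" for tag in non_text_tags)

xpath = f"//body//*[{include_expr}][{exclude_expr}]/text()"
print(xpath)
```

The first predicate selects the elements whose text we want, and the second drops any of them that sit inside a boilerplate container.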
Text tags
You can easily change which tags you want to consider as text tags and which you don’t, depending on the particular website you are crawling.
Non-text tags
Some of these tags might still contain text that you are interested in, for example iframes. Using “text” to differentiate between the two categories of tags is also not always ideal: nav elements usually contain text and links, but not as part of the main content of a page.
In any case you can modify the elements you want and get the final XPath expression.
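One way to make such a change without leaving a tag in both lists is sketched here. The helper name `move_tags` and the shortened lists are hypothetical. The resulting lists are what you would pass as `include_tags` and `exclude_tags`.

```python
def move_tags(tags, source, target):
    """Return new (source, target) lists with `tags` moved from source to target."""
    moving = set(tags)
    new_source = [tag for tag in source if tag not in moving]
    # Only add tags that are not already in the target list.
    new_target = target + [tag for tag in tags if tag not in target]
    return new_source, new_target


# Shortened example lists (assumed for illustration).
include_tags = ["h1", "p", "span"]
exclude_tags = ["nav", "script", "table", "td", "tr"]

# Treat table cells as text for a site that keeps its content in tables.
exclude_tags, include_tags = move_tags(["table", "tr", "td"], exclude_tags, include_tags)

print(include_tags)
print(exclude_tags)
```

Keeping the two lists disjoint matters because the function rejects any tag that appears in both.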
def bodytext_xpath(
    include_tags=[
        "a",
        "abbr",
        "address",
        "article",
        "b",
        "bdi",
        "bdo",
        "blockquote",
        "caption",
        "cite",
        "code",
        "data",
        "dd",
        "del",
        "details",
        "dfn",
        "div",
        "dl",
        "dt",
        "em",
        "figcaption",
        "h1",
        "h2",
        "h3",
        "h4",
        "h5",
        "h6",
        "header",
        "i",
        "ins",
        "kbd",
        "li",
        "main",
        "mark",
        "menu",
        "noscript",
        "ol",
        "p",
        "pre",
        "q",
        "rp",
        "rt",
        "ruby",
        "s",
        "samp",
        "section",
        "small",
        "span",
        "strong",
        "sub",
        "summary",
        "sup",
        "time",
        "u",
        "ul",
        "var",
    ],
    exclude_tags=[
        "area",
        "aside",
        "audio",
        "base",
        "button",
        "canvas",
        "col",
        "colgroup",
        "datalist",
        "embed",
        "fieldset",
        "footer",
        "form",
        "head",
        "iframe",
        "img",
        "input",
        "label",
        "legend",
        "link",
        "map",
        "meta",
        "nav",
        "object",
        "optgroup",
        "option",
        "output",
        "param",
        "picture",
        "script",
        "search",
        "select",
        "source",
        "style",
        "svg",
        "table",
        "tbody",
        "td",
        "textarea",
        "tfoot",
        "th",
        "thead",
        "title",
        "tr",
        "track",
        "video",
    ],
):
    """Create an XPath expression to extract body text from a web page.

    This approximates the page content that you are probably interested in
    (article text, product description, blog post, etc.).

    You can change which tags are included and which are not by modifying the
    available parameters.

    Parameters
    ----------
    include_tags : list
        A list of tags that typically contain text elements. You can add to it
        (for example, if you want to include text in iframe or td elements).
    exclude_tags : list
        A list of tags that typically don't contain text (video, image, etc.).
        If you add a tag to include_tags, make sure to remove it from this list.

    Returns
    -------
    xpath_selector : str
        An XPath selector to be used in scraping the main body text of a webpage.
    """
    common_elements = set(include_tags).intersection(exclude_tags)
    if common_elements:
        raise ValueError(
            f"Please make sure you don't include and exclude the same elements: {common_elements}"
        )
    # Sort copies so the caller's lists (and the defaults) are not modified.
    include_tags_expr = " or ".join(f"self::{tag}" for tag in sorted(include_tags))
    exclude_tags_expr = " and ".join(
        f"not(ancestor::{tag})" for tag in sorted(exclude_tags)
    )
    final_exp = "//body//*[" + include_tags_expr + "][" + exclude_tags_expr + "]/text()"
    return final_exp
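Running this function gives us an XPath expression that you can try with any SEO crawler that supports custom XPath extraction. The exact expression depends on the tag lists passed in; one such expression looks like this: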
//body//*[self::a or self::abbr or self::address or self::b or self::blockquote or self::cite or self::code or self::dd or self::del or self::div or self::dl or self::dt or self::em or self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6 or self::header or self::i or self::ins or self::kbd or self::li or self::mark or self::ol or self::p or self::pre or self::q or self::section or self::small or self::span or self::strong or self::sub or self::sup or self::time or self::u or self::ul][not(ancestor::area) and not(ancestor::aside) and not(ancestor::audio) and not(ancestor::button) and not(ancestor::caption) and not(ancestor::col) and not(ancestor::colgroup) and not(ancestor::datalist) and not(ancestor::details) and not(ancestor::embed) and not(ancestor::fieldset) and not(ancestor::footer) and not(ancestor::form) and not(ancestor::head) and not(ancestor::iframe) and not(ancestor::img) and not(ancestor::input) and not(ancestor::label) and not(ancestor::legend) and not(ancestor::link) and not(ancestor::map) and not(ancestor::meta) and not(ancestor::nav) and not(ancestor::noscript) and not(ancestor::object) and not(ancestor::optgroup) and not(ancestor::option) and not(ancestor::output) and not(ancestor::param) and not(ancestor::picture) and not(ancestor::script) and not(ancestor::select) and not(ancestor::source) and not(ancestor::style) and not(ancestor::svg) and not(ancestor::table) and not(ancestor::tbody) and not(ancestor::td) and not(ancestor::textarea) and not(ancestor::tfoot) and not(ancestor::th) and not(ancestor::thead) and not(ancestor::title) and not(ancestor::tr) and not(ancestor::track) and not(ancestor::video)]/text()
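To try an expression of this kind outside a crawler, here is a minimal sketch that uses lxml. The choice of library is an assumption, since the text above does not name one, and the sample HTML and the shortened expression are made up for illustration.

```python
from lxml import html

# A shortened expression with the same shape as the one above.
xpath = (
    "//body//*[self::h1 or self::p or self::span]"
    "[not(ancestor::nav) and not(ancestor::footer)]/text()"
)

# Hypothetical page: a nav bar, a heading, one paragraph, and a footer.
page = """
<html><head><title>Demo</title></head><body>
<nav><a href="/">Home</a> <span>Menu</span></nav>
<h1>Title</h1>
<p>First <span>inline</span> paragraph.</p>
<footer><p>Copyright</p></footer>
</body></html>
"""

tree = html.fromstring(page)
# Each match is a text node; drop whitespace-only ones and join the rest.
pieces = [text.strip() for text in tree.xpath(xpath) if text.strip()]
body_text = " ".join(pieces)
print(body_text)
```

The text inside nav and footer is skipped even though span and p are text tags, because the second predicate rejects anything nested in an excluded element.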