Body Text XPath Selector

How to extract/scrape the main content of a webpage while crawling. A function that allows you to customize which tags to include/exclude.
crawling
scraping
advertools
xpath
Author

Elias Dabbas

Published

June 26, 2024

How to extract body text content from a webpage

Usually when crawling we are interested in extracting and analyzin the main content of a webapage. Whether this is a blog post, news article, product description, or an opinion piece, we want to get that text without the boilerplate (navigation, footers, etc.).

This is a flexible function that generates that XPath selector, and it can be modified easily. The main idea is based on the fact that some HTML tags are primarily used to display text (h1, h2, span, bold, etc), and others that aren’t (script, img, video).

Text tags

You can easily change which tags you want to consider as text tags and which you don’t, depending on your the particular website you are crawling.

Non-text tags

In some cases there are tags that might contain text that you might be interested in. For example iframes, which might include interesting content in some cases. Using “text” to differentiate between the two categories of tags might not be ideal in some cases. Usually nav elements contain text and links, but not as part of the main content of a page.

In any case you can modify the elements you want and get the final XPath expression.

def bodytext_xpath(
    include_tags=[
        "a",
        "abbr",
        "address",
        "article",
        "b",
        "bdi",
        "bdo",
        "blockquote",
        "caption",
        "cite",
        "code",
        "data",
        "dd",
        "del",
        "details",
        "dfn",
        "div",
        "dl",
        "dt",
        "em",
        "figcaption",
        "h1",
        "h2",
        "h3",
        "h4",
        "h5",
        "h6",
        "header",
        "i",
        "ins",
        "kbd",
        "li",
        "main",
        "mark",
        "menu",
        "noscript",
        "ol",
        "p",
        "pre",
        "q",
        "rp",
        "rt",
        "ruby",
        "s",
        "samp",
        "section",
        "small",
        "span",
        "strong",
        "sub",
        "summary",
        "sup",
        "time",
        "u",
        "ul",
        "var",
    ],
    exclude_tags=[
        "area",
        "aside",
        "audio",
        "base",
        "button",
        "canvas",
        "col",
        "colgroup",
        "datalist",
        "embed",
        "fieldset",
        "footer",
        "form",
        "head",
        "iframe",
        "img",
        "input",
        "label",
        "legend",
        "link",
        "map",
        "meta",
        "nav",
        "object",
        "optgroup",
        "option",
        "output",
        "param",
        "picture",
        "script",
        "search",
        "select",
        "source",
        "style",
        "svg",
        "table",
        "tbody",
        "td",
        "textarea",
        "tfoot",
        "th",
        "thead",
        "title",
        "tr",
        "track",
        "video",
    ],
):
    """Create an XPath expression to exract body text form a web page.

    This approximates the page content that you are probably interested in (article text, product description, blog post, etc.)
    You can change which tags are included and which are not by modifying the available parameters.

    Parameters
    ----------
    include_tags : list
        A list of tags that typically contain text elements. You can add to it (if you want to include text in iframes, or td for example.)
    exclude_tags : list
        A list of tags that typically don't contain text (video, image, etc.). Make sure to remove the ones you don't want from this list if you want to modify the first one.

    Returns
    -------
    xpath_selector : str
        An XPath selector to be used in scraping the main body text of a webpage.
    """
    common_elements = set(include_tags).intersection(exclude_tags)
    if common_elements:
        raise ValueError(f"Please make sure you don't include and exclude the same elements:{common_elements}")
    include_tags.sort()
    exclude_tags.sort()
    include_tags_expr = "".join([f"self::{tag} or " for tag in include_tags])[:-4]
    exclude_tags_expr = "".join(
        [f"not(ancestor::{tag}) and " for tag in exclude_tags]
    )[:-5]
    final_exp = "//body//*[" + include_tags_expr + "][" + exclude_tags_expr + "]/text()"
    return final_exp

Running this function give us the XPath expression that can be used with any SEO crawler if you want to try it:

bodytext_xpath()
//body//*[self::a or self::abbr or self::address or self::b or self::blockquote or self::cite or self::code or self::dd or self::del or self::div or self::dl or self::dt or self::em or self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6 or self::header or self::i or self::ins or self::kbd or self::li or self::mark or self::ol or self::p or self::pre or self::q or self::section or self::small or self::span or self::strong or self::sub or self::sup or self::time or self::u or self::ul][not(ancestor::area) and not(ancestor::aside) and not(ancestor::audio) and not(ancestor::button) and not(ancestor::caption) and not(ancestor::col) and not(ancestor::colgroup) and not(ancestor::datalist) and not(ancestor::details) and not(ancestor::embed) and not(ancestor::fieldset) and not(ancestor::footer) and not(ancestor::form) and not(ancestor::head) and not(ancestor::iframe) and not(ancestor::img) and not(ancestor::input) and not(ancestor::label) and not(ancestor::legend) and not(ancestor::link) and not(ancestor::map) and not(ancestor::meta) and not(ancestor::nav) and not(ancestor::noscript) and not(ancestor::object) and not(ancestor::optgroup) and not(ancestor::option) and not(ancestor::output) and not(ancestor::param) and not(ancestor::picture) and not(ancestor::script) and not(ancestor::select) and not(ancestor::source) and not(ancestor::style) and not(ancestor::svg) and not(ancestor::table) and not(ancestor::tbody) and not(ancestor::td) and not(ancestor::textarea) and not(ancestor::tfoot) and not(ancestor::th) and not(ancestor::thead) and not(ancestor::title) and not(ancestor::tr) and not(ancestor::track) and not(ancestor::video)]/text()

Examples

In this example you can see that the extracted text starts with “Select Page”, which is hidden. Then you can see “From Merchant Feed…”, right after the navigation links. The extracted text ends with “using WordLift”, and the remaining text is not extracted because it is part of the footer.

Here, there is a longer hidden text, “Skip to content…free trial”, then we start with the main content, “Sell everywhere…”, until “capture every payment”.