tkmilan.parser#

Parsers for custom formats.

Reuse existing parsers as much as possible; don’t reinvent the wheel.

Functions

`escape_LTML`(string, *[, quote])	Convert the `string` into safe Lite Text Markup Language.
`parse_LTML`(ltml[, __parser])	Parse Lite Text Markup Language using the default settings.

Classes

`LTML`([tag_act, tag_simplify])	Lite Text Markup Language parser.
`LTML_Attributes`(identifier, classes, data, ...)	`LTML` attributes (see the main documentation).

class tkmilan.parser.LTML(tag_act: bool = True, tag_simplify: bool = True)#

Bases: HTMLParser

Lite Text Markup Language parser.

This is a small subset of HTML, to mark spans of text with metadata.

The HTML subset is a very small piece of the whole HTML specification. See TAGS_FULL and TAGS_SE for the list of allowed tags. All text outside the tags is included as untagged spans.

The API is analogous to HTMLParser, but supports only the LTML subset. After creating the parser object, feed it data using the feed function, and reuse the parser with new data by calling the reset function.

Note

Chunked feed usage is not supported, and is not even detected. Send the entire text at once, or at least break it at the tags.

The output command list is available in cmdlist, which is a list of model.TextElement. The main tag of each text element is the LTML tag name (i.e., for <span>TXT</span>, the text element main tag is span).

Most attributes are ignored, but others have a special meaning and restrictions:

Attribute id:
An optional identifier.

This is sent as an id:$id text element tag, if present. The value must not have any : anywhere.

Attribute class:
A list of strings, separated by whitespace.

This is sent as a class:$class text element tag, once for each split class name. The value must not have any : anywhere.

Attributes data-*:
All the data attributes are collected in a dictionary, removing the data- prefix from the attribute name to get the key, and using the value as-is.

Tag a; Attribute href:
A simple string, NOT a URL as in HTML.

"a:href" is sent as an a-$href text element tag. The value must not have any :: anywhere.
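The attribute rules above can be sketched as a standalone function. This is only an illustration of the documented mapping, not the tkmilan implementation: the name `attribute_tags`, its return shape, and the choice to raise `ValueError` on a forbidden separator are all assumptions.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def attribute_tags(
    tag: str, attrs: Sequence[Tuple[str, Optional[str]]]
) -> Tuple[List[str], Dict[str, str]]:
    """Hypothetical helper: map (name, value) attribute pairs to tags and data."""
    tags: List[str] = []
    data: Dict[str, str] = {}
    for name, value in attrs:
        value = value or ""  # HTMLParser reports valueless attributes as None
        if name == "id" and value:
            if ":" in value:
                raise ValueError("id must not contain ':'")
            tags.append(f"id:{value}")
        elif name == "class":
            # One tag per whitespace-separated class name
            for class_name in value.split():
                if ":" in class_name:
                    raise ValueError("class must not contain ':'")
                tags.append(f"class:{class_name}")
        elif name.startswith("data-"):
            # Strip the prefix to get the key, keep the value as-is
            data[name[len("data-"):]] = value
        elif tag == "a" and name == "href":
            if "::" in value:
                raise ValueError("href must not contain '::'")
            tags.append(f"a-{value}")
        # Any other attribute is ignored
    return tags, data


example_tags, example_data = attribute_tags(
    "a",
    [("id", "x1"), ("class", "warn big"), ("data-row", "3"),
     ("href", "open"), ("title", "ignored")],
)
print(example_tags)  # ['id:x1', 'class:warn', 'class:big', 'a-open']
print(example_data)  # {'row': '3'}
```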

For the tags in TAGS_SIMPLE, if there are no attributes, the tag is “simplified” by turning it into a regular text span. This is controlled by the “tag_simplify” argument.

In addition to these tags, each tag type includes the so-called Automatic Counter tags (ACT). These are markers for each instance of each tag type, including its index position (starting at 0). The format is $tag::$index. This is controlled by the “tag_act” argument.

Note

As an example of the Automatic Counter tags, consider the following LTML:

<span>1</span><span>2</span>

This results in two elements with the following tags:

1: span (Main Tag) span::0 (ACT)
2: span (Main Tag) span::1 (ACT)
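The counting behaviour in this example can be reproduced with a small sketch built directly on the standard library `HTMLParser`. The class `CounterTagSketch` and its `elements` attribute are hypothetical stand-ins, assumed here for illustration; the real `LTML` class produces `model.TextElement` objects and may track its counters differently.

```python
from collections import Counter
from html.parser import HTMLParser


class CounterTagSketch(HTMLParser):
    """Hypothetical stand-in for the ACT logic, not the tkmilan LTML class."""

    def reset(self):
        super().reset()
        self.counters = Counter()  # instances seen so far, per tag name
        self.open_tags = []        # stack of (main tag, ACT) pairs
        self.elements = []         # collected (text, tags) pairs

    def handle_starttag(self, tag, attrs):
        index = self.counters[tag]  # index positions start at 0
        self.counters[tag] += 1
        self.open_tags.append((tag, f"{tag}::{index}"))

    def handle_endtag(self, tag):
        if self.open_tags:
            self.open_tags.pop()

    def handle_data(self, data):
        # Text outside any tag gets an empty tag tuple
        tags = self.open_tags[-1] if self.open_tags else ()
        self.elements.append((data, tags))


parser = CounterTagSketch()
parser.feed("<span>1</span><span>2</span>")
for text, tags in parser.elements:
    print(text, tags)
# 1 ('span', 'span::0')
# 2 ('span', 'span::1')
```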

Parameters:

tag_act (bool) – Include the Automatic Counter tags (ACT), as described above. Defaults to True.
tag_simplify (bool) – Simplify some tags (TAGS_SIMPLE). Defaults to True.

feed(data)#

Feed data to the parser.

Call this as often as you want, with as little or as much text as you want (may include ‘\n’).

Inherited from HTMLParser.feed

parse(ltml: str) → Sequence[TextElement]#

Helper function to parse a standalone string.

Reset the parser.
Feed “ltml” to the parser.
Return the output cmdlist.
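The three steps above follow a common pattern for reusing one `HTMLParser` instance. The sketch below shows that pattern with a hypothetical `SpanCollector` class that only collects raw text. It assumes, as the `reset` documentation suggests, that the output list is recreated on every reset, so earlier results are not overwritten by later calls.

```python
from html.parser import HTMLParser
from typing import List, Sequence


class SpanCollector(HTMLParser):
    """Hypothetical minimal parser, used only to show the parse() pattern."""

    def reset(self) -> None:
        super().reset()
        # A fresh list on every reset keeps previously returned results intact
        self.cmdlist: List[str] = []

    def handle_data(self, data: str) -> None:
        self.cmdlist.append(data)

    def parse(self, markup: str) -> Sequence[str]:
        self.reset()       # 1. reset the parser
        self.feed(markup)  # 2. feed the whole string at once
        return self.cmdlist  # 3. return the output list


collector = SpanCollector()
first = collector.parse("plain <b>bold</b>")
second = collector.parse("<i>other</i>")
print(first)   # ['plain ', 'bold']
print(second)  # ['other']
```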

reset() → None#

Reset the instance. Loses all unprocessed data.

This is called when instantiating the parser.

—

Extended from HTMLParser.reset

TAGS_FULL = {'a', 'b', 'i', 'span'}#

Full Tags.

Like this: <tag>...</tag>.

TAGS_MODEL = {'a': <class 'tkmilan.model.TextElementInline'>, 'b': <class 'tkmilan.model.TextElementInline'>, 'br': <class 'tkmilan.model.TextElement_br'>, 'i': <class 'tkmilan.model.TextElementInline'>, 'span': <class 'tkmilan.model.TextElementInline'>}#

Mapping of tag names to their corresponding model classes.

TAGS_SE = {'br'}#

Start/End Tags.

Like this: <tag />

TAGS_SIMPLE = {'span'}#

Simple Tags, that is, Full Tags that can be simplified.

See the tag_simplify option.

cmdlist: List[TextElement]#

Output command list.

class tkmilan.parser.LTML_Attributes(identifier: str | None = None, classes: ~typing.AbstractSet[str] = <factory>, data: ~typing.Mapping[str, str] = <factory>, href: str | None = None)#

Bases: object

LTML attributes (see the main documentation).

tkmilan.parser.escape_LTML(string: str, *, quote: bool = True) → str#

Convert the string into safe Lite Text Markup Language.

Wraps the upstream escape function, see html.escape.

Note

You can use the upstream function directly, instead of importing this.

Parameters:

string (str) – The string in which to escape special LTML characters.
quote (bool) – Escape quotes too. Usually needed, since the result might be embedded in an attribute value. Defaults to True.
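Since this function wraps `html.escape`, the effect of the `quote` parameter can be illustrated with the upstream function directly. The sample string is made up for this example, and it is assumed that `escape_LTML` passes its arguments through unchanged.

```python
from html import escape

raw = '<b>"Tom" & Jerry</b>'

# Default: angle brackets, ampersands and quotes are all escaped
escaped = escape(raw)
print(escaped)
# &lt;b&gt;&quot;Tom&quot; &amp; Jerry&lt;/b&gt;

# quote=False leaves quote characters alone; only safe outside attribute values
escaped_no_quote = escape(raw, quote=False)
print(escaped_no_quote)
# &lt;b&gt;"Tom" &amp; Jerry&lt;/b&gt;
```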

tkmilan.parser.parse_LTML(ltml: str, __parser: ~tkmilan.parser.LTML = <tkmilan.parser.LTML object>) → Sequence[TextElement]#

Parse Lite Text Markup Language using the default settings.

This always reuses the same parser, so it should be fast.

Parameters:

ltml (str) – The string to parse.