| Dict[str, str] | HTML_ENTITY_TO_CHARACTER |
| Dict[str, str] | CHARACTER_TO_HTML_ENTITY |
| Pattern[str] | CHARACTER_TO_HTML_ENTITY_RE |
| Pattern[str] | CHARACTER_TO_HTML_ENTITY_WITH_AMPERSAND_RE |
| dict | CHARACTER_TO_XML_ENTITY |
| | ANY_ENTITY_RE = re.compile("&(#\\d+|#x[0-9a-fA-F]+|\\w+);", re.I) |
| Pattern | BARE_AMPERSAND_OR_BRACKET |
| Pattern | AMPERSAND_OR_BRACKET = re.compile("([<>&])") |
The ability to substitute XML or HTML entities for certain characters.
◆ _populate_class_variables()
| None bs4.dammit.EntitySubstitution._populate_class_variables(cls) |
protected
Initialize variables used by this class to manage the plethora of
HTML5 named entities.
This function sets the following class variables:
CHARACTER_TO_HTML_ENTITY: A mapping of Unicode strings like "⦨" to
entity names like "angmsdaa". When a single Unicode string has
multiple entity names, we try to choose the most commonly-used
name.
HTML_ENTITY_TO_CHARACTER: A mapping of entity names like "angmsdaa" to
Unicode strings like "⦨".
CHARACTER_TO_HTML_ENTITY_RE: A regular expression matching (almost) any
Unicode string that corresponds to an HTML5 named entity.
CHARACTER_TO_HTML_ENTITY_WITH_AMPERSAND_RE: A very similar
regular expression to CHARACTER_TO_HTML_ENTITY_RE, but which
also matches unescaped ampersands. This is used by the 'html'
formatter to provide backwards compatibility, even though the HTML5
spec allows most ampersands to go unescaped.
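As a rough illustration of how the two mappings relate, the sketch below builds comparable tables from the standard library's `html.entities.html5` dictionary. It is not the library's implementation: the names `build_entity_tables`, `CHAR_TO_ENTITY`, and `ENTITY_TO_CHAR` are hypothetical, and the tie-breaking rule (shortest name, preferring all-lowercase) is an assumption standing in for "the most commonly-used name".

```python
from html.entities import html5


def build_entity_tables():
    """Build character->name and name->character tables (simplified sketch)."""
    char_to_entity = {}
    entity_to_char = {}

    def rank(name):
        # Assumed preference: shorter names first, then all-lowercase names.
        return (len(name), name != name.lower(), name)

    for name, chars in html5.items():
        # html5 lists some legacy names twice, with and without a trailing ';'.
        # Keep only the ';' form and strip the semicolon.
        if not name.endswith(";"):
            continue
        name = name[:-1]
        entity_to_char[name] = chars
        current = char_to_entity.get(chars)
        if current is None or rank(name) < rank(current):
            char_to_entity[chars] = name
    return char_to_entity, entity_to_char


CHAR_TO_ENTITY, ENTITY_TO_CHAR = build_entity_tables()
```

With these tables, `ENTITY_TO_CHAR["angmsdaa"]` gives "⦨" and `CHAR_TO_ENTITY["⦨"]` gives "angmsdaa", mirroring the examples in the description above.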
◆ _substitute_html_entity()
| str bs4.dammit.EntitySubstitution._substitute_html_entity(cls, re.Match matchobj) |
protected
Used with a regular expression to substitute the
appropriate HTML entity for a special character string.
◆ _substitute_xml_entity()
| str bs4.dammit.EntitySubstitution._substitute_xml_entity(cls, re.Match matchobj) |
protected
Used with a regular expression to substitute the
appropriate XML entity for a special character string.
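The "used with a regular expression" pattern can be sketched with `re.sub` and a callback that receives each match object. The dictionary and pattern below are copied from the values documented on this page; the function name `substitute_xml_entity` and the sample string are illustrative assumptions, not the library's code.

```python
import re

# Values as documented for CHARACTER_TO_XML_ENTITY and AMPERSAND_OR_BRACKET.
CHARACTER_TO_XML_ENTITY = {
    "'": "apos",
    '"': "quot",
    "&": "amp",
    "<": "lt",
    ">": "gt",
}
AMPERSAND_OR_BRACKET = re.compile("([<>&])")


def substitute_xml_entity(matchobj):
    """Callback for re.sub: turn one matched character into '&name;'."""
    entity = CHARACTER_TO_XML_ENTITY[matchobj.group(0)]
    return "&" + entity + ";"


# The pattern finds each special character; the callback supplies the entity.
escaped = AMPERSAND_OR_BRACKET.sub(substitute_xml_entity, "a < b && c > d")
```

Here `escaped` is `a &lt; b &amp;&amp; c &gt; d`. The HTML variant would follow the same shape with a larger lookup table.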
◆ quoted_attribute_value()
| str bs4.dammit.EntitySubstitution.quoted_attribute_value(cls, str value) |
Make a value into a quoted XML attribute, possibly escaping it.
Most strings will be quoted using double quotes.
Bob's Bar -> "Bob's Bar"
If a string contains double quotes, it will be quoted using
single quotes.
Welcome to "my bar" -> 'Welcome to "my bar"'
If a string contains both single and double quotes, the
double quotes will be escaped, and the string will be quoted
using double quotes.
Welcome to "Bob's Bar" -> "Welcome to &quot;Bob's Bar&quot;"
:param value: The XML attribute value to quote
:return: The quoted value
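The three quoting rules above can be sketched as a small standalone function. This is a reimplementation for illustration under the rules as described, and the name `quote_attribute_value` is hypothetical.

```python
def quote_attribute_value(value: str) -> str:
    """Quote an attribute value following the three rules described above."""
    quote_with = '"'
    if '"' in value:
        if "'" in value:
            # Both quote styles are present: escape the double quotes
            # and keep double quotes as the delimiter.
            value = value.replace('"', "&quot;")
        else:
            # Only double quotes are present: delimit with single quotes.
            quote_with = "'"
    return quote_with + value + quote_with


plain = quote_attribute_value("Bob's Bar")
double = quote_attribute_value('Welcome to "my bar"')
both = quote_attribute_value('Welcome to "Bob\'s Bar"')
```

Escaping the double quotes rather than the single quotes is a reasonable choice because `&quot;` is valid in both HTML and XML.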
◆ substitute_html()
| str bs4.dammit.EntitySubstitution.substitute_html(cls, str s) |
Replace certain Unicode characters with named HTML entities.
This differs from ``data.encode(encoding, 'xmlcharrefreplace')``
in that the goal is to make the result more readable (to those
with ASCII displays) rather than to recover from
errors. There's absolutely nothing wrong with a UTF-8 string
containing a LATIN SMALL LETTER E WITH ACUTE, but replacing that
character with "&eacute;" will make it more readable to some
people.
:param s: The string to be modified.
:return: The string with some Unicode characters replaced with
HTML entities.
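To make the contrast with `xmlcharrefreplace` concrete, the following standard-library sketch compares a numeric character reference with a named one. The helper `to_named` is a hypothetical stand-in that uses `html.entities.codepoint2name`; it does not escape `&`, `<`, or `>` and is not the method's actual implementation.

```python
from html.entities import codepoint2name

text = "caf\u00e9"  # "café", with LATIN SMALL LETTER E WITH ACUTE

# Numeric reference: what the 'xmlcharrefreplace' error handler produces.
numeric = text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def to_named(s: str) -> str:
    """Replace non-ASCII characters that have an HTML 4 name with '&name;'."""
    out = []
    for ch in s:
        code = ord(ch)
        if code > 127 and code in codepoint2name:
            out.append("&" + codepoint2name[code] + ";")
        else:
            out.append(ch)
    return "".join(out)


named = to_named(text)
```

`numeric` is `caf&#233;` while `named` is `caf&eacute;`, which is the more readable form the description refers to.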
◆ substitute_html5()
| str bs4.dammit.EntitySubstitution.substitute_html5(cls, str s) |
Replace certain Unicode characters with named HTML entities
using HTML5 rules.
Specifically, this method is much less aggressive about
escaping ampersands than substitute_html. Only ambiguous
ampersands are escaped, per the HTML5 standard:
"An ambiguous ampersand is a U+0026 AMPERSAND character (&)
that is followed by one or more ASCII alphanumerics, followed
by a U+003B SEMICOLON character (;), where these characters do
not match any of the names given in the named character
references section."
Unlike substitute_html5_raw, this method assumes HTML entities
were converted to Unicode characters on the way in, as
Beautiful Soup does. By the time Beautiful Soup does its work,
the only ambiguous ampersands that need to be escaped are the
ones that were escaped in the original markup when mentioning
HTML entities.
:param s: The string to be modified.
:return: The string with some Unicode characters replaced with
HTML entities.
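The quoted definition of an ambiguous ampersand can be illustrated in isolation. The sketch below covers only that definition, using the standard library's `html.entities.html5` table as the list of known names; it does not reproduce the behavior of either `substitute_html5` or `substitute_html5_raw`, and `escape_ambiguous_ampersands` is a hypothetical name.

```python
import re
from html.entities import html5

# '&', one or more ASCII alphanumerics, then ';'
CANDIDATE = re.compile(r"&([0-9A-Za-z]+);")


def escape_ambiguous_ampersands(s: str) -> str:
    """Escape '&' only where it starts an unknown named reference."""
    def fix(match):
        if match.group(1) + ";" in html5:
            # A real named character reference: leave it alone.
            return match.group(0)
        # Looks like a reference but names nothing: escape the ampersand.
        return "&amp;" + match.group(0)[1:]

    return CANDIDATE.sub(fix, s)


result = escape_ambiguous_ampersands("AT&T &bogus; &eacute;")
```

In this sample, the ampersand in `AT&T` is left alone because no semicolon follows the alphanumerics, `&bogus;` becomes `&amp;bogus;`, and `&eacute;` is untouched.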
◆ substitute_html5_raw()
| str bs4.dammit.EntitySubstitution.substitute_html5_raw(cls, str s) |
Replace certain Unicode characters with named HTML entities
using HTML5 rules.
substitute_html5_raw is similar to substitute_html5 but it is
designed for standalone use (whereas substitute_html5 is
designed for use with Beautiful Soup).
:param s: The string to be modified.
:return: The string with some Unicode characters replaced with
HTML entities.
◆ substitute_xml()
| str bs4.dammit.EntitySubstitution.substitute_xml(cls, str value, bool make_quoted_attribute = False) |
Replace special XML characters with named XML entities.
The less-than sign will become &lt;, the greater-than sign
will become &gt;, and any ampersands will become &amp;. If you
want ampersands that seem to be part of an entity definition
to be left alone, use `substitute_xml_containing_entities`
instead.
:param value: A string to be substituted.
:param make_quoted_attribute: If True, then the string will be
quoted, as befits an attribute value.
:return: A version of ``value`` with special characters replaced
with named entities.
◆ substitute_xml_containing_entities()
| str bs4.dammit.EntitySubstitution.substitute_xml_containing_entities(cls, str value, bool make_quoted_attribute = False) |
Substitute XML entities for special XML characters.
:param value: A string to be substituted. The less-than sign will
become &lt;, the greater-than sign will become &gt;, and any
ampersands that are not part of an entity definition will
become &amp;.
:param make_quoted_attribute: If True, then the string will be
quoted, as befits an attribute value.
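The difference between escaping every ampersand and escaping only bare ones comes down to the two patterns documented for this class. The sketch below applies both to the same input; the patterns are copied from this page, while the `escape` helper, the `ENTITIES` table, and the sample string are illustrative assumptions.

```python
import re

# Matches every '<', '>' and '&'.
AMPERSAND_OR_BRACKET = re.compile("([<>&])")
# Matches '<', '>' and only those '&' that do not begin an entity reference.
BARE_AMPERSAND_OR_BRACKET = re.compile(
    "([<>]|" "&(?!#\\d+;|#x[0-9a-fA-F]+;|\\w+;)" ")"
)

ENTITIES = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}


def escape(pattern, s):
    """Replace each character matched by `pattern` with its XML entity."""
    return pattern.sub(lambda m: ENTITIES[m.group(0)], s)


sample = "AT&T <b> &copy; &#169;"
everything = escape(AMPERSAND_OR_BRACKET, sample)
bare_only = escape(BARE_AMPERSAND_OR_BRACKET, sample)
```

`everything` is `AT&amp;T &lt;b&gt; &amp;copy; &amp;#169;`, whereas `bare_only` is `AT&amp;T &lt;b&gt; &copy; &#169;`: the existing references survive because the negative lookahead skips ampersands that already start one.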
◆ BARE_AMPERSAND_OR_BRACKET
| Pattern bs4.dammit.EntitySubstitution.BARE_AMPERSAND_OR_BRACKET |
|
static |
Initial value:= re.compile(
"([<>]|" "&(?!#\\d+;|#x[0-9a-fA-F]+;|\\w+;)" ")"
)
◆ CHARACTER_TO_XML_ENTITY
| dict bs4.dammit.EntitySubstitution.CHARACTER_TO_XML_ENTITY |
|
static |
Initial value:= {
"'": "apos",
'"': "quot",
"&": "amp",
"<": "lt",
">": "gt",
}
The documentation for this class was generated from the following file:
- docs/help/help-venv/lib/python3.12/site-packages/bs4/dammit.py