Qucs-S S-parameter Viewer & RF Synthesis Tools

Public Member Functions | |
| __init__ (self, bytes markup, Optional[_Encodings] known_definite_encodings=[], Optional[Literal["ascii", "xml", "html"]] smart_quotes_to=None, bool is_html=False, Optional[_Encodings] exclude_encodings=[], Optional[_Encodings] user_encodings=None, Optional[_Encodings] override_encodings=None) | |
| Optional[_Encoding] | declared_html_encoding (self) |
| Optional[str] | find_codec (self, _Encoding charset) |
| Tuple[str, bool] | numeric_character_reference (cls, int numeric) |
| bytes | detwingle (cls, bytes in_bytes, _Encoding main_encoding="utf8", _Encoding embedded_encoding="windows-1252") |
Public Attributes | |
| smart_quotes_to | |
| tried_encodings | |
| contains_replacement_characters | |
| is_html | |
| log | |
| detector | |
| markup | |
| unicode_markup | |
| original_encoding | |
Static Public Attributes | |
| bytes | markup |
| Optional[str] | unicode_markup |
| bool | contains_replacement_characters |
| Optional[_Encoding] | original_encoding |
| Optional[str] | smart_quotes_to |
| List[Tuple[_Encoding, str]] | tried_encodings |
| Logger | log |
| dict | CHARSET_ALIASES |
| list | ENCODINGS_WITH_SMART_QUOTES |
| dict | MS_CHARS |
| dict | MS_CHARS_TO_ASCII |
| dict | WINDOWS_1252_TO_UTF8 |
| Set | ENUMERATED_NONCHARACTERS |
| list | MULTIBYTE_MARKERS_AND_SIZES |
| int | FIRST_MULTIBYTE_MARKER = MULTIBYTE_MARKERS_AND_SIZES[0][0] |
| int | LAST_MULTIBYTE_MARKER = MULTIBYTE_MARKERS_AND_SIZES[-1][1] |
Protected Member Functions | |
| bytes | _sub_ms_char (self, re.Match match) |
| Optional[str] | _convert_from (self, _Encoding proposed, str errors="strict") |
| str | _to_unicode (self, bytes data, _Encoding encoding, str errors="strict") |
| Optional[str] | _codec (self, _Encoding charset) |
A class for detecting the encoding of a bytestring containing an
HTML or XML document, and decoding it to Unicode. If the source
encoding is windows-1252, `UnicodeDammit` can also replace
Microsoft smart quotes with their HTML or XML equivalents.
:param markup: HTML or XML markup in an unknown encoding.
:param known_definite_encodings: When determining the encoding
of ``markup``, these encodings will be tried first, in
order. In HTML terms, this corresponds to the "known
definite encoding" step defined in `section 13.2.3.1 of the HTML standard <https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding>`_.
:param user_encodings: These encodings will be tried after the
``known_definite_encodings`` have been tried and failed, and
after an attempt to sniff the encoding by looking at a
byte order mark has failed. In HTML terms, this
corresponds to the step "user has explicitly instructed
the user agent to override the document's character
encoding", defined in `section 13.2.3.2 of the HTML standard <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>`_.
:param override_encodings: A **deprecated** alias for
``known_definite_encodings``. Any encodings here will be tried
immediately after the encodings in
``known_definite_encodings``.
:param smart_quotes_to: By default, Microsoft smart quotes will,
like all other characters, be converted to Unicode
characters. Setting this to ``ascii`` will convert them to ASCII
quotes instead. Setting it to ``xml`` will convert them to XML
entity references, and setting it to ``html`` will convert them
to HTML entity references.
:param is_html: If True, ``markup`` is treated as an HTML
document. Otherwise it's treated as an XML document.
:param exclude_encodings: These encodings will not be considered,
even if the sniffing code thinks they might make sense.
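The ordering described by these parameters can be illustrated with a minimal standard-library sketch. It is not the `UnicodeDammit` implementation: `decode_with_candidates` is a hypothetical helper, the byte order mark and declared-encoding sniffing steps are omitted, and the trailing fallbacks (`utf-8`, then `windows-1252`) are an assumption made for the example.

```python
from typing import Iterable, List, Optional, Tuple


def decode_with_candidates(
    markup: bytes,
    known_definite_encodings: Iterable[str] = (),
    user_encodings: Iterable[str] = (),
    exclude_encodings: Iterable[str] = (),
) -> Tuple[Optional[str], Optional[str], List[str]]:
    """Hypothetical helper: try candidate encodings in priority order.

    Returns (decoded_text, encoding_used, encodings_tried). The first two
    items are None when no candidate could decode the markup.
    """
    excluded = {name.lower() for name in exclude_encodings}
    tried: List[str] = []
    # Assumed fallbacks for this sketch only; the real class also sniffs
    # byte order marks and in-document declarations between these steps.
    candidates = (
        list(known_definite_encodings)
        + list(user_encodings)
        + ["utf-8", "windows-1252"]
    )
    for encoding in candidates:
        if encoding.lower() in excluded or encoding in tried:
            continue
        tried.append(encoding)
        try:
            return markup.decode(encoding, errors="strict"), encoding, tried
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: move on to the next one.
            continue
    return None, None, tried
```

Called on `b"caf\xe9"` with no hints, this sketch rejects `utf-8` and settles on `windows-1252`; passing `exclude_encodings=["windows-1252"]` leaves it with no usable candidate.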
| Optional[str] bs4.dammit.UnicodeDammit._convert_from | ( | self, _Encoding proposed, str errors = "strict" | ) | protected |
Attempt to convert the markup to the proposed encoding.
:param proposed: The name of a character encoding.
:param errors: An error handling strategy, used when calling `str`.
:return: The converted markup, or `None` if the proposed
encoding/error handling strategy didn't work.
| bytes bs4.dammit.UnicodeDammit._sub_ms_char | ( | self, re.Match match | ) | protected |
Changes an MS smart quote character to an XML or HTML entity, or an ASCII character.
TODO: Since this is only used to convert smart quotes, it could be simplified, and MS_CHARS_TO_ASCII made much less parochial.
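| str bs4.dammit.UnicodeDammit._to_unicode | ( | self, bytes data, _Encoding encoding, str errors = "strict" | ) | protected |
Given a bytestring and its encoding, decodes the string into Unicode.
:param encoding: The name of an encoding.
:param errors: An error handling strategy, used when calling `str`.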
| Optional[_Encoding] bs4.dammit.UnicodeDammit.declared_html_encoding | ( | self | ) |
If the markup is an HTML document, returns the encoding, if any, declared *inside* the document.
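As a rough illustration of what an encoding "declared inside the document" looks like, the following standard-library sketch searches the start of a bytestring for an XML declaration or, for HTML, a `<meta>` charset. The function name, the regular expressions, and the 1024-byte search window are assumptions made for this example and are not taken from the library.

```python
import re
from typing import Optional

# Hypothetical patterns for this sketch; the library's own patterns may differ.
XML_DECL = re.compile(
    rb'^\s*<\?xml[^>]*?encoding=["\']([A-Za-z0-9._-]+)["\']', re.I
)
META_CHARSET = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.I
)


def sniff_declared_encoding(markup: bytes, is_html: bool = False) -> Optional[str]:
    """Return the lowercased encoding name declared in the markup, if any."""
    head = markup[:1024]  # assumed window: declarations sit near the top
    match = XML_DECL.search(head)
    if match is None and is_html:
        # Only HTML documents are searched for a <meta> charset.
        match = META_CHARSET.search(head)
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()
```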
| bytes bs4.dammit.UnicodeDammit.detwingle | ( | cls, bytes in_bytes, _Encoding main_encoding = "utf8", _Encoding embedded_encoding = "windows-1252" | ) |
Fix characters from one encoding embedded in some other encoding.
Currently the only situation supported is Windows-1252 (or its
subset ISO-8859-1), embedded in UTF-8.
:param in_bytes: A bytestring that you suspect contains
characters from multiple encodings. Note that this *must*
be a bytestring. If you've already converted the document
to Unicode, you're too late.
:param main_encoding: The primary encoding of ``in_bytes``.
:param embedded_encoding: The encoding that was used to embed characters
in the main document.
:return: A bytestring similar to ``in_bytes``, in which
``embedded_encoding`` characters have been converted to
their ``main_encoding`` equivalents.
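The idea of repairing Windows-1252 bytes embedded in UTF-8 can be sketched with the standard library alone. This is an illustrative approximation, not the library's code: `detwingle_sketch` is a hypothetical name, it handles only the UTF-8/Windows-1252 pairing, and it validates candidate UTF-8 sequences by attempting to decode them.

```python
def detwingle_sketch(in_bytes: bytes) -> bytes:
    """Re-encode stray Windows-1252 bytes as UTF-8; keep valid UTF-8 as-is."""
    out = bytearray()
    pos = 0
    while pos < len(in_bytes):
        byte = in_bytes[pos]
        if byte < 0x80:
            # Plain ASCII is identical in both encodings.
            out.append(byte)
            pos += 1
            continue
        # Length of the UTF-8 sequence this lead byte would start, if any.
        if 0xC2 <= byte <= 0xDF:
            size = 2
        elif 0xE0 <= byte <= 0xEF:
            size = 3
        elif 0xF0 <= byte <= 0xF4:
            size = 4
        else:
            size = 0  # continuation byte or invalid lead byte
        chunk = in_bytes[pos:pos + size]
        if size and len(chunk) == size:
            try:
                chunk.decode("utf-8")
            except UnicodeDecodeError:
                pass  # not real UTF-8; fall through to the single-byte path
            else:
                out += chunk
                pos += size
                continue
        # Treat the byte as Windows-1252 and re-encode it as UTF-8.
        # Bytes undefined in Windows-1252 become U+FFFD in this sketch.
        char = bytes([byte]).decode("windows-1252", errors="replace")
        out += char.encode("utf-8")
        pos += 1
    return bytes(out)
```

For example, a UTF-8 snowman followed by Windows-1252 curly quotes comes out as a bytestring that decodes cleanly as UTF-8.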
| Optional[str] bs4.dammit.UnicodeDammit.find_codec | ( | self, _Encoding charset | ) |
Look up the Python codec corresponding to a given character set.
:param charset: The name of a character set.
:return: The name of a Python codec.
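A lookup of this kind can be approximated with `codecs.lookup` from the standard library. The sketch below is an assumption about how such a lookup might work, not the library's implementation: `lookup_codec` and `EXAMPLE_ALIASES` are hypothetical names, and the alias entries are illustrative only.

```python
import codecs
from typing import Optional

# Illustrative alias entries for this sketch only.
EXAMPLE_ALIASES = {"macintosh": "mac-roman", "x-sjis": "shift-jis"}


def lookup_codec(charset: Optional[str]) -> Optional[str]:
    """Return the canonical Python codec name for a charset, or None."""
    if not charset:
        return None
    name = charset.strip().lower()
    # Try the alias (or the name itself), then two common spelling variants.
    candidates = (
        EXAMPLE_ALIASES.get(name, name),
        name.replace("-", ""),
        name.replace("-", "_"),
    )
    for candidate in candidates:
        try:
            return codecs.lookup(candidate).name
        except LookupError:
            continue
    return None
```

Returning `None` for an unknown name, rather than raising, matches the `Optional[str]` return type in the signature above.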
| Tuple[str, bool] bs4.dammit.UnicodeDammit.numeric_character_reference | ( | cls, int numeric | ) |
This (mostly) implements the algorithm described in "Numeric character reference end state" from the HTML spec:
https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state
The algorithm is designed to convert numeric character references like "&#9731;" to Unicode characters like "☃".
:return: A 2-tuple (character, replaced). `character` is the Unicode
character corresponding to the numeric reference and `replaced` is
whether or not an unresolvable character was replaced with
REPLACEMENT CHARACTER.
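The main branches of that algorithm can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's method: `numeric_reference_sketch` is a hypothetical name, the Windows-1252 remapping of code points 0x80 to 0x9F is done through Python's `windows-1252` codec instead of the spec's explicit table, and the `replaced` flag is set only when U+FFFD is substituted.

```python
from typing import Tuple

REPLACEMENT = "\N{REPLACEMENT CHARACTER}"


def numeric_reference_sketch(numeric: int) -> Tuple[str, bool]:
    """Map a numeric character reference to (character, replaced)."""
    # Null, out-of-range, and surrogate code points cannot be represented,
    # so they are replaced with U+FFFD.
    if numeric == 0 or numeric > 0x10FFFF or 0xD800 <= numeric <= 0xDFFF:
        return REPLACEMENT, True
    # References in the C1 control range are conventionally interpreted as
    # Windows-1252 bytes (for example, 0x80 becomes the euro sign).
    if 0x80 <= numeric <= 0x9F:
        try:
            return bytes([numeric]).decode("windows-1252"), False
        except UnicodeDecodeError:
            pass  # no Windows-1252 mapping; keep the code point as-is
    return chr(numeric), False
```

With this sketch, `numeric_reference_sketch(9731)` yields the snowman character with `replaced` set to `False`, while `numeric_reference_sketch(0)` yields U+FFFD with `replaced` set to `True`.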