Qucs-S S-parameter Viewer & RF Synthesis Tools
bs4.dammit.UnicodeDammit Class Reference

Public Member Functions

 __init__ (self, bytes markup, Optional[_Encodings] known_definite_encodings=[], Optional[Literal["ascii", "xml", "html"]] smart_quotes_to=None, bool is_html=False, Optional[_Encodings] exclude_encodings=[], Optional[_Encodings] user_encodings=None, Optional[_Encodings] override_encodings=None)
 
Optional[_Encoding] declared_html_encoding (self)
 
Optional[str] find_codec (self, _Encoding charset)
 
Tuple[str, bool] numeric_character_reference (cls, int numeric)
 
bytes detwingle (cls, bytes in_bytes, _Encoding main_encoding="utf8", _Encoding embedded_encoding="windows-1252")
 

Public Attributes

 smart_quotes_to
 
 tried_encodings
 
 contains_replacement_characters
 
 is_html
 
 log
 
 detector
 
 markup
 
 unicode_markup
 
 original_encoding
 

Static Public Attributes

bytes markup
 
Optional[str] unicode_markup
 
bool contains_replacement_characters
 
Optional[_Encoding] original_encoding
 
Optional[str] smart_quotes_to
 
List[Tuple[_Encoding, str]] tried_encodings
 
Logger log
 
dict CHARSET_ALIASES
 
list ENCODINGS_WITH_SMART_QUOTES
 
dict MS_CHARS
 
dict MS_CHARS_TO_ASCII
 
dict WINDOWS_1252_TO_UTF8
 
Set ENUMERATED_NONCHARACTERS
 
list MULTIBYTE_MARKERS_AND_SIZES
 
int FIRST_MULTIBYTE_MARKER = MULTIBYTE_MARKERS_AND_SIZES[0][0]
 
int LAST_MULTIBYTE_MARKER = MULTIBYTE_MARKERS_AND_SIZES[-1][1]
 

Protected Member Functions

bytes _sub_ms_char (self, re.Match match)
 
Optional[str] _convert_from (self, _Encoding proposed, str errors="strict")
 
str _to_unicode (self, bytes data, _Encoding encoding, str errors="strict")
 
Optional[str] _codec (self, _Encoding charset)
 

Detailed Description

A class for detecting the encoding of a bytestring containing an
HTML or XML document, and decoding it to Unicode. If the source
encoding is windows-1252, `UnicodeDammit` can also replace
Microsoft smart quotes with their HTML or XML equivalents.

:param markup: HTML or XML markup in an unknown encoding.

:param known_definite_encodings: When determining the encoding
    of ``markup``, these encodings will be tried first, in
    order. In HTML terms, this corresponds to the "known
    definite encoding" step defined in `section 13.2.3.1 of the HTML standard <https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding>`_.

:param user_encodings: These encodings will be tried after the
    ``known_definite_encodings`` have been tried and failed, and
    after an attempt to sniff the encoding by looking at a
    byte order mark has failed. In HTML terms, this
    corresponds to the step "user has explicitly instructed
    the user agent to override the document's character
    encoding", defined in `section 13.2.3.2 of the HTML standard <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>`_.

:param override_encodings: A **deprecated** alias for
    ``known_definite_encodings``. Any encodings here will be tried
    immediately after the encodings in
    ``known_definite_encodings``.

:param smart_quotes_to: By default, Microsoft smart quotes will,
   like all other characters, be converted to Unicode
   characters. Setting this to ``ascii`` will convert them to ASCII
   quotes instead.  Setting it to ``xml`` will convert them to XML
   entity references, and setting it to ``html`` will convert them
   to HTML entity references.

:param is_html: If True, ``markup`` is treated as an HTML
   document. Otherwise it's treated as an XML document.

:param exclude_encodings: These encodings will not be considered,
   even if the sniffing code thinks they might make sense.
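
The ordering described by the parameters above can be modelled with a short standard-library sketch. It is a simplified illustration, not the library's implementation: the helper names ``sniff_bom`` and ``candidate_encodings`` are hypothetical, and only the three steps named above (known definite encodings, byte order mark, user encodings) plus the exclusion list are covered.

```python
import codecs
from typing import Iterable, Iterator, Optional

# Byte order marks, checked longest-first so that a UTF-32 BOM is not
# mistaken for the UTF-16 BOM it starts with.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(markup: bytes) -> Optional[str]:
    """Return the encoding implied by a leading byte order mark, if any."""
    for bom, name in _BOMS:
        if markup.startswith(bom):
            return name
    return None


def candidate_encodings(
    markup: bytes,
    known_definite_encodings: Iterable[str] = (),
    user_encodings: Iterable[str] = (),
    exclude_encodings: Iterable[str] = (),
) -> Iterator[str]:
    """Yield encodings to try, in the order the parameter docs describe."""
    excluded = {name.lower() for name in exclude_encodings}
    ordered = list(known_definite_encodings)
    bom_encoding = sniff_bom(markup)
    if bom_encoding is not None:
        ordered.append(bom_encoding)
    ordered.extend(user_encodings)

    seen = set()
    for name in ordered:
        key = name.lower()
        if key in excluded or key in seen:
            continue  # skip excluded names and case-insensitive duplicates
        seen.add(key)
        yield name
```

Each candidate would then be handed to a decoding attempt, stopping at the first one that succeeds.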

Member Function Documentation

◆ _convert_from()

Optional[str] bs4.dammit.UnicodeDammit._convert_from (   self,
_Encoding  proposed,
str   errors = "strict" 
)
protected
Attempt to decode the markup from the proposed encoding into Unicode.

:param proposed: The name of a character encoding.
:param errors: An error handling strategy, used when calling `str`.
:return: The converted markup, or `None` if the proposed
   encoding/error handling strategy didn't work.
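
The contract described above (return the decoded text, or `None` when the encoding or error strategy does not work) can be sketched as follows. The function name ``try_decode`` is hypothetical, and the real method also records bookkeeping such as ``tried_encodings``, which this sketch omits.

```python
from typing import Optional


def try_decode(markup: bytes, proposed: str, errors: str = "strict") -> Optional[str]:
    """Decode ``markup`` from ``proposed``, or return None on failure."""
    try:
        # str(bytes, encoding, errors) is the decoding call the docs refer to.
        return str(markup, proposed, errors)
    except (UnicodeDecodeError, LookupError):
        # UnicodeDecodeError: the bytes are not valid in this encoding.
        # LookupError: Python has no codec (or error handler) by that name.
        return None
```

With ``errors="replace"``, undecodable bytes become U+FFFD instead of causing a failure, which is the situation the ``contains_replacement_characters`` attribute name points to.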

◆ _sub_ms_char()

bytes bs4.dammit.UnicodeDammit._sub_ms_char (   self,
re.Match  match 
)
protected
Changes an MS smart quote character to an XML or HTML
entity, or an ASCII character.

TODO: Since this is only used to convert smart quotes, it
could be simplified, and MS_CHARS_TO_ASCII made much less
parochial.
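
As an illustration of the substitution this method performs, the sketch below rewrites four windows-1252 quote bytes according to a ``smart_quotes_to`` value. The table ``SMART_QUOTES`` is a hypothetical four-entry excerpt written for this example; it does not reproduce the contents or layout of ``MS_CHARS`` or ``MS_CHARS_TO_ASCII``.

```python
import re

# Hypothetical excerpt: windows-1252 byte -> (HTML entity name, hex code point, ASCII).
SMART_QUOTES = {
    b"\x91": ("lsquo", "2018", "'"),
    b"\x92": ("rsquo", "2019", "'"),
    b"\x93": ("ldquo", "201C", '"'),
    b"\x94": ("rdquo", "201D", '"'),
}

_SMART_QUOTE_RE = re.compile(rb"[\x91-\x94]")


def replace_smart_quotes(data: bytes, smart_quotes_to: str) -> bytes:
    """Replace smart-quote bytes with ASCII, XML, or HTML equivalents."""

    def substitute(match: "re.Match[bytes]") -> bytes:
        name, hex_code, ascii_char = SMART_QUOTES[match.group(0)]
        if smart_quotes_to == "ascii":
            return ascii_char.encode("ascii")
        if smart_quotes_to == "xml":
            # XML has no named entities for these, so use a numeric reference.
            return b"&#x" + hex_code.encode("ascii") + b";"
        # Any other value is treated as "html" in this sketch.
        return b"&" + name.encode("ascii") + b";"

    return _SMART_QUOTE_RE.sub(substitute, data)
```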

◆ _to_unicode()

str bs4.dammit.UnicodeDammit._to_unicode (   self,
bytes  data,
_Encoding  encoding,
str   errors = "strict" 
)
protected
Given a bytestring and its encoding, decodes the string into Unicode.

:param data: The bytestring to decode.
:param encoding: The name of an encoding.
:param errors: An error handling strategy, used when calling `str`.

◆ declared_html_encoding()

Optional[_Encoding] bs4.dammit.UnicodeDammit.declared_html_encoding (   self)
If the markup is an HTML document, returns the encoding, if any,
declared *inside* the document.

◆ detwingle()

bytes bs4.dammit.UnicodeDammit.detwingle (   cls,
bytes  in_bytes,
_Encoding   main_encoding = "utf8",
_Encoding   embedded_encoding = "windows-1252" 
)
Fix characters from one encoding embedded in some other encoding.

Currently the only situation supported is Windows-1252 (or its
subset ISO-8859-1), embedded in UTF-8.

:param in_bytes: A bytestring that you suspect contains
    characters from multiple encodings. Note that this *must*
    be a bytestring. If you've already converted the document
    to Unicode, you're too late.
:param main_encoding: The primary encoding of ``in_bytes``.
:param embedded_encoding: The encoding that was used to embed characters
    in the main document.
:return: A bytestring similar to ``in_bytes``, in which
  ``embedded_encoding`` characters have been converted to
  their ``main_encoding`` equivalents.
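
The idea behind this method can be sketched with the standard library alone. The function below is a simplified illustration for the one supported case (windows-1252 bytes embedded in UTF-8), not the library's code: the name ``detwingle_sketch`` is hypothetical, and validating each candidate sequence with a strict decode is a choice of this sketch that may differ from what bs4 does.

```python
# Same shape as the class attribute documented on this page:
# (first lead byte, last lead byte, sequence length in bytes).
MULTIBYTE_MARKERS_AND_SIZES = [
    (0xC2, 0xDF, 2),
    (0xE0, 0xEF, 3),
    (0xF0, 0xF4, 4),
]


def _is_valid_utf8(chunk: bytes) -> bool:
    try:
        chunk.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def detwingle_sketch(in_bytes: bytes) -> bytes:
    """Re-encode stray windows-1252 bytes inside UTF-8 data as UTF-8."""
    out = bytearray()
    pos = 0
    while pos < len(in_bytes):
        byte = in_bytes[pos]
        if byte < 0x80:
            # ASCII is identical in both encodings.
            out.append(byte)
            pos += 1
            continue
        # Look up how long a UTF-8 sequence starting with this byte would be.
        size = next(
            (n for low, high, n in MULTIBYTE_MARKERS_AND_SIZES if low <= byte <= high),
            1,
        )
        chunk = in_bytes[pos:pos + size]
        if size > 1 and len(chunk) == size and _is_valid_utf8(chunk):
            out += chunk  # a well-formed UTF-8 character: keep as-is
            pos += size
        else:
            # Treat the byte as windows-1252 and re-encode it as UTF-8.
            out += bytes([byte]).decode("windows-1252", "replace").encode("utf-8")
            pos += 1
    return bytes(out)
```

The result can then be decoded as UTF-8 in one pass, which is why the input must still be a bytestring when this step runs.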

◆ find_codec()

Optional[str] bs4.dammit.UnicodeDammit.find_codec (   self,
_Encoding  charset 
)
Look up the Python codec corresponding to a given character set.

:param charset: The name of a character set.
:return: The name of a Python codec.
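
A charset-to-codec lookup of this kind can be approximated with ``codecs.lookup``. The sketch below is an assumption-laden illustration: ``lookup_codec`` is a hypothetical name, the alias table copies the two ``CHARSET_ALIASES`` entries shown on this page, and returning `None` for unknown names is a choice of this sketch rather than documented behavior of ``find_codec``.

```python
import codecs
from typing import Optional

CHARSET_ALIASES = {
    "macintosh": "mac-roman",
    "x-sjis": "shift-jis",
}


def lookup_codec(charset: str) -> Optional[str]:
    """Return a lowercased name Python can resolve to a codec, else None."""
    candidates = [
        CHARSET_ALIASES.get(charset.lower(), charset),
        charset.replace("-", ""),
        charset.replace("-", "_"),
    ]
    for name in candidates:
        try:
            codecs.lookup(name)  # raises LookupError for unknown codecs
        except LookupError:
            continue
        return name.lower()
    return None
```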

◆ numeric_character_reference()

Tuple[str, bool] bs4.dammit.UnicodeDammit.numeric_character_reference (   cls,
int  numeric 
)
This (mostly) implements the algorithm described in "Numeric character
reference end state" from the HTML spec:
https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state

The algorithm is designed to convert numeric character references like "&#9731;"
to Unicode characters like "☃".

:return: A 2-tuple (character, replaced). `character` is the Unicode
character corresponding to the numeric reference and `replaced` is
whether or not an unresolvable character was replaced with REPLACEMENT
CHARACTER.
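
The shape of the return value can be shown with a deliberately reduced sketch. It handles only three replacement cases from the HTML standard's end state (zero, values above U+10FFFF, and surrogates) and ignores the rest of the algorithm, such as the remapping of C1 control codes; ``numeric_reference_sketch`` is a hypothetical name.

```python
from typing import Tuple

REPLACEMENT_CHARACTER = "\ufffd"


def numeric_reference_sketch(numeric: int) -> Tuple[str, bool]:
    """Return (character, replaced) for a numeric character reference."""
    out_of_range = numeric <= 0 or numeric > 0x10FFFF
    is_surrogate = 0xD800 <= numeric <= 0xDFFF
    if out_of_range or is_surrogate:
        # Unresolvable reference: substitute U+FFFD and report it.
        return REPLACEMENT_CHARACTER, True
    return chr(numeric), False
```

For example, ``numeric_reference_sketch(9731)`` yields the snowman character with ``replaced`` set to ``False``, while ``numeric_reference_sketch(0)`` yields U+FFFD with ``replaced`` set to ``True``.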

Member Data Documentation

◆ CHARSET_ALIASES

dict bs4.dammit.UnicodeDammit.CHARSET_ALIASES
static
Initial value:
= {
"macintosh": "mac-roman",
"x-sjis": "shift-jis",
}

◆ ENCODINGS_WITH_SMART_QUOTES

list bs4.dammit.UnicodeDammit.ENCODINGS_WITH_SMART_QUOTES
static
Initial value:
= [
"windows-1252",
"iso-8859-1",
"iso-8859-2",
]

◆ ENUMERATED_NONCHARACTERS

Set bs4.dammit.UnicodeDammit.ENUMERATED_NONCHARACTERS
static
Initial value:
= set([0xfffe, 0xffff,
0x1fffe, 0x1ffff,
0x2fffe, 0x2ffff,
0x3fffe, 0x3ffff,
0x4fffe, 0x4ffff,
0x5fffe, 0x5ffff,
0x6fffe, 0x6ffff,
0x7fffe, 0x7ffff,
0x8fffe, 0x8ffff,
0x9fffe, 0x9ffff,
0xafffe, 0xaffff,
0xbfffe, 0xbffff,
0xcfffe, 0xcffff,
0xdfffe, 0xdffff,
0xefffe, 0xeffff,
0xffffe, 0xfffff,
0x10fffe, 0x10ffff])
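
The values listed above follow a regular pattern: they are the last two code points of each of the 17 Unicode planes. A short sketch that regenerates the same set (the variable name ``generated_noncharacters`` is illustrative only):

```python
# Planes 0 through 16; each plane spans 0x10000 code points, and its
# final two code points (xFFFE and xFFFF) are noncharacters.
generated_noncharacters = {
    (plane << 16) | low
    for plane in range(17)
    for low in (0xFFFE, 0xFFFF)
}
```

The HTML standard also classifies U+FDD0 through U+FDEF as noncharacters; that contiguous range does not appear in this set, so it is presumably checked separately.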

◆ MULTIBYTE_MARKERS_AND_SIZES

list bs4.dammit.UnicodeDammit.MULTIBYTE_MARKERS_AND_SIZES
static
Initial value:
= [
(0xC2, 0xDF, 2), # 2-byte characters start with a byte C2-DF
(0xE0, 0xEF, 3), # 3-byte characters start with E0-EF
(0xF0, 0xF4, 4), # 4-byte characters start with F0-F4
]

The documentation for this class was generated from the following file: