| | __init__ (self, bytes markup, Optional[_Encodings] known_definite_encodings=None, Optional[bool] is_html=False, Optional[_Encodings] exclude_encodings=None, Optional[_Encodings] user_encodings=None, Optional[_Encodings] override_encodings=None) |
| Iterator[_Encoding] | encodings (self) |
| Tuple[bytes, Optional[_Encoding]] | strip_byte_order_mark (cls, bytes data) |
| Optional[_Encoding] | find_declared_encoding (cls, Union[bytes, str] markup, bool is_html=False, bool search_entire_document=False) |
This class is capable of guessing a number of possible encodings
for a bytestring.
Order of precedence:
1. Encodings you specifically tell EncodingDetector to try first
(the ``known_definite_encodings`` argument to the constructor).
2. An encoding determined by sniffing the document's byte-order mark.
3. Encodings you specifically tell EncodingDetector to try if
byte-order mark sniffing fails (the ``user_encodings`` argument to the
constructor).
4. An encoding declared within the bytestring itself, either in an
XML declaration (if the bytestring is to be interpreted as an XML
document), or in a <meta> tag (if the bytestring is to be
interpreted as an HTML document).
5. An encoding detected through textual analysis by chardet,
cchardet, or a similar external library.
6. UTF-8.
7. Windows-1252.
:param markup: Some markup in an unknown encoding.
:param known_definite_encodings: When determining the encoding
of ``markup``, these encodings will be tried first, in
order. In HTML terms, this corresponds to the "known
definite encoding" step defined in `section 13.2.3.1 of the HTML standard <https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding>`_.
:param user_encodings: These encodings will be tried after the
``known_definite_encodings`` have been tried and failed, and
after an attempt to sniff the encoding by looking at a
byte order mark has failed. In HTML terms, this
corresponds to the step "user has explicitly instructed
the user agent to override the document's character
encoding", defined in `section 13.2.3.2 of the HTML standard <https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding>`_.
:param override_encodings: A **deprecated** alias for
``known_definite_encodings``. Any encodings here will be tried
immediately after the encodings in
``known_definite_encodings``.
:param is_html: If True, this markup is considered to be
HTML. Otherwise it's assumed to be XML.
:param exclude_encodings: These encodings will not be tried,
even if they otherwise would be.
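
The order of precedence above can be illustrated with a small standard-library sketch. This is not the bs4 implementation: ``sniff_bom`` and ``candidate_encodings`` are hypothetical helpers, the declared encoding is passed in by the caller rather than parsed from the markup, and step 5 (textual analysis by an external library) is left out.

```python
import codecs
from typing import Iterator, Optional, Sequence, Tuple

# Hypothetical BOM table. UTF-32 comes before UTF-16 because the
# UTF-32-LE mark starts with the same two bytes as the UTF-16-LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes) -> Tuple[bytes, Optional[str]]:
    """Return (data without its BOM, encoding implied by the BOM or None)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return data[len(bom):], name
    return data, None


def candidate_encodings(
    markup: bytes,
    known_definite: Sequence[str] = (),
    user: Sequence[str] = (),
    declared: Optional[str] = None,
    exclude: Sequence[str] = (),
) -> Iterator[str]:
    """Yield lowercase encoding names in the documented order of precedence."""
    excluded = {name.lower() for name in exclude}
    tried = set()
    _, bom_encoding = sniff_bom(markup)

    ordered = [
        *known_definite,   # 1. encodings to try first
        bom_encoding,      # 2. byte-order mark
        *user,             # 3. encodings to try if BOM sniffing fails
        declared,          # 4. encoding declared in the document
        # 5. external detector (chardet or similar) omitted in this sketch
        "utf-8",           # 6.
        "windows-1252",    # 7.
    ]
    for name in ordered:
        if name is None:
            continue
        key = name.lower()
        # Skip excluded encodings and anything already yielded.
        if key in excluded or key in tried:
            continue
        tried.add(key)
        yield key
```

A caller would typically try each yielded name in turn and stop at the first one that decodes the markup without errors.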
Optional[_Encoding] bs4.dammit.EncodingDetector.find_declared_encoding (cls, Union[bytes, str] markup, bool is_html=False, bool search_entire_document=False)
Given a document, tries to find an encoding declared within the
text of the document itself.
An XML encoding is declared at the beginning of the document.
An HTML encoding is declared in a <meta> tag, hopefully near the
beginning of the document.
:param markup: Some markup.
:param is_html: If True, this markup is considered to be HTML. Otherwise
it's assumed to be XML.
:param search_entire_document: Since an encoding is supposed
to be declared near the beginning of the document, most of
the time it's only necessary to search a few kilobytes of
data. Set this to True to force this method to search the
entire document.
:return: The declared encoding, if one is found.
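
The search described above can be approximated with two regular expressions. The sketch below is a simplified illustration, not the bs4 implementation: the ``declared_encoding`` name, the patterns, and the ``prefix_limit`` parameter are assumptions made for this example, and the choice to check for an XML declaration first and consult the <meta> tag only when ``is_html`` is true is likewise a design decision of the sketch.

```python
import re
from typing import Optional, Union

# Hypothetical patterns for this sketch.
XML_DECL = re.compile(
    rb'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([^"\']+)["\']', re.I
)
META_CHARSET = re.compile(
    rb'<\s*meta[^>]+charset\s*=\s*["\']?([^>"\'\s/]+)', re.I
)


def declared_encoding(
    markup: Union[bytes, str],
    is_html: bool = False,
    search_entire_document: bool = False,
    prefix_limit: int = 1024,  # assumed size of the "beginning" of a document
) -> Optional[str]:
    """Return the lowercase encoding name declared in the markup, or None."""
    if isinstance(markup, str):
        # Encoding names are ASCII, so any ASCII-compatible codec works here.
        markup = markup.encode("utf-8", "replace")

    # Only look at the start of the document unless told otherwise.
    window = markup if search_entire_document else markup[:prefix_limit]

    match = XML_DECL.search(window)
    if match is None and is_html:
        match = META_CHARSET.search(window)
    if match is None:
        return None
    return match.group(1).decode("ascii", "replace").lower()
```

With ``search_entire_document`` left at False, a declaration that appears after the first ``prefix_limit`` bytes is not found, which mirrors the trade-off described for the real parameter.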