\HTMLPurifier_Encoder

A UTF-8-specific character encoder that handles cleaning and transforming strings.

Summary

Public methods: muteErrorHandler(), unsafeIconv(), iconv(), cleanUTF8(), unichr(), iconvAvailable(), convertToUTF8(), convertFromUTF8(), convertToASCIIDumbLossless(), testIconvTruncateBug(), testEncodingSupportsASCII()
Public properties: none
Constants: ICONV_OK, ICONV_TRUNCATES, ICONV_UNUSABLE

Protected methods: none
Protected properties: none

Private methods: __construct()
Private properties: none

Constants

ICONV_OK

ICONV_OK = 0

No bugs detected in iconv.

ICONV_TRUNCATES

ICONV_TRUNCATES = 1

Iconv truncates output when converting from UTF-8 to another character set with //IGNORE and a non-encodable character is encountered.

ICONV_UNUSABLE

ICONV_UNUSABLE = 2

Iconv does not support //IGNORE, making it unusable for transcoding purposes.

Methods

muteErrorHandler()

muteErrorHandler() : mixed

Error handler that mutes errors; an alternative to the shut-up operator (@).

Returns

mixed —
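To illustrate the idea of muting errors with a handler rather than the @ operator, here is a minimal sketch. The names mute_error_handler() and call_muted() are hypothetical helpers written for this example and are not part of the library; only set_error_handler(), restore_error_handler() and trigger_error() are standard PHP.

```php
<?php
// Hypothetical stand-in for a muting handler: it accepts the error
// details and deliberately does nothing with them.
function mute_error_handler(int $errno, string $errstr): bool
{
    return true; // true tells PHP the error has been handled
}

// Hypothetical helper: run $fn with errors muted, then restore the
// previous handler even if $fn throws.
function call_muted(callable $fn)
{
    set_error_handler('mute_error_handler');
    try {
        return $fn();
    } finally {
        restore_error_handler();
    }
}

// The warning raised inside the callback is swallowed by the handler.
$result = call_muted(function () {
    trigger_error('this warning is swallowed', E_USER_WARNING);
    return 'done';
});
```

Installing and restoring a handler around a single call keeps the muting scoped to that call, which is the main practical difference from changing the error reporting level globally.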

unsafeIconv()

unsafeIconv(string  $in, string  $out, string  $text) : string

iconv wrapper which mutes errors, but doesn't work around bugs.

Parameters

string $in

Input encoding

string $out

Output encoding

string $text

The text to convert

Returns

string —

iconv()

iconv(string  $in, string  $out, string  $text, int  $max_chunk_size = 8000) : string

iconv wrapper which mutes errors and works around bugs.

Parameters

string $in

Input encoding

string $out

Output encoding

string $text

The text to convert

int $max_chunk_size

Returns

string —
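The role of $max_chunk_size can be pictured as splitting the input into pieces that are converted one at a time. The sketch below assumes UTF-8 input and shows only the splitting step; split_utf8_chunks() is a hypothetical helper for illustration and does not reproduce the library's internal code.

```php
<?php
// Hypothetical helper: split a UTF-8 string into pieces of at most
// $maxChunkSize bytes without cutting a multi-byte character in half.
function split_utf8_chunks(string $text, int $maxChunkSize = 8000): array
{
    if ($maxChunkSize < 4) {
        // A UTF-8 character can be up to 4 bytes long.
        throw new InvalidArgumentException('Chunk size must be at least 4 bytes.');
    }
    $chunks = [];
    $length = strlen($text);
    $offset = 0;
    while ($offset < $length) {
        $end = min($offset + $maxChunkSize, $length);
        // Back up while the byte at the cut point is a continuation byte
        // (10xxxxxx), so the next chunk starts on a character boundary.
        while ($end < $length && $end > $offset && (ord($text[$end]) & 0xC0) === 0x80) {
            $end--;
        }
        if ($end === $offset) {
            // Malformed input made of continuation bytes only: cut anyway
            // so the loop always makes progress.
            $end = min($offset + $maxChunkSize, $length);
        }
        $chunks[] = substr($text, $offset, $end - $offset);
        $offset = $end;
    }
    return $chunks;
}

$text = str_repeat("\xC3\xA9", 5);     // five two-byte characters, 10 bytes
$chunks = split_utf8_chunks($text, 5); // no chunk exceeds 5 bytes
```

Each chunk could then be passed to a converter separately and the results concatenated, so that no single conversion call sees more than $maxChunkSize bytes.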

cleanUTF8()

cleanUTF8(string  $str, bool  $force_php = false) : string

Cleans a UTF-8 string for well-formedness and SGML validity.

It parses the input as UTF-8 and returns a valid UTF-8 string, with non-SGML codepoints excluded.

Specifically, it permits: \x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}

Source: https://www.w3.org/TR/REC-xml/#NT-Char

Arguably this function should be modernized to the HTML5 set of allowed characters (https://www.w3.org/TR/html5/syntax.html#preprocessing-the-input-stream), which simultaneously expands and restricts the set of allowed characters.

Parameters

string $str

The string to clean

bool $force_php

Returns

string —
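The permitted ranges can be expressed directly as a PCRE character class. The following sketch filters a string that is already well-formed UTF-8; strip_disallowed_codepoints() is a hypothetical name, and unlike cleanUTF8() it does not repair malformed byte sequences.

```php
<?php
// Hypothetical helper: remove every codepoint outside the permitted set.
// Assumes the input is well-formed UTF-8; with the /u modifier,
// preg_replace() returns null for malformed input, in which case this
// sketch gives up and returns an empty string.
function strip_disallowed_codepoints(string $str): string
{
    $cleaned = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $str
    );
    return $cleaned === null ? '' : $cleaned;
}

// NUL (U+0000) and DEL (U+007F) are dropped; tab and U+00E9 are kept.
$cleaned = strip_disallowed_codepoints("a\x00b\x7Fc\td\xC3\xA9");
```

Note that the C1 control range U+0080 to U+009F falls outside the permitted set as well, so those codepoints are removed by the same pattern.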

unichr()

unichr(mixed  $code) : mixed

Translates a Unicode codepoint into its corresponding UTF-8 character.

Parameters

mixed $code

Returns

mixed —
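As an illustration of the codepoint-to-UTF-8 translation, here is a minimal standalone sketch of the standard encoding rules. codepoint_to_utf8() is a hypothetical function for this example; returning an empty string for out-of-range values and surrogates is an assumption of the sketch, not a documented guarantee of unichr().

```php
<?php
// Hypothetical helper: encode a Unicode codepoint as a UTF-8 byte string.
function codepoint_to_utf8(int $code): string
{
    // Reject values outside the Unicode range and UTF-16 surrogates.
    if ($code < 0 || $code > 0x10FFFF || ($code >= 0xD800 && $code <= 0xDFFF)) {
        return '';
    }
    if ($code < 0x80) {
        // 1 byte: 0xxxxxxx
        return chr($code);
    }
    if ($code < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($code >> 6))
            . chr(0x80 | ($code & 0x3F));
    }
    if ($code < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($code >> 12))
            . chr(0x80 | (($code >> 6) & 0x3F))
            . chr(0x80 | ($code & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($code >> 18))
        . chr(0x80 | (($code >> 12) & 0x3F))
        . chr(0x80 | (($code >> 6) & 0x3F))
        . chr(0x80 | ($code & 0x3F));
}

$euro = codepoint_to_utf8(0x20AC); // bytes E2 82 AC
```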

iconvAvailable()

iconvAvailable() : bool

Returns

bool —

convertToUTF8()

convertToUTF8(string  $str, \HTMLPurifier_Config  $config, \HTMLPurifier_Context  $context) : string

Convert a string to UTF-8 based on configuration.

Parameters

string $str

The string to convert

\HTMLPurifier_Config $config
\HTMLPurifier_Context $context

Returns

string —

convertFromUTF8()

convertFromUTF8(string  $str, \HTMLPurifier_Config  $config, \HTMLPurifier_Context  $context) : string

Converts a string from UTF-8 based on configuration.

Parameters

string $str

The string to convert

\HTMLPurifier_Config $config
\HTMLPurifier_Context $context

Returns

string —

convertToASCIIDumbLossless()

convertToASCIIDumbLossless(string  $str) : string

Lossless (character-wise) conversion of HTML to ASCII.

Parameters

string $str

UTF-8 string to be converted to ASCII

Returns

string —

ASCII-encoded string with non-ASCII characters entity-ized.
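A rough sketch of the entity-izing idea: every non-ASCII character is replaced by a decimal numeric character reference, so the result is pure ASCII and nothing is lost. Both functions below are hypothetical helpers that assume well-formed UTF-8 input; they are not the library's implementation.

```php
<?php
// Hypothetical helper: decode one well-formed UTF-8 character into its
// codepoint.
function utf8_char_to_codepoint(string $char): int
{
    $bytes = array_values(unpack('C*', $char));
    $n = count($bytes);
    if ($n === 1) {
        return $bytes[0];
    }
    // Mask off the length marker bits of the lead byte, then append the
    // low 6 bits of each continuation byte.
    $code = $bytes[0] & (0xFF >> ($n + 1));
    for ($i = 1; $i < $n; $i++) {
        $code = ($code << 6) | ($bytes[$i] & 0x3F);
    }
    return $code;
}

// Hypothetical helper: replace each non-ASCII character with &#NNN;.
function entityize_non_ascii(string $str): string
{
    $result = preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function (array $m): string {
            return '&#' . utf8_char_to_codepoint($m[0]) . ';';
        },
        $str
    );
    if ($result === null) {
        throw new InvalidArgumentException('Input is not well-formed UTF-8.');
    }
    return $result;
}

$ascii = entityize_non_ascii("caf\xC3\xA9 \xE2\x82\xAC");
```

Because the references are resolved again by any HTML parser, the conversion is lossless at the character level even though the byte representation changes.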

testIconvTruncateBug()

testIconvTruncateBug() : int

glibc iconv has a known bug where it doesn't handle the magic //IGNORE stanza correctly. In particular, rather than ignore characters, it will return an EILSEQ after consuming some number of characters, and expect you to restart iconv as if it were an E2BIG. Old versions of PHP did not respect the errno, and returned the fragment, so as a result you would see iconv mysteriously truncating output. We can work around this by manually chopping our input into segments of about 8000 characters, as long as PHP ignores the error code. If PHP starts paying attention to the error code, iconv becomes unusable.

Returns

int —

Error code indicating severity of bug.
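A probe for this behaviour can be sketched as follows: convert one character that has no ASCII equivalent followed by a long run of ASCII, then inspect the result. The function name, the constant names and the 9000-character probe length are assumptions made for this example, and the outcome depends on the platform's iconv implementation.

```php
<?php
// Hypothetical constants mirroring the three outcomes described above.
const PROBE_OK = 0;
const PROBE_TRUNCATES = 1;
const PROBE_UNUSABLE = 2;

// Hypothetical probe: classify how the local iconv handles //IGNORE.
function probe_iconv_truncation(): int
{
    if (!function_exists('iconv')) {
        return PROBE_UNUSABLE;
    }
    // U+03B1 (Greek alpha) cannot be encoded in ASCII and should simply
    // be ignored, leaving 9000 'a' characters.
    $input = "\xCE\xB1" . str_repeat('a', 9000);

    // Mute any notice or warning with a temporary handler.
    set_error_handler(function () {
        return true;
    });
    try {
        $result = iconv('utf-8', 'ascii//IGNORE', $input);
    } finally {
        restore_error_handler();
    }

    if ($result === false) {
        return PROBE_UNUSABLE;  // the error code was honoured
    }
    if (strlen($result) < 9000) {
        return PROBE_TRUNCATES; // a truncated fragment came back
    }
    return PROBE_OK;
}

$status = probe_iconv_truncation();
```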

testEncodingSupportsASCII()

testEncodingSupportsASCII(string  $encoding, bool  $bypass = false) : array

This expensive function tests whether or not a given character encoding supports ASCII. 7/8-bit encodings like Shift_JIS will fail this test, and require special processing. Variable width encodings shouldn't ever fail.

Parameters

string $encoding

Encoding name to test, as per iconv format

bool $bypass

Whether or not to bypass the precompiled arrays.

Returns

array —

Map of UTF-8 characters to their corresponding ASCII characters, which can be used to "undo" any overzealous iconv action.
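To show how such a map might be applied, the sketch below uses strtr() with an illustrative two-entry map for a Shift_JIS-like encoding, where the bytes for backslash and tilde decode to the yen sign (U+00A5) and overline (U+203E). The map contents and variable names are assumptions for this example; the array actually returned depends on the encoding and on the local iconv.

```php
<?php
// Illustrative map (an assumption, not the real return value):
// UTF-8 character => the ASCII character it should have been.
$asciiFix = [
    "\xC2\xA5"     => '\\', // U+00A5 YEN SIGN -> backslash
    "\xE2\x80\xBE" => '~',  // U+203E OVERLINE -> tilde
];

// A string as it might look after an overzealous conversion to UTF-8.
$decoded = 'C:' . "\xC2\xA5" . 'temp' . "\xC2\xA5" . 'file' . "\xE2\x80\xBE" . '1';

// strtr() with an array replaces every key with its value in one pass.
$restored = strtr($decoded, $asciiFix);
```

Flipping the map with array_flip() gives the reverse substitution, which is useful before converting in the other direction.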

__construct()

__construct() : mixed

The constructor throws a fatal error if you attempt to instantiate the class.

Returns

mixed —