UTF-8 HTML Entities

PHP's htmlentities() already converts UTF-8 umlauts to entities correctly, but it produces the named entities &auml;, &szlig;, etc., which have to be declared before they can be used in ordinary XML files. htmlspecialchars() converts only the markup characters to entities and leaves umlauts untouched. In addition, when the input is not treated as UTF-8, strings are converted octet by octet instead of character by character (ä becomes &Atilde;&curren; instead of &auml;).
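The difference between the built-ins can be seen in a small sketch. The variable names are illustrative only, and the charset is passed explicitly because the default differs between PHP versions.

```php
<?php
// "ä" encoded as UTF-8 is the two bytes 0xC3 0xA4.
$umlaut = "\xC3\xA4";

// Named entity: fine in HTML, but undefined in plain XML unless declared.
$named = htmlentities($umlaut, ENT_COMPAT, 'UTF-8');

// Treated as ISO-8859-1, each octet is converted on its own.
$perOctet = htmlentities($umlaut, ENT_COMPAT, 'ISO-8859-1');

// htmlspecialchars() only touches &, <, > and quotes; the umlaut stays raw.
$special = htmlspecialchars($umlaut . ' < b', ENT_COMPAT, 'UTF-8');

echo $named, "\n";     // &auml;
echo $perOctet, "\n";  // &Atilde;&curren;
echo $special, "\n";   // ä &lt; b
```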

This function recognizes the individual UTF-8 byte sequences and converts each one to a hexadecimal numeric character reference. Before that, any HTML entities already present in the input are decoded. Finally, the function checks whether the resulting string has to be enclosed in a CDATA section.
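The bit arithmetic behind that conversion can be sketched on its own. The helpers utf8_codepoint() and hex_entity() are hypothetical names used only for this illustration; they assume a single well-formed UTF-8 sequence and do no validation.

```php
<?php
// Hypothetical helper: decode ONE well-formed UTF-8 sequence to its code point.
function utf8_codepoint(string $bytes): int
{
    $b = array_values(unpack('C*', $bytes)); // byte values as a 0-based array
    $n = count($b);
    if ($n === 1) {
        return $b[0];                         // plain ASCII
    }
    if ($n === 2) {
        // 5 payload bits from the lead byte, 6 from the continuation byte
        return (($b[0] & 0x1f) << 6) | ($b[1] & 0x3f);
    }
    if ($n === 3) {
        // 4 payload bits from the lead byte, 6 from each continuation byte
        return (($b[0] & 0x0f) << 12) | (($b[1] & 0x3f) << 6) | ($b[2] & 0x3f);
    }
    // 3 payload bits from the lead byte, 6 from each continuation byte
    return (($b[0] & 0x07) << 18) | (($b[1] & 0x3f) << 12)
         | (($b[2] & 0x3f) << 6) | ($b[3] & 0x3f);
}

// Hypothetical helper: format a code point as a hexadecimal character reference.
function hex_entity(int $codepoint): string
{
    return '&#x' . dechex($codepoint) . ';';
}

echo hex_entity(utf8_codepoint("\xC3\xA4")), "\n";         // ä (2 bytes): &#xe4;
echo hex_entity(utf8_codepoint("\xE2\x82\xAC")), "\n";     // € (3 bytes): &#x20ac;
echo hex_entity(utf8_codepoint("\xF0\x9F\x98\x80")), "\n"; // U+1F600 (4 bytes): &#x1f600;
```

The partial values are combined with a bitwise OR: each shifted group occupies its own bit range, so OR-ing them assembles the code point.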

utf8entitydecode.inc.php
function translate($txt) {
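    // Decode existing entities first; the charset is given explicitly so that
    // named entities become UTF-8 bytes regardless of the PHP version's default.
    $txt = html_entity_decode($txt, ENT_COMPAT, 'UTF-8');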
    $txt2 = '';
    for ($i=0;$i<strlen($txt);$i++) {
        // square brackets: the $txt{$i} offset syntax was removed in PHP 8
        $o = ord($txt[$i]);
        if ($o<128) {
            // 0..127: raw
            $txt2 .= $txt[$i];
        } else {
            // look ahead at up to three following bytes (0 if past the end)
            $o1 = 0;
            $o2 = 0;
            $o3 = 0;
            if ($i<strlen($txt)-1) $o1 = ord($txt[$i+1]);
            if ($i<strlen($txt)-2) $o2 = ord($txt[$i+2]);
            if ($i<strlen($txt)-3) $o3 = ord($txt[$i+3]);
 
            $hexval = 0;
            if ($o>=0xc0 && $o<0xc2) {
                // INVALID --- should never occur: 2-byte UTF-8 although value < 128
                $hexval = $o1;
                $i++;
            } elseif ($o>=0xc2 && $o<0xe0 && $o1>=0x80) {
                // 194..223: 2-byte UTF-8
                // the parts are OR-ed together; AND-ing into 0 would always yield 0
                $hexval |= ($o  & 0x1f) << 6;   // five bits from the 1st byte
                $hexval |= ($o1 & 0x3f);        // six bits from the 2nd byte
                $i++;
            } elseif ($o>=0xe0 && $o<0xf0 && $o1>=0x80 && $o2>=0x80) {
                // 224..239: 3-byte UTF-8
                $hexval |= ($o  & 0x0f) << 12;  // four bits from the 1st byte
                $hexval |= ($o1 & 0x3f) << 6;   // six bits each from the 2nd and 3rd byte
                $hexval |= ($o2 & 0x3f);
                $i += 2;
            } elseif ($o>=0xf0 && $o<=0xf4 && $o1>=0x80 && $o2>=0x80 && $o3>=0x80) {
                // 240..244: 4-byte UTF-8
                $hexval |= ($o  & 0x07) << 18;  // three bits from the 1st byte
                $hexval |= ($o1 & 0x3f) << 12;  // six bits each from the 2nd to 4th byte
                $hexval |= ($o2 & 0x3f) << 6;
                $hexval |= ($o3 & 0x3f);
                $i += 3;
            } else {
                // don't know ... just encode
                $hexval = $o;
            }
            $hexstring = dechex($hexval);
            if (strlen($hexstring)%2) $hexstring = '0' . $hexstring;
            $txt2 .= '&#x' . $hexstring . ';';
        }
    }
    $txt = $txt2;
    $result = '';
    // wrap in CDATA if the text contains markup characters
    $iscdata = (preg_match('/[<&]/', $txt) > 0);
    if ($iscdata) $result .= '<![CDATA[';
    $result .= $txt;   // pure ASCII at this point, so no re-encoding is needed
    if ($iscdata) $result .= ']]>';
    return $result;
}

 
snippets/php/utf8entities.txt · Last modified: 2010-01-15 14:37.36 by mbirth
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported