PHP's htmlentities() already converts UTF-8 umlauts to entities correctly, but it produces the usual named entities (&auml;, &szlig;, etc.), which must be declared before they can be used in ordinary XML files. htmlspecialchars() converts the markup characters to ordinary entities but ignores umlauts, and UTF-8 strings get converted octet by octet instead of as whole characters (the two bytes 0xC3 0xA4 of ä are each converted separately instead of yielding a single entity for ä).
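To make the difference concrete, here is a small sketch of what the two built-in functions return for an umlaut. The variable names are only illustrative, and the charset is passed explicitly as 'UTF-8' (an assumption about the input encoding).

```php
<?php
// "\u{e4}" is the letter ä, written as an escape so the file encoding does not matter.
$input = "\u{e4} & <";

// htmlentities() yields a named entity for the umlaut; XML does not
// predefine &auml;, so it would need a DTD declaration there.
$named = htmlentities($input, ENT_QUOTES, 'UTF-8');

// htmlspecialchars() only escapes the markup characters and leaves
// the umlaut untouched.
$special = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

echo $named, "\n";   // &auml; &amp; &lt;
echo $special, "\n"; // ä &amp; &lt;
```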
The function below recognizes each UTF-8 byte group and converts it to a plain hexadecimal numeric entity. Before doing so, it decodes any HTML entities already present in the input. Finally, it checks whether the resulting string has to be enclosed in a CDATA section.
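As a worked example of the byte-group step, the following sketch decodes one 2-byte and one 3-byte UTF-8 sequence by hand. The sample bytes (ä and the euro sign) and the variable names are chosen purely for illustration.

```php
<?php
// 2-byte sequence: ä is encoded as 0xC3 0xA4.
// Lead byte 110xxxxx contributes 5 bits, continuation byte 10xxxxxx contributes 6 bits.
$b = [0xC3, 0xA4];
$cp2 = (($b[0] & 0x1f) << 6) | ($b[1] & 0x3f);

// 3-byte sequence: the euro sign is encoded as 0xE2 0x82 0xAC.
// Lead byte 1110xxxx contributes 4 bits, each continuation byte 6 bits.
$e = [0xE2, 0x82, 0xAC];
$cp3 = (($e[0] & 0x0f) << 12) | (($e[1] & 0x3f) << 6) | ($e[2] & 0x3f);

$entity2 = '&#x' . dechex($cp2) . ';';
$entity3 = '&#x' . dechex($cp3) . ';';

echo $entity2, "\n"; // &#xe4;
echo $entity3, "\n"; // &#x20ac;
```

Note that the pieces are combined with a bitwise OR; each masked byte fills its own slice of the code point.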
function translate($txt) {
    // Decode existing HTML entities first so they are not encoded twice
    $txt = html_entity_decode($txt, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    $len = strlen($txt);
    $txt2 = '';
    for ($i = 0; $i < $len; $i++) {
        $o = ord($txt[$i]);
        if ($o < 128) {
            // 0..127: plain ASCII, copy as is
            $txt2 .= $txt[$i];
            continue;
        }
        // Look ahead at up to three following bytes (0 if past the end)
        $o1 = ($i + 1 < $len) ? ord($txt[$i + 1]) : 0;
        $o2 = ($i + 2 < $len) ? ord($txt[$i + 2]) : 0;
        $o3 = ($i + 3 < $len) ? ord($txt[$i + 3]) : 0;
        $hexval = 0;
        if ($o >= 0xc0 && $o < 0xc2) {
            // INVALID, should never occur: 2-byte UTF-8 although value < 128
            $hexval = $o1;
            $i++;
        } elseif ($o >= 0xc2 && $o < 0xe0 && $o1 >= 0x80) {
            // 194..223: 2-byte UTF-8
            $hexval |= ($o & 0x1f) << 6;   // 1st byte: five bits
            $hexval |= ($o1 & 0x3f);       // 2nd byte: six bits
            $i++;
        } elseif ($o >= 0xe0 && $o < 0xf0 && $o1 >= 0x80 && $o2 >= 0x80) {
            // 224..239: 3-byte UTF-8
            $hexval |= ($o & 0x0f) << 12;  // 1st byte: four bits
            $hexval |= ($o1 & 0x3f) << 6;  // 2nd and 3rd byte: six bits each
            $hexval |= ($o2 & 0x3f);
            $i += 2;
        } elseif ($o >= 0xf0 && $o <= 0xf4 && $o1 >= 0x80 && $o2 >= 0x80 && $o3 >= 0x80) {
            // 240..244: 4-byte UTF-8
            $hexval |= ($o & 0x07) << 18;  // 1st byte: three bits
            $hexval |= ($o1 & 0x3f) << 12; // 2nd to 4th byte: six bits each
            $hexval |= ($o2 & 0x3f) << 6;
            $hexval |= ($o3 & 0x3f);
            $i += 3;
        } else {
            // don't know ... just encode the single byte
            $hexval = $o;
        }
        // Pad the hex string to an even number of digits
        $hexstring = dechex($hexval);
        if (strlen($hexstring) % 2) {
            $hexstring = '0' . $hexstring;
        }
        $txt2 .= '&#x' . $hexstring . ';';
    }
    // Enclose the result in CDATA if it contains '<' or '&'.
    // Note that the '&' of the entities generated above matches as well.
    $iscdata = preg_match('/[<&]/', $txt2) > 0;
    // $txt2 is pure ASCII at this point, so no further utf8_encode() is needed
    return $iscdata ? '<![CDATA[' . $txt2 . ']]>' : $txt2;
}
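For comparison, the same conversion can be sketched more compactly when the mbstring extension is available. This is an alternative approach, not the function above: the name to_numeric_entities() is hypothetical, it assumes valid UTF-8 input and PHP 7.2 or later (for mb_ord()), and it leaves out the entity decoding and the CDATA step.

```php
<?php
// Hypothetical helper: replace every non-ASCII character of a UTF-8
// string with a hexadecimal numeric character reference.
function to_numeric_entities(string $s): string
{
    // Split into whole characters; with the /u flag preg_split()
    // returns false if the input is not valid UTF-8.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    $out = '';
    foreach ($chars as $ch) {
        if (strlen($ch) === 1) {
            // Single byte means ASCII: copy unchanged
            $out .= $ch;
        } else {
            // Multi-byte character: emit its code point in hex
            $out .= sprintf('&#x%02x;', mb_ord($ch, 'UTF-8'));
        }
    }
    return $out;
}

// "Grüße €", written with escapes so the file encoding does not matter
echo to_numeric_entities("Gr\u{fc}\u{df}e \u{20ac}"), "\n"; // Gr&#xfc;&#xdf;e &#x20ac;
```

Unlike the byte-wise loop, this version rejects malformed input outright instead of encoding stray bytes one by one.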