<?php
/**
 * CodeIgniter
 *
 * An open source application development framework for PHP 5.2.4 or newer
 *
 * NOTICE OF LICENSE
 *
 * Licensed under the Open Software License version 3.0
 *
 * This source file is subject to the Open Software License (OSL 3.0) that is
 * bundled with this package in the files license.txt / license.rst. It is
 * also available through the world wide web at this URL:
 * http://opensource.org/licenses/OSL-3.0
 * If you did not receive a copy of the license and are unable to obtain it
 * through the world wide web, please send an email to
 * licensing@ellislab.com so we can send you a copy immediately.
 *
 * @package		CodeIgniter
 * @author		EllisLab Dev Team
 * @copyright	Copyright (c) 2008 - 2014, EllisLab, Inc. (http://ellislab.com/)
 * @license		http://opensource.org/licenses/OSL-3.0 Open Software License (OSL 3.0)
 * @link		http://codeigniter.com
 * @since		Version 2.0
 * @filesource
 */
defined('BASEPATH') OR exit('No direct script access allowed');

/**
 * Utf8 Class
 *
 * Provides support for UTF-8 environments
 *
 * @package		CodeIgniter
 * @subpackage	Libraries
 * @category	UTF-8
 * @author		EllisLab Dev Team
 * @link		http://codeigniter.com/user_guide/libraries/utf8.html
 */
class CI_Utf8 {

	/**
	 * Class constructor
	 *
	 * Determines if UTF-8 support is to be enabled.
	 *
	 * @return	void
	 */
	public function __construct()
	{
		log_message('debug', 'Utf8 Class Initialized');

		$charset = strtoupper(config_item('charset'));

		// Set the internal encoding for multibyte string functions if necessary
		// and set a flag so we don't have to repeatedly use extension_loaded()
		// or function_exists()
		if (extension_loaded('mbstring'))
		{
			define('MB_ENABLED', TRUE);
			mb_internal_encoding($charset);
			// This is required for mb_convert_encoding() to strip invalid characters
			ini_set('mbstring.substitute_character', 'none');
		}
		else
		{
			define('MB_ENABLED', FALSE);
		}

		// Do the same for iconv, which actually has easier-to-remember
		// predefined constants (such as ICONV_IMPL), but the iconv PHP
		// manual page says that using them is "strongly discouraged".
		if (extension_loaded('iconv'))
		{
			define('ICONV_ENABLED', TRUE);
			iconv_set_encoding('internal_encoding', $charset);
		}
		else
		{
			define('ICONV_ENABLED', FALSE);
		}

		if (
			defined('PREG_BAD_UTF8_ERROR')				// PCRE must support UTF-8
			&& (ICONV_ENABLED === TRUE OR MB_ENABLED === TRUE)	// iconv or mbstring must be installed
			&& $charset === 'UTF-8'					// Application charset must be UTF-8
			)
		{
			define('UTF8_ENABLED', TRUE);
			log_message('debug', 'UTF-8 Support Enabled');
		}
		else
		{
			define('UTF8_ENABLED', FALSE);
			log_message('debug', 'UTF-8 Support Disabled');
		}
	}

	// --------------------------------------------------------------------

	/**
	 * Clean UTF-8 strings
	 *
	 * Ensures strings contain only valid UTF-8 characters.
	 *
	 * @uses	CI_Utf8::_is_ascii()	Decide whether a conversion is needed
	 *
	 * @param	string	$str	String to clean
	 * @return	string
	 */
	public function clean_string($str)
	{
		if ($this->_is_ascii($str) === FALSE)
		{
			if (ICONV_ENABLED)
			{
				$str = @iconv('UTF-8', 'UTF-8//IGNORE', $str);
			}
			elseif (MB_ENABLED)
			{
				$str = mb_convert_encoding($str, 'UTF-8', 'UTF-8');
			}
		}

		return $str;
	}

	// --------------------------------------------------------------------

	/**
	 * Remove ASCII control characters
	 *
	 * Removes all ASCII control characters except horizontal tabs,
	 * line feeds, and carriage returns, as all others can cause
	 * problems in XML.
	 *
	 * @param	string	$str	String to clean
	 * @return	string
	 */
	public function safe_ascii_for_xml($str)
	{
		return remove_invisible_characters($str, FALSE);
	}

	// --------------------------------------------------------------------

	/**
	 * Convert to UTF-8
	 *
	 * Attempts to convert a string to UTF-8.
	 *
	 * @param	string	$str		Input string
	 * @param	string	$encoding	Input encoding
	 * @return	string	$str encoded in UTF-8 or FALSE on failure
	 */
	public function convert_to_utf8($str, $encoding)
	{
		if (ICONV_ENABLED)
		{
			return @iconv($encoding, 'UTF-8', $str);
		}
		elseif (MB_ENABLED === TRUE)
		{
			return @mb_convert_encoding($str, 'UTF-8', $encoding);
		}

		return FALSE;
	}

	// --------------------------------------------------------------------

	/**
	 * Is ASCII?
	 *
	 * Tests if a string is standard 7-bit ASCII or not.
	 *
	 * @param	string	$str	String to check
	 * @return	bool
	 */
	protected function _is_ascii($str)
	{
		return (preg_match('/[^\x00-\x7F]/S', $str) === 0);
	}

}

/* End of file Utf8.php */
/* Location: ./system/core/Utf8.php */