Problem description:

I have a text from which I need to remove some characters. These characters are shown in the pictures attached to the question. I want to replace them with a space (\x20).

My attempt was to use preg_replace.

$result = preg_replace("/[\xef\x82\xac\x09|\xef\x81\xa1\x09]/", "\x20", $string);

This approach works in one particular case but not in others: for example, I had a text containing a comma, and the pattern matched the byte x82 and removed it from that text.

How can I write my regex so that it searches for exactly the sequence ef 82 ac 09, or the other one ef 81 a1 09, and not for each byte separately (ef, 82, ac, 09)?

Answer:

1.) Your character class matches any single one of the 6 different bytes, or a pipe character. You probably wanted a group (?:...|...) to match the alternative byte sequences.

2.) The byte sequences also do not match the image. It looks like two bytes were swapped. The picture shows: ef 82 a1 09 and ef 81 ac 09, whereas your attempt uses: \xef\x82\xac\x09 | \xef\x81\xa1\x09

3.) When testing your input sample

$str = "de la nouvelle;      Fourniture $         Option :";

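// Split the string into UTF-8 characters and dump each one next to its hex bytes.
// PREG_SPLIT_NO_EMPTY drops the empty pieces at the start and end.
foreach (preg_split("//u", $str, -1, PREG_SPLIT_NO_EMPTY) as $v) {
  var_dump($v, bin2hex($v)); echo "\n";
}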

it turned out that the 09 does not belong in the pattern. The characters to be removed are actually ef81ac and ef82a1, so the right regex would be (?:\xef\x81\xac|\xef\x82\xa1)

$result = preg_replace("/(?:\xef\x81\xac|\xef\x82\xa1)/", "\x20", $string);

See test at eval.in
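
To make point 1 concrete, here is a minimal sketch comparing the character class from the question with the group from this answer. The sample text and the variable names ($text, $byClass, $byGroup) are made up for illustration; the euro sign (bytes e2 82 ac) stands in for any ordinary character that happens to share a byte with the sequences being removed.

```php
<?php
// Hypothetical sample: one unwanted character (ef 82 a1) and a euro sign (e2 82 ac).
$text = "Prix\xef\x82\xa1: 5 \xe2\x82\xac";

// Character class: each listed byte matches on its own, so the 82 and ac bytes
// inside the euro sign are replaced as well.
$byClass = preg_replace("/[\xef\x82\xac\x09|\xef\x81\xa1\x09]/", "\x20", $text);

// Non-capturing group with alternation: only a complete three-byte sequence matches.
$byGroup = preg_replace("/(?:\xef\x81\xac|\xef\x82\xa1)/", "\x20", $text);

var_dump(bin2hex($byClass)); // the euro sign is damaged: a lone e2 byte remains
var_dump(bin2hex($byGroup)); // the euro sign's bytes e2 82 ac are intact

// preg_match with the /u modifier returns false for a subject that is not valid UTF-8,
// which gives a quick check for whether a replacement corrupted the text.
var_dump(preg_match("//u", $byClass));
var_dump(preg_match("//u", $byGroup));
```

The character class version leaves a stray e2 byte behind, which is the same kind of damage the question describes for the comma.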

Answer:

If the whole file is UTF-8 encoded text, then you might want to remove all characters from the Private Use Area, since \xef\x82\xac decodes to code point U+F0AC and \xef\x81\xa1 decodes to code point U+F061, both of which belong to the Private Use Area U+E000..U+F8FF.

$result = preg_replace("~\p{Co}~u", " ", $input);

\p{Co} is the character class of all characters belonging to the Other, Private Use category in Unicode, which includes all characters in the 3 ranges U+E000..U+F8FF, U+F0000..U+FFFFD, U+100000..U+10FFFD.
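
As a rough check of the decoding above, the following sketch converts the three-byte sequences by hand and tests each one against \p{Co}. The helper utf8ThreeByteCodePoint is a hypothetical function written for this example, not a PHP built-in, and it assumes its input is a well-formed three-byte UTF-8 sequence. The list covers both the sequences from the question and the ones found in the first answer.

```php
<?php
// Hypothetical helper: decode one three-byte UTF-8 sequence
// (bit layout 1110xxxx 10xxxxxx 10xxxxxx) into its code point.
function utf8ThreeByteCodePoint(string $bytes): int
{
    return ((ord($bytes[0]) & 0x0F) << 12)
         | ((ord($bytes[1]) & 0x3F) << 6)
         |  (ord($bytes[2]) & 0x3F);
}

$sequences = ["\xef\x82\xac", "\xef\x81\xa1", "\xef\x81\xac", "\xef\x82\xa1"];

foreach ($sequences as $seq) {
    $codePoint = utf8ThreeByteCodePoint($seq);
    // With the /u modifier the subject is read as UTF-8, so \p{Co} tests the whole character.
    $isPrivateUse = preg_match('/^\p{Co}$/u', $seq) === 1;
    printf("%s -> U+%04X, private use: %s\n", bin2hex($seq), $codePoint, $isPrivateUse ? "yes" : "no");
}

// \p{Co} leaves ordinary characters alone: here only the trailing
// private-use character is replaced, and the euro sign is kept.
$cleaned = preg_replace('~\p{Co}~u', ' ', "5 \xe2\x82\xac\xef\x82\xa1");
var_dump(bin2hex($cleaned));
```

Because the /u modifier makes the pattern work on whole characters rather than bytes, this approach cannot split a multi-byte character the way a byte-level character class can. It does remove every private-use character, not just the two in question.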
