@kyanny's blog

My life. Opinions are my own.

Unicode ZERO WIDTH SPACE problem

Today I ran into a problem with an XMLRPC call.

An application calls an external XMLRPC service via Ruby's xmlrpc/client library. Sometimes the server returns an error message such as Invalid XMLRPC request "not well-formed (invalid token)". The XML document includes various data.

I suspected that the XML document included some control characters, so I dumped the data to a text file. Reading it with less(1), I found `ESC' and <U+200B>.

(ESC(I00;)

First made<U+200B>
<U+200B>.Lovely meal.
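The same kind of inspection can be done in Ruby instead of less(1). This is only a minimal sketch: the helper name find_suspicious_chars and the SUSPICIOUS pattern are my own for illustration, and it assumes Ruby 2.4 or later and UTF-8 input.

```ruby
# Hypothetical pattern: ASCII control characters (except tab, LF, CR),
# DEL, and ZERO WIDTH SPACE (U+200B).
SUSPICIOUS = /[\x00-\x08\x0B\x0C\x0E-\x1F\x7F\u200B]/

# Hypothetical helper: returns [character index, code point] pairs
# for every suspicious character found in text.
def find_suspicious_chars(text)
  hits = []
  text.each_char.with_index do |char, index|
    hits << [index, format("U+%04X", char.ord)] if char.match?(SUSPICIOUS)
  end
  hits
end

sample = "First made\u200B\e.Lovely meal."
find_suspicious_chars(sample).each do |index, codepoint|
  puts "#{codepoint} at character #{index}"
end
# prints:
# U+200B at character 10
# U+001B at character 11
```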

ESC is the escape character. I opened the data in Emacs with US-ASCII encoding and finally found the raw escape character, `^['.

<U+200B> is the Unicode character ZERO WIDTH SPACE. I saw \342\200\213 in the Emacs US-ASCII buffer. I'm not good at Unicode, but I can

In Ruby, the following code represents the ZERO WIDTH SPACE character.

https://gist.github.com/1098835

RUBY_VERSION #=>
zero_width_space = "\xE2\x80\x8B" #=>
zero_width_space #=>
zero_width_space.length #=>
white_space = "" #=>
white_space #=>
white_space.length #=>

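Here is the result. In Ruby 1.8.7, ZERO WIDTH SPACE is a string of length 3, because String#length counts bytes. In Ruby 1.9.2, it is a string of length 1, because String#length counts characters; the string is still 3 bytes long in UTF-8.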

RUBY_VERSION # => "1.8.7"
zero_width_space = "\xE2\x80\x8B" # => "\342\200\213"
zero_width_space # => "\342\200\213"
zero_width_space.length # => 3
white_space = "" # => ""
white_space # => ""
white_space.length # => 0

RUBY_VERSION # => "1.9.2"
zero_width_space = "\xE2\x80\x8B" # => "&#8203;"
zero_width_space # => "&#8203;"
zero_width_space.length # => 1
white_space = "" # => ""
white_space # => ""
white_space.length # => 0
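To see both numbers at once, the following small sketch compares the character count and the byte count. It assumes Ruby 1.9 or later with a UTF-8 source file; the "\u200B" escape is just another way to write the same character.

```ruby
zero_width_space = "\u200B"       # same character as "\xE2\x80\x8B" in UTF-8

puts zero_width_space.length      # => 1 (characters)
puts zero_width_space.bytesize    # => 3 (bytes)
p zero_width_space.bytes          # => [226, 128, 139]
p zero_width_space.unpack("U*")   # => [8203], i.e. 0x200B
```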

I removed these characters with String#sub! and String#gsub!.

string.sub!(/\e/,'')

string.gsub!(/#{zero_width_space}/,'')

It worked well.
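Note that String#sub! only removes the first match, and both sub! and gsub! return nil when nothing was replaced. If the data may contain several ESC characters, a non-destructive gsub over both characters might be safer. This is a sketch under that assumption; the helper name strip_invisible_chars is hypothetical, and it assumes Ruby 1.9 or later with UTF-8 strings.

```ruby
# Hypothetical helper: returns a copy of text with every ESC (U+001B)
# and ZERO WIDTH SPACE (U+200B) removed. The original string is not modified.
def strip_invisible_chars(text)
  text.gsub(/[\e\u200B]/, "")
end

dirty = "First made\u200B\n\u200B.Lovely \e\emeal."
clean = strip_invisible_chars(dirty)
p clean  # => "First made\n.Lovely meal."
```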

id:dayflower gives an easy-to-understand description of ZERO WIDTH SPACE.