November 2006

Deal with Word characters in ISO-8559-1 [solved]

I spent way too much time on this, trying all sorts of solutions, for how easy it finally was to solve. I want to list everything I tried though first so others having this same problem might find this post.

I have a new programming gig. I was assigned a bug where the application we’re using generated a SOAP fault if a user pasted content from a Word doc into a form. The SOAP call specified ISO-8559-1 as the character set that it uses, and Word evidentially uses another character set. I tried changing the call to UTF-8 but it didn’t help.

Using ord() in PHP to get the character numbers of the bad characters, I detected that the nasty characters included chr(145), chr(146), chr(147), chr(148), chr(133), chr(150), chr(151), chr(210), chr(211), chr(128), chr(153), chr(156), and chr(157). These are the stupid “smart” or “balanced” quotation marks, left and right single and double quotes, em dashes, en dashes, and elipses (three dots …) .

However, I was unable to find and replace these characters using a variety of methods. It turned out the encoding from Word was in Windows-1252, which is a multibyte encoding that str_replace isn’t going to find when using chr().

I tried mb_check_encoding but it didn’t work, possibly because it wasn’t supported on my server. I tried iconv() but it kept giving me “illegal character” and “cannot convert encoding” errors. Frustratingly, iconv did not honor the //TRANSLIT and //IGNORE flags as documented on, which were supposed to convert or ignore the bad characters automatically.

I did find a preg_replace function on the comments of iconv on that stopped the soap faults by converting all bad characters to question marks, but the question marks wouldn’t have satisfied the client, and on continuous re-edits the question marks doubled and tripled themselves.

Finally, I found the answer here:
(BTW, when did they open up the webmasterworld forums again? They’ve been locked for like a year. Sweet!)
It turns out if you add an accept-charset=”UTF-8″ flag in your input form tag, the browser will do all the character conversions for you automatically before it even reaches the PHP, SOAP or Java. Even though my SOAP call was in ISO-8559-1, UTF-8 passes through fine and all of the weird word gremlins remain unchanged.

—–  ©®™…‘’“”


Can’t get DP to recognize the MPD’s transport [unsolved]

Digital performer won’t recognize the MMC messages coming from the MPD 24’s transport. Anyone know how to get it to work? It’s not listed as a control surface option and doesn’t seem to listen to it via any MIDI inputs.

On the plus side, I got the MPD remote functionality working with Reason, the problem was I had to upgrade to Reason 3.0.5 for the luacodec to work. If I launch reason after DP then Reason can listen to the MPD via transport and control DP’s transport via rewire. Sweet.