10/11/2010

10-11-10 - Windows 1252 to ASCII best fit

I'd like to construct a Windows 1252 to ASCII (7 bit) best-fit visual character mapping (e.g. accented a -> a, vertical bar linedraw -> |, etc.). I couldn't find one ... okay, I did it:

const int c_windows1252_to_ascii[256] = 
{
  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79,
 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95,
 96, 97, 98, 99,100,101,102,103,104,105,106,107,108,109,110,111,
112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,
 35,102, 34, 46, 35, 35, 94, 35, 83, 60, 79, 35, 90, 35, 90, 35,
 35, 39, 39, 34, 34, 46, 45, 45,126, 84,115, 62,111, 35,122, 89,
 32, 33, 99, 35, 36, 89,124, 35, 35, 67, 97, 60, 35, 45, 82, 35,
 35, 35, 50, 51, 35, 35, 35, 46, 44, 49,111, 62, 35, 35, 35, 35,
 65, 65, 65, 65, 65, 65, 65, 67, 69, 69, 69, 69, 73, 73, 73, 73,
 68, 78, 79, 79, 79, 79, 79, 35, 79, 85, 85, 85, 85, 89, 35, 35,
 97, 97, 97, 97, 97, 97, 97, 99,101,101,101,101,105,105,105,105,
 35,110,111,111,111,111,111, 35,111,117,117,117,117,121, 35,121
};

( see this table as visible chars at cbloom.com )
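
For reference, here is a minimal sketch of how a table like this might be applied to a string. The helper name windows1252_to_ascii is made up for illustration (it is not from the original code), and the table is passed in as a parameter so the sketch does not depend on where the array lives.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of the original post): maps every
// Windows-1252 byte of `text` through a 256-entry best-fit table such as
// c_windows1252_to_ascii and returns the 7-bit result.
std::string windows1252_to_ascii(const std::string& text, const int table[256])
{
    assert(table != nullptr);

    std::string out;
    out.reserve(text.size());
    for (char c : text)
    {
        // plain char may be signed, so convert to unsigned char before
        // indexing; otherwise bytes >= 128 would produce a negative index
        const unsigned char index = static_cast<unsigned char>(c);
        out.push_back(static_cast<char>(table[index]));
    }
    return out;
}
```

The one thing that is easy to get wrong is the signed-char index; everything else is a straight lookup, since the mapping is strictly one byte in, one byte out.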

This was generated by the Win32 functions and it's not perfect. It gets the accented chars right, but it just gives up on the funny chars and puts the "default" char in, which I set to "#" (35). That is probably not the ideal choice.
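
The post doesn't say which Win32 calls were used, so the following is only a guess at what such a generator could look like: each byte goes from code page 1252 to UTF-16 with MultiByteToWideChar, then to code page 20127 (US-ASCII) with WideCharToMultiByte, which takes the default char as a parameter. The function name build_windows1252_to_ascii and the choice of code page 20127 are assumptions, not the author's code. On non-Windows builds the function just returns false.

```cpp
#include <cassert>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical reconstruction, NOT the author's generator.
// Fills out[0..255] and returns true on Windows; returns false elsewhere.
bool build_windows1252_to_ascii(int out[256], char default_char)
{
    assert(out != nullptr);
#ifdef _WIN32
    for (int i = 0; i < 256; ++i)
    {
        const char in = static_cast<char>(i);

        // Windows-1252 byte -> UTF-16
        wchar_t wide[4];
        const int wide_len = MultiByteToWideChar(1252, 0, &in, 1, wide, 4);

        // UTF-16 -> 7-bit ASCII (code page 20127 is an assumption);
        // default_char is substituted when Windows has no mapping
        char ascii[8];
        int ascii_len = 0;
        if (wide_len > 0)
        {
            ascii_len = WideCharToMultiByte(20127, 0, wide, wide_len,
                                            ascii, static_cast<int>(sizeof(ascii)),
                                            &default_char, NULL);
        }

        // if either call failed, fall back to the default char as well
        out[i] = (ascii_len > 0) ? static_cast<unsigned char>(ascii[0])
                                 : static_cast<unsigned char>(default_char);
    }
    return true;
#else
    (void)default_char;
    return false;
#endif
}
```

Calling it with '#' as default_char would mirror the choice described above; whether the output matches the posted table entry for entry is not something this sketch claims.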

So anyway, it would be better to have a hand-tweaked table if you can find it.
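
One cheap way to get part of the way to a hand-tweaked table is to keep the generated one and patch individual entries on top of it. The sketch below assumes a mutable copy of the table (the posted array is const); the struct, the function name and the particular overrides are all illustrative picks, not a vetted best-fit list. The four bytes chosen all come out as "#" (35) in the generated table.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical override entry: a Windows-1252 byte and the ASCII char
// that should replace whatever the generated table holds for it.
struct AsciiOverride
{
    unsigned char code;
    char ascii;
};

// Illustrative choices only; pick whatever looks right to you.
static const AsciiOverride c_example_overrides[] =
{
    { 0x80, 'E' },  // euro sign
    { 0xD7, 'x' },  // multiplication sign
    { 0xDF, 's' },  // sharp s
    { 0xF7, '/' },  // division sign
};

// Writes each override into a mutable copy of the 256-entry table.
void apply_ascii_overrides(int table[256],
                           const AsciiOverride* overrides,
                           std::size_t count)
{
    assert(table != nullptr);
    assert(overrides != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i)
    {
        table[overrides[i].code] = overrides[i].ascii;
    }
}
```

Keeping the tweaks in a separate list means the generated table can be regenerated at any time without losing the hand edits.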

Some links I found that were not particularly helpful :

Index of /Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit
Character sets
Cast of Characters- ASCII, ANSI, UTF-8 and all that
ASCII Table 7-bit
ASCII Character Map
ANSI character set and equivalent Unicode and HTML characters
A tutorial on character code issues


Also, in further news from the "printf with wide chars considered harmful" front, I've discovered it can cause the entire printf to fail, not merely fail to convert the string well.

I get some wchar string from some perfectly reasonable source (such as MultiByteToWideChar or from a file name) and try to print it with printf %S (capital S for wide chars). The problem is at this point (output.c in the MSVC CRT) :


    e = _WCTOMB_S(&retval, L_buffer, _countof(L_buffer), *p++);
    if (e != 0 || retval == 0) {
        charsout = -1;
        break;
    }

because it's decided that the wchar is no good for some reason. wctomb_s will fail "if the conversion is not possible in the current locale". It winds up failing the whole printf (which causes it to return -1 and set errno). WTF don't fail my entire printf because you can't map one of the wchars. So fucked.

(I also have no clue why this particular wchar was failing to convert; it was like a squiggly f looking thing, it showed up just fine in the MSVC watch window, but for some reason the CRT locale shit didn't like it).

see :

_set_invalid_parameter_handler (CRT)
_setmbcp (CRT)
setlocale, _wsetlocale (CRT)
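
If you don't have autoprintf, the all-or-nothing failure described above can be avoided by doing the wide-to-narrow conversion yourself, one char at a time, and substituting a fallback for anything the current locale can't represent. This is a minimal sketch using the standard wcrtomb; the function name narrow_lossy and the '?' fallback are my own choices for illustration, and it converts to the current C locale's encoding, not specifically to the console code page.

```cpp
#include <cassert>
#include <climits>
#include <cwchar>
#include <string>

// Hypothetical helper: converts a wide string to the current locale's
// narrow encoding, writing `fallback` for every wchar that cannot be
// converted instead of failing the whole string.
std::string narrow_lossy(const wchar_t* ws, char fallback = '?')
{
    assert(ws != nullptr);

    std::string out;
    std::mbstate_t state = std::mbstate_t();
    char buf[MB_LEN_MAX];

    for (; *ws != L'\0'; ++ws)
    {
        const std::size_t n = std::wcrtomb(buf, *ws, &state);
        if (n == static_cast<std::size_t>(-1))
        {
            // unconvertible in this locale: substitute and reset the
            // shift state, which is unspecified after an error
            out.push_back(fallback);
            state = std::mbstate_t();
        }
        else
        {
            out.append(buf, n);
        }
    }
    return out;
}
```

Usage is then printf("%s", narrow_lossy(wide_string).c_str()), so printf only ever sees narrow chars. In the default "C" locale a char like U+0192 (f with hook) becomes '?' and the rest of the string still prints.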

Anyway, my recommended best practice remains "don't use wide chars in printf" , unless you use autoprintf and let it convert them to console code page for you. (note that wstrings are converted automatically, but raw wchars you have to call ToString() to make them convert)

If you use autoprintf and put this somewhere, it will handle the std string variants nicely :


inline const char * autoprintf_StringToChar (const std::string & rhs)
{
    return rhs.c_str();
}

inline const String ToString( const std::wstring & rhs)
{
    return autoPrintfWChar(rhs.c_str());
}

1 comment:

cbloom said...

I think this has failed to post for the last two months and it just went up now.

The failure was because I had some out of range ascii characters, which causes an error 400 from blogger.

TODO : my TextBlog needs to pop up a GUI messagebox when it has an error.
