
A string processing rant

January 30, 2013

This post is part of a series – go here for the index.

Rants are not usually the style of this blog, but this one I just don’t want to keep in. So if you’re curious, the actual information content of this post will be as follows: C string handling functions kinda suck. So does C++’s std::string. Dealing with wide character strings using only standard C/C++ functionality is absolutely horrible. And VC++’s implementation of said functionality is a damn minefield. That’s it. You will not actually learn anything more from reading this post. Continue at your own risk.

UPDATE: As “cmf” points out in the comments, there are actually C++ functions that seem to do what I wanted in the first place with a minimum amount of fuss. Goes to show, the one time I post a rant on this blog and of course I’m wrong! :) That said, I do stand by my criticism of the numerous API flaws I point out in this post, and as I point out in my reply, the discoverability of this solution is astonishingly low; when I ran into this problem originally, not being familiar with std::wstring, I googled a bit and checked out Stack Overflow and several other coding sites and mailing list archives, and what appears to be the right solution showed up on none of the pages I ran into. So at the very least I’m not alone. Ah well.
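For reference, here is a minimal sketch of what the std::wstring_convert route from the comments could look like. It has not been profiled, it assumes the input bytes are UTF-8 (the post itself goes through the current C locale instead), and the helper name Utf8RangeToWide is made up for illustration. Note that std::wstring_convert and std::codecvt_utf8 are deprecated as of C++17.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: converts the UTF-8 bytes in [begin, end) to a wide
// string without requiring a terminating NUL in the input.
// from_bytes throws std::range_error if the input is not valid UTF-8.
// With a 16-bit wchar_t (as on Windows), codecvt_utf8<wchar_t> produces UCS-2;
// std::codecvt_utf8_utf16<wchar_t> would be the UTF-16 variant.
inline std::wstring Utf8RangeToWide(const char* begin, const char* end)
{
    assert(begin <= end);
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(begin, end);
}
```

Called on the three bytes 'h', 0xC3, 0xA9 this yields a two-character wide string whose second character is U+00E9.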

The backstory

I spent most of last weekend playing around with my fork of Intel’s Software Occlusion Culling sample. I was trying to optimize some of the hot spots, a process that involves a lot of small (or sometimes not-so-small) modifications to the code followed by a profiling run to see if they help. Now unfortunately this program, at least in its original version, had loading times of around 24 seconds on my machine, and having to wait for those 24 seconds every time before you can even start the profiling run (which takes another 10-20 seconds if you want useful results) gets old fast, so I decided to check whether there was a simple way to shorten the load times.

Since I already had the profiler set up, I decided to take a peek into the loading phase. The first big time-waste I found was a texture loader that was unnecessarily decompressing a larger DXT skybox texture, and then recompressing it again. I won’t go into details here; suffice it to say that once I had identified the problem, it was straightforward enough to fix, and it cut down loading time to about 12 seconds.

My next profiling run showed me this:

Loading hotspots

I’ve put a red dot next to functions that are called either directly or indirectly by the configuration-file class CPUTConfigFile. Makes you wonder, doesn’t it? Lest you think I’m exaggerating, here’s some of the call stacks for our #2 function, malloc:

Loading hotspots: calls to malloc()

Here’s the callers to #5, free:

Loading hotspots: calls to free()

And here’s memcpy, further down:

Loading hotspots: calls to memcpy()

I have to say, I’ve done optimization work on lots of projects over the years, and it’s rare that you’ll see a single piece of functionality leave a path of destruction this massive in its wake. The usual patterns you’ll see are either “localized performance hog” (a few functions completely dominating the profile, like I saw in the first round with the texture loading) or the “death by a thousand paper cuts”, where the profile is dominated by lots of “middle-man” functions that let someone else do the actual work but add a little overhead each time. As you can see, that’s not what’s going on here. What we have here is the rare “death in all directions” variant. Why settle for paper cuts, just go straight for the damn cluster bomb!

At this point it was clear that the whole config file thing needed some serious work. But first, I was curious. Config file loading and config block handling, sure. But what was that RemoveWhitespace function doing there? So I took a look.

How not to remove whitespace

Let’s cut straight to the chase: Here’s the code.

void RemoveWhitespace(cString &szString)
{
    // Remove leading whitespace
    size_t nFirstIndex = szString.find_first_not_of(_L(' '));
    if(nFirstIndex != cString::npos)
    {
        szString = szString.substr(nFirstIndex);
    }

    // Remove trailing newlines
    size_t nLastIndex = szString.find_last_not_of(_L('\n'));
    while(nLastIndex != szString.length()-1)
    {
        szString.erase(nLastIndex+1,1);
        nLastIndex = szString.find_last_not_of(_L('\n'));
    };
    // Tabs
    nLastIndex = szString.find_last_not_of(_L('\t'));
    while(nLastIndex != szString.length()-1)
    {
        szString.erase(nLastIndex+1,1);
        nLastIndex = szString.find_last_not_of(_L('\t'));
    };
    // Spaces
    nLastIndex = szString.find_last_not_of(_L(' '));
    while(nLastIndex != szString.length()-1)
    {
        szString.erase(nLastIndex+1,1);
        nLastIndex = szString.find_last_not_of(_L(' '));
    };
}

As my current and former co-workers will confirm, I’m generally a fairly calm, relaxed person. However, in moments of extreme frustration, I will (on occasion) perform a “*headdesk*”, and do so properly.

This code did not drive me quite that far, but it was a close call.

Among the many things this function does wrong are:

  • While it’s supposed to strip all leading and trailing white space (not obvious from the function itself, but clear in context), it will only trim leading spaces. So for example leading tabs won’t get stripped, nor will any spaces that follow after those tabs.
  • The function will remove trailing spaces, tabs, and newlines – provided they occur in exactly that order: first all spaces, then all tabs, then all newlines. But the string “test\t \n” will get trimmed to “test\t” with the tab still intact, because the tab-stripping loop will only remove tabs that occur at the end of the string after the newlines have been removed.
  • It removes white space characters it finds front to back rather than back to front. Because of the way C/C++ strings work, this is an O(N²) operation. For example, take a string consisting only of tabs: every pass through the tab loop rescans the string and then erases its first character, which shifts all remaining characters down by one.
  • The substring operation creates an extra temporary string; while not horrible by the standards of what else happens in this function, it’s now becoming clear why RemoveWhitespace manages to feature prominently in the call stacks for malloc, free and memcpy at the same time.
  • And let’s not even talk about how many times the string is scanned from front to back.
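The first two points are easy to check in isolation. The sketch below restates the same logic against std::wstring directly (assuming, as noted further down, that cString is a typedef for it, and replacing the _L macro with plain L'' literals); the three trailing loops are folded into one loop over the three characters, which does not change the behavior. The names RemoveWhitespaceOriginal and TrimmedByOriginal are made up for this example.

```cpp
#include <cassert>
#include <string>

// Same steps as the function above: strip leading spaces only, then strip
// trailing newlines, then trailing tabs, then trailing spaces, in that order.
inline void RemoveWhitespaceOriginal(std::wstring& s)
{
    std::wstring::size_type first = s.find_first_not_of(L' ');
    if (first != std::wstring::npos)
        s = s.substr(first);

    const wchar_t passes[] = { L'\n', L'\t', L' ' };
    for (wchar_t ch : passes)
    {
        std::wstring::size_type last = s.find_last_not_of(ch);
        while (last != s.length() - 1)
        {
            s.erase(last + 1, 1);          // erase one character per iteration
            last = s.find_last_not_of(ch); // then rescan from the end
        }
    }
}

// Convenience wrapper for experimenting: returns the "trimmed" copy.
inline std::wstring TrimmedByOriginal(std::wstring s)
{
    RemoveWhitespaceOriginal(s);
    return s;
}
```

With this, TrimmedByOriginal(L"test\t \n") comes back as L"test\t", and TrimmedByOriginal(L"\t key") comes back unchanged, matching the two failure cases described in the list.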

That by itself would be bad enough. But it turns out that in context, not only is this function badly implemented, most of the work it does is completely unnecessary. Here’s one of its main callers, ReadLine:

CPUTResult ReadLine(cString &szString, FILE *pFile)
{
    // TODO: 128 chars is a narrow line.  Why the limit?
    // Is this not really reading a line, but instead just reading the next 128 chars to parse?
    TCHAR   szCurrLine[128] = {0};
    TCHAR *ret = fgetws(szCurrLine, 128, pFile);
    if(ret != szCurrLine)
    {
        if(!feof(pFile))
        {
            return CPUT_ERROR_FILE_ERROR;
        }
    }

    szString = szCurrLine;
    RemoveWhitespace(szString);

    // TODO: why are we checking feof twice in this loop?
    // And, why are we using an error code to signify done?
    // eof check should be performed outside ReadLine()
    if(feof(pFile))
    {
        return CPUT_ERROR_FILE_ERROR;
    }

    return CPUT_SUCCESS;
}

I’ll let the awesome comments speak for themselves – and for the record, no, this thing really is supposed to read a line, and the ad-hoc parser that comes after this will get out of sync if it’s ever fed a line with more than 128 characters in it.

But the main thing of note here is that szString is assigned from a C-style (wide) string. So the sequence of operations here is that we’ll first allocate a cString (which is a typedef for a std::wstring, by the way), copy the line we read into it, then call RemoveWhitespace which might create another temporary string in the substr call, to follow it up with several full-string scans and possibly memory moves.

Except all of this is completely unnecessary. Even if we need the output to be a cString, we can just start out with a subset of the C string to begin with, rather than taking the whole thing. All RemoveWhitespace really needs to do is tell us where the non-whitespace part of the string begins and ends. You can either do this using C-style string handling or, if you want it to “feel more C++”, you can express it by iterator manipulation:

static bool iswhite(int ch)
{
    return ch == _L(' ') || ch == _L('\t') || ch == _L('\n');
}

template<typename Iter>
static void RemoveWhitespace(Iter& start, Iter& end)
{
    while (start < end && iswhite(*start))
        ++start;

    while (end > start && iswhite(*(end - 1)))
        --end;
}

Note that this is not only much shorter, it also correctly deals with all types of white space both at the beginning and the end of the line. Instead of the original string assignment we then do:

    // TCHAR* obeys the iterator interface, so...
    TCHAR* start = szCurrLine;
    TCHAR* end = szCurrLine + _tcslen(szCurrLine);
    RemoveWhitespace(start, end);
    szString.assign(start, end);

Note how I use the iterator range form of assign to set up the string with a single copy. No more substring operations, no more temporaries or O(N²) loops, and after reading we scan over the entire string no more than two times, one of those being in _tcslen. (_tcslen is an MS extension that is the equivalent of strlen for TCHAR, which might be either plain char or wchar_t depending on whether UNICODE is defined; this code happens to be using “Unicode”, that is, UTF-16.)
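Put together as something that compiles outside the sample, the approach might look like the sketch below. It assumes wchar_t and std::wstring in place of TCHAR and cString, uses std::wcslen instead of _tcslen, and the names IsConfigWhite, TrimRange and TrimmedCopy are invented for this example.

```cpp
#include <cassert>
#include <cwchar>
#include <string>

// The three characters the config format treats as white space.
inline bool IsConfigWhite(wchar_t ch)
{
    return ch == L' ' || ch == L'\t' || ch == L'\n';
}

// Narrows [start, end) to its non-whitespace core; copies nothing.
template<typename Iter>
inline void TrimRange(Iter& start, Iter& end)
{
    while (start < end && IsConfigWhite(*start))
        ++start;

    while (end > start && IsConfigWhite(*(end - 1)))
        --end;
}

// Hypothetical wrapper: one length scan, one trim scan, one copy.
inline std::wstring TrimmedCopy(const wchar_t* line)
{
    assert(line != nullptr);
    const wchar_t* start = line;
    const wchar_t* end = line + std::wcslen(line);
    TrimRange(start, end);
    return std::wstring(start, end);
}
```

For example, TrimmedCopy(L"\t key = value \t\n") yields L"key = value", and a line consisting only of white space yields an empty string.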

There’s only two other calls to RemoveWhitespace, and both of these are along the same vein as the call we just saw, so they’re just as easy to fix up.

Problem solved?

Not quite. Even with the RemoveWhitespace insanity under control, we’re still reading several megabytes worth of text files with short lines, and there’s still between 1 and 3 temporary string allocations per line in the code, plus whatever allocations are needed to actually store the data in its final location in the CPUTConfigBlock.

Long story short, this code still badly needed to be rewritten to do less string handling, so I did. My new code just reads the file into a memory buffer in one go (the app in question takes 1.5GB of memory in its original form, we can afford to allocate 650K for a text file in one block) and then implements a more reasonable scanner that processes the data in place and doesn’t do any string operations until we need to store values in their final location. Now, because the new scanner assumes that ASCII characters end up as ASCII, this will actually not work correctly with some character encodings such as Shift-JIS, where ASCII-looking characters can appear in the middle of encodings for multibyte characters (the config file format mirrors INI files, so ‘[‘, ‘]’ and ‘=’ are special characters, and the square brackets can appear as second characters in a Shift-JIS sequence). It does however still work with US-ASCII text, the ISO Latin family and UTF-8, which I decided was acceptable for a config file reader. I did still want to support Unicode characters as identifiers though, which meant I was faced with a problem: once I’ve identified all the tokens and their extents in the file, surely it shouldn’t be hard to turn the corresponding byte sequences into the std::wstring objects the rest of the code wants using standard C++ facilities? Really, all I need is a function with this signature:
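To give a rough idea of what "processing the data in place" means here, the following is a much-simplified sketch, not the actual scanner from the sample. It walks a byte buffer once and records key/value extents as pointer pairs, without creating any strings. The Span and Entry types, the function names, and the decision to simply skip lines without an '=' (such as section headers) are all assumptions made for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Span  { const char* begin; const char* end; }; // half-open byte range
struct Entry { Span key; Span value; };

inline bool IsWhite(char ch) { return ch == ' ' || ch == '\t' || ch == '\n'; }

// Shrinks a span to its non-whitespace core.
inline Span Trim(Span s)
{
    while (s.begin < s.end && IsWhite(*s.begin)) ++s.begin;
    while (s.end > s.begin && IsWhite(s.end[-1])) --s.end;
    return s;
}

// Records the extents of every "key = value" line in [begin, end).
// Lines without '=' are skipped in this sketch.
inline std::vector<Entry> ScanEntries(const char* begin, const char* end)
{
    assert(begin <= end);
    std::vector<Entry> out;
    const char* line = begin;
    while (line < end)
    {
        const char* eol = line;
        while (eol < end && *eol != '\n') ++eol;

        const char* eq = line;
        while (eq < eol && *eq != '=') ++eq;

        if (eq < eol)
        {
            Entry e = { Trim(Span{line, eq}), Trim(Span{eq + 1, eol}) };
            if (e.key.begin < e.key.end)
                out.push_back(e);
        }
        line = (eol < end) ? eol + 1 : end;
    }
    return out;
}

// Only needed once a value is stored in its final location.
inline std::string ToString(Span s) { return std::string(s.begin, s.end); }
```

The only allocations are the growth of the result vector and whatever the caller does with the spans afterwards, which is where a function like AssignStr comes in.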

void AssignStr(cString& str, const char* begin, const char* end);

Converting strings, how hard can it be?

Turns out: quite hard. I could try using assign on my cString again. That “works”, if the input happens to be ASCII only. But it just turns each byte value into the corresponding Unicode code point, which is blatantly wrong if our input text file actually has any non-ASCII characters in it.
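A small sketch of why the plain assign is wrong for non-ASCII input. It uses unsigned char on purpose so the result is the same on every platform; with plain char, which is signed on many compilers, bytes above 0x7F may additionally be sign-extended. The helper name WidenBytes is made up for this example.

```cpp
#include <cassert>
#include <string>

// Widens every byte in [begin, end) to one wchar_t, which is all the
// iterator-range assign does. No decoding of multi-byte sequences happens.
inline std::wstring WidenBytes(const unsigned char* begin, const unsigned char* end)
{
    assert(begin <= end);
    std::wstring out;
    out.assign(begin, end);
    return out;
}
```

Feeding it the UTF-8 encoding of "hé" (the bytes 'h', 0xC3, 0xA9) produces three wide characters, 'h', U+00C3 and U+00A9, instead of the two characters 'h' and U+00E9 that a real conversion would give.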

Okay, so we could turn our character sequence into a std::string, and then convert that into a std::wstring, never mind the temporaries for now, we can figure that out later… wait, WHAT? There’s actually no official way to turn a string containing multi-byte characters into a wstring? How moronic is that?

Okay, whatever. Screw C++. Just stick with C. Now there actually is a standard function to convert multi-byte encodings to wchar_t strings, and it’s called, in the usual “omit needless vowels” C style, mbstowcs. Only that function can’t be used on an input string that’s delimited by two pointers! Because while it accepts a size for the output buffer, it assumes the input is a 0-terminated C string. Which may be a reasonable protocol for most C string-handling functions, but is definitely problematic for something that’s typically used for input parsing, where you generally aren’t guaranteed to have NUL characters in the right places.

But let’s assume for a second that we’re willing to modify the input data (const be damned) and temporarily overwrite whatever is at end with a NUL character so we can use mbstowcs – and let me just remark at this point that awesomely, the Microsoft-extended safe version of mbstowcs, mbstowcs_s, accepts two arguments for the size of the output buffer, but still doesn’t have a way to control how many input characters to read – if you decide to extend a standard API anyway, why can’t you fix it at the same time? Anyway, if we just patch around in the source string to make mbstowcs happy, does that help us?

Well, it depends on how loose you’re willing to play with the C++ standard. The goal of the whole operation was to reduce the number of temporary allocations. Well, mbstowcs wants a wchar_t output buffer, and writes it like it’s a C string, including terminating NUL. std::wstring also has memory allocated, and normal implementations will store a terminating 0 wchar_t, but as far as I can tell, this is not actually guaranteed. In any case, there’s a problem, because we need to reserve the right number of wchar’s in the output string, but it’s not guaranteed to be safe to do this:

void AssignStr(cString& str, const char* begin, const char* end)
{
    // patch a terminating NUL into *end
    char* endPatch = (char*) end;
    char oldEnd = *end;
    *endPatch = 0;

    // mbstowcs with NULL arg counts how many wchar_t's would be
    // generated
    size_t numOut = mbstowcs(NULL, begin, 0);

    // make sure str has the right size
    str.resize(numOut, ' ');

    // convert characters including terminating NUL and hope it's
    // going to be OK?
    mbstowcs(&str[0], begin, numOut + 1);

    // restore the original end
    *endPatch = oldEnd;
}

This might work, or it might not. As far as I know, it would be legal for a std::wstring implementation to only append a trailing NUL character lazily whenever c_str() is first called on a particular string. Either way, it’s fairly gross. I suppose I could resize to numOut + 1 elements, and then later do another resize after the mbstowcs is done; that way should definitely be safe.
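The "resize twice" variant could look roughly like the sketch below. It makes the mutation honest by taking non-const pointers, so it assumes that [begin, end) lies in a writable buffer and that *end is itself a valid, writable byte. It relies on the same NULL-destination counting behavior of mbstowcs as the code above, and additionally reports conversion failure. The name AssignStrTwoResizes is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>

// Converts the multi-byte text in [begin, end) using the current C locale.
// Returns false (and clears str) if the text is not valid in that locale.
inline bool AssignStrTwoResizes(std::wstring& str, char* begin, char* end)
{
    assert(begin <= end);

    // Temporarily patch a terminating NUL into *end.
    const char oldEnd = *end;
    *end = 0;

    bool ok = false;
    const std::size_t numOut = std::mbstowcs(nullptr, begin, 0);
    if (numOut != static_cast<std::size_t>(-1))
    {
        str.resize(numOut + 1);                     // room for the NUL as well
        std::mbstowcs(&str[0], begin, numOut + 1);  // writes numOut chars + NUL
        str.resize(numOut);                         // drop the NUL again
        ok = true;
    }
    else
    {
        str.clear();
    }

    // Restore the original byte at end.
    *end = oldEnd;
    return ok;
}
```

The extra element means mbstowcs never writes outside the range the string has officially handed out, at the cost of a second (cheap, shrinking) resize.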

Either way is completely beside the point though. This is an actual, nontrivial operation on strings that is a totally reasonable thing to do, and that the C IO system will in fact do for me implicitly if I use fgetws. Why are all the functions dealing with this so horribly broken for this use case that’s not at all fancy? Did anyone ever look at this and decide that it was reasonable to expect people to write code like this? WHAT THE HELL?

It gets better

That’s not it quite yet, though. Because when I actually wrote the code (as opposed to summarizing it for this blog post), I didn’t think to patch in the NUL byte on the source string. So I went for the alternative API that works character by character: the C function mbtowc. Now, awesomely, because it works character by character, and is not guaranteed to see all characters in a multi-byte sequence in the same call, it has to keep state around of which partial multi-byte sequences it has seen to be able to decode characters. So it’s not thread-safe, and standard C (since the 1995 amendment) and POSIX define an extended version, mbrtowc, that makes you pass in a pointer to that state, which does make it thread-safe. At this point though, I don’t care about thread-safety (this code is single-threaded anyway), and besides, in our case I actually know that the characters between begin and end are supposed to parse correctly. So I just don’t worry about it. Also, instead of actually counting the right number of wchar_t’s ahead of time in a second pass, I just assume that the string is generally likely to have fewer wide characters than the source multi-byte string has bytes. Even if that turns out wrong (which won’t happen for conventional encodings), the std::wstring we write to can dynamically resize, so there’s not much that can go wrong. So I ended up with this implementation:

void AssignStr(cString& dest, const char* begin, const char* end)
{
    dest.clear();
    if (end <= begin)
        return;

    size_t len = end - begin;
    size_t initial = len + 1; // assume most characters are 1-byte
    dest.reserve(initial);

    const char* p = begin;
    while (p < end)
    {
        wchar_t wc;
        int len = mbtowc(&wc, p, end - p);
        if (len < 1) // NUL byte or error
            break;

        p += len;
        dest.push_back(wc);
    }
}

Looks fairly reasonable, right?

Well, one profiling session later, I noticed that performance had improved, but it turned out that I was apparently wrong to assume that, like its std::vector counterpart, std::wstring::push_back would basically compile into the moral equivalent of dest.data[dest.len++] = wc. Instead, what I saw in VTune (with a kind of morbid fascination) was about two dozen instructions worth of inlined insanity surrounding a call to std::wstring::insert. For every character. In a release build.

It’s probably the VC++ STL doing something stupid. At this point, I don’t feel like investigating why this is happening. Whatever, I’m just gonna add some more to this layer cake of insanity. Just stop thinking and start coding. So I figure that hey, if adding stuff to strings is apparently an expensive operation, well, let’s amortize it, eh? So I go for this:

void AssignStr(cString& dest, const char* begin, const char* end)
{
    dest.clear();
    if (end <= begin)
        return;

    static const int NBUF = 64;
    wchar_t buf[NBUF];
    int nb = 0;

    size_t len = end - begin;
    size_t initial = len + 1; // assume most characters are 1-byte
    dest.reserve(initial);

    const char* p = begin;
    while (p < end)
    {
        int len = mbtowc(&buf[nb++], p, end - p);
        if (len < 1) // NUL byte or error
            break;

        p += len;
        if (p >= end || nb >= NBUF)
        {
            dest.append(buf, buf + nb);
            nb = 0;
        }
    }
}

And it’s still slow, and I still get a metric ton of bullshit inlined for that call. Turns out this happens because I call the general “input iterator” variant of append which, go figure, adds character by character. Silly me! What I really should’ve called is dest.append(buf, nb). Of course! Once I figure that one out, I profile again, and sure enough, this time there’s no magic std::string functions cluttering up the profile anymore. Finally. Mission accomplished, right?

Not so fast, bucko.

Ohhh no. No, there’s one final “surprise” waiting for me. I put surprise in quotes because we already saw it in my first profile screenshot.

The final surprise

Yeah right. Those C functions we’ve been calling? In the VC++ C runtime library, all of them end up calling a constructor for a C++ object for some reason.

No, I’m not gonna comment on that one. I stopped caring a few paragraphs ago. Go ahead, put C++ code in your C runtime library. Whatever makes you happy.

So it turns out that VC++ has two versions of all the multibyte conversion functions: one that uses the current locale (which you can query using _get_current_locale()) and one that takes an explicit _locale_t parameter. And if you don’t pass in a locale yourself, mbtowc and so forth will call _get_current_locale() themselves, and that ends up calling a C++ constructor for some reason. (I don’t care, I’m in my happy place right now. La la la).

And I finally decide to screw portability – hey, it’s a VC++-only project anyway – and call _get_current_locale() once, pass it to all my calls, and the magic constructor disappears, and with it the last sign of dubious things happening in the string handling.

Hooray.
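For completeness, here is a sketch of where this ends up, combining the pointer-plus-count append with the hoisted locale lookup. It is a reconstruction under assumptions, not the code from the sample: it uses std::wstring instead of cString, flushes whatever is buffered after the loop, and only takes the _get_current_locale() / _mbtowc_l path when compiled with VC++, falling back to plain mbtowc elsewhere. The name AssignStrBuffered is made up.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#ifdef _MSC_VER
#include <locale.h>
#endif

inline void AssignStrBuffered(std::wstring& dest, const char* begin, const char* end)
{
    assert(begin <= end);
    dest.clear();
    if (begin == end)
        return;

#ifdef _MSC_VER
    _locale_t loc = _get_current_locale(); // looked up once, not per character
#else
    std::mbtowc(nullptr, nullptr, 0);      // reset mbtowc's internal shift state
#endif

    static const int NBUF = 64;
    wchar_t buf[NBUF];
    int nb = 0;

    dest.reserve(static_cast<std::size_t>(end - begin)); // assume mostly 1-byte chars

    const char* p = begin;
    while (p < end)
    {
#ifdef _MSC_VER
        const int n = _mbtowc_l(&buf[nb], p, static_cast<std::size_t>(end - p), loc);
#else
        const int n = std::mbtowc(&buf[nb], p, static_cast<std::size_t>(end - p));
#endif
        if (n < 1) // NUL byte or error
            break;

        p += n;
        if (++nb == NBUF)
        {
            // Pointer + count overload: one bulk copy, not a per-character insert.
            dest.append(buf, static_cast<std::size_t>(nb));
            nb = 0;
        }
    }

    // Flush whatever is still buffered, including after an early break.
    dest.append(buf, static_cast<std::size_t>(nb));

#ifdef _MSC_VER
    _free_locale(loc);
#endif
}
```

In real code the locale object would more likely be fetched once for the whole file rather than once per call; it is kept inside the function here only so the sketch stays self-contained.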

Conclusions

So, what do we have here: we have a C++ string class that evidently makes it easy to write horrendously broken code without noticing it, and simultaneously doesn’t provide some core functionality that apps which use both std::wstring and interface with non-UTF16 character sets (which is almost nobody, I’m sure!) will need. We have C functions that go out of their way to make it hard to use them correctly. We have the Microsoft camp that decides that the right way to fix these functions is to fix buffer overflows, and we have the POSIX camp that decides that the right way to fix them is to fix the race condition inherent in their global state. Both of these claim that their modifications are more important than the other’s, and then there’s the faction that holds the original C standard library to be the only true way, ignoring the fact that this API is clearly horribly broken no matter how you slice it. Meanwhile, std::wstring gets another attention fix by making it unnecessarily hard to actually get data from C APIs into it without extra copying (and may I remind you that I’m only using C APIs here because there doesn’t seem to be an official C++ API!), while the VC++ standard library proves its attention deficit by somehow making a push_back to a properly pre-allocated string an expensive operation. And for the final act of our little performance, watch as a constructor gets called from C code, a veritable Deus Ex Machina that I honestly didn’t see coming.

As my friend Casey Muratori would put it: Everyone is fired.

And now excuse me while I apply some bandages and clean the blood off my desk.


44 Comments
  1. Hilarious, loved it!
    Sometimes it’s disturbing what things you’ll find if you decide to “dig” in a bit deeper…

  2. cmf

    what about http://en.cppreference.com/w/cpp/locale/codecvt/in?
    not sure if it’d faster, but at least it explicitly takes a locale.

    or this? http://en.cppreference.com/w/cpp/locale/wstring_convert/from_bytes

    • Thanks! That looks like the functions I was actually looking for. Goes to show – I post a rant without double-checking for once, and of course I end up being wrong :). Ah well, I exaggerated for comedic effect anyway.

      That said, I do stand by the actual criticism I make of the APIs I did end up using. I’ll add a note to the post, but not update the main text, since I haven’t done any profiling on this – for all I know, this might turn up some new performance gotchas (which were, after all, the reason behind me looking into this in the first place).

      And finally, I’m honestly not sure how I would’ve found this without your post. My initial Google search query “std::string to std::wstring” turned up pages like this (which does the conversion using the incorrect “assign” way), this (on the Boost ML, of all places!), which also only mentions the corresponding C functions and the Boost serialization library (together with various approaches that don’t work), this code review on Stack Exchange (which has solutions that work on Windows, but all using platform-specific calls), or this lengthy thread on Stack Overflow, which mentions <locale> not once. So, in my defense, at least I’m not alone in doing this wrong. :)

  3. Sylvain V.

    Why does that feel too familiar? Déjà vu in some corporation. I feel your pain… :(

  4. db312

    Nice writing ;). Whoever wrote the original RemoveWhitespace function needs to re-evaluate their dedication to their job. I’m going to assume they know what they are doing and chalk it up to extreme laziness or apathy.

  5. Ricardo

    Stuff like this makes me almost want to quit programming…

    I really wish that somebody would implement a portable, powerful, easy to use and well implemented C/C++ string handling library. :-/

    Any thoughts about Apple’s NSString (language specificness and unportability aside)? I found it quite convenient to use and less prone to gotchas.

  6. I find that handling actual Unicode in C++ is a nightmare unless you’re writing code inside an app that has already solved that problem. When I’m writing code in Mozilla it’s a non-problem, but elsewhere I usually just attempt to avoid the problem or say “everything is UTF-8”. The standard library is a mess here, and nobody has really bolted on a sane library. On the plus side, C++ is getting a little saner in C++11, where std::[w]string are defined to be null-terminated, and .c_str() == .data(), which is the behavior everyone wants and all STL varieties implement, but was not specified in the standard until this point. (http://stackoverflow.com/a/7554172/69326)

    I’m just not sure any language designed before Unicode expanded past the BMP can actually handle it sanely. Rust looks like it’s shaping up to do okay here, treating everything as UTF-8.

    • db312

      UTF-8 is indeed a mess. UTF-16 not so much. I much prefer UTF-16 in all cases. The wide characters make it inherently distinguishable from ASCII, unlike UTF-8. If you’re forced to use UTF-8, I would always convert it to UTF-16 incoming, then back to UTF-8 outgoing.

      • Oleg

        What are the reasons to prefer UTF-16 apart from being distinct from ASCII?

      • How do you portably use UTF-16 in C++? Isn’t wstring UTF-32 on most (all?) Unix-y platforms?

      • std::u16string as of C++11, but I’d recommend against it. UTF-8 is by far the most practical Unicode encoding, both in terms of interfacing with existing SW (except maybe Windows) and in terms of other portability headaches like endianness issues.

      • Yeah, someone else pointed that out on the Hacker News. I’m very glad it was added to C++11, but I’d argue that “needs C++11” is still fairly non-portable, alas.

        My conclusion was to go UTF-8 as much as possible too, but I’m not so set on it that I’m unwilling to listen to other notions.

      • Well, since std::string is really just a typedef anyway, you can also just define your own std::basic_string, which should be portable enough. Of course all the actual work is in the conversion and locale handling!

      • foljs

        UTF-8 is a mess? Compared to …UTF-16?

        You have not worked much with unicode and international languages, right?

        UTF-16 is like the PHP of the Unicode encoding world. UTF-8 is infinitely better designed.

        Except if you speak about your experience of handling one or the other with some specific C/C++ API, which has nothing to do with the inherent quality of them, but of the API.

      • foljs

        Hmm, I guess what you are referring to as “UTF-16” being good is the broken way MS handles it? Please, don’t go around comparing UTF-16 and UTF-8. UTF-16 is utterly borked.

      • UTF-16: All the same problems as UTF-8, but using two bytes instead of one for everything.

      • UTF-8 is fairly painless in normal usage. UTF-16 isn’t, not by a long shot.

    • Ludovic Urbain

      Actually treating everything as UTF8 is a major overhead to begin with.

      I honestly believe strings themselves are a major overhead to begin with, and should not be used if no human is going to read them ever.

      Lastly, UTF-8 seems to be handled just fine in some other languages, leading me to believe it can’t be that hard to write your own UTF-8 functions, IF you really need to manipulate strings.

  7. Another nitpick: There is no need for SAFE_DELETE / SAFE_DELETE_ARRAY:

    SoftwareOcclusionCulling/CPUT/CPUT/CPUT.h:#define SAFE_DELETE(p) {if((p)){HEAPCHECK; delete (p); (p)=NULL;HEAPCHECK; }}
    SoftwareOcclusionCulling/CPUT/CPUT/CPUT.h:#define SAFE_DELETE_ARRAY(p){if((p)){HEAPCHECK; delete[](p); (p)=NULL;HEAPCHECK; }}

    The C++ standard guarantees that one can free NULL pointer just fine with delete/delete[]/free

    • I think we can assume that SAFE_ refers to the heap-check, not the NULL-check. Since the HEAPCHECK is probably expensive, it makes some sense to not run it if the code can’t actually modify the heap (at least the post-check).

  8. Jok

    If you’re targeting Windows only, you should take a look at MultiByteToWideChar/WideCharToMultiByte.

  9. Sorry, but this is all a mess what is described here. It’s about the following:

    1) You cannot store a Unicode character encoded in UTF-16 in 16bit. That’s just wrong. So std::wstring cannot do that. Why? Because a character in Unicode can consist of more than one code point.

    2) You cannot store a Unicode code point encoded in UTF-16 in 16bit. That’s wrong, too. So std::wstring cannot do that. Why? Because there are more code points than 65535.

    3) The string search code is just wrong. In UTF-16, too, a code point is variable length, like in UTF-8. It’s not correct to use a std::wstring and iterate for code points – not to mention it’s just plain wrong for characters.

    4) The search and replace functions therefore are all wrong. And even worse, I cannot find the correct identification of whitespace in Unicode in this article at all – it’s all characters which have the binary property White_Space.

    Conclusion: better use the Unicode string implementation of your operating system, or use one of the common libraries like ICU, http://icu-project.org/

    BTW: the only Unicode standard encoding I know where you easily can jump into code points (not: characters) is UTF-32. UTF-8 is not a mess at all compared to UTF-16. Dealing correctly with UTF-16 is way more complex than with UTF-8. I’m pretty sure Windows does UTF-16 now because they started with UCS-2, which is flawed. And no, these two are not the same.

    To remember: these are the most common string types in Windows’ C/C++ API:

    PSTR, PWSTR, PCSTR, PTSTR, LPSTR, LPWSTR, LPTSTR, LPCSTR, LPCWSTR, LPCTSTR, BSTR, _bstr_t, std::string, std::wstring, CString (ATL), CString (MFC), CStringW (ATL), CStringW (MFC), CStringT (ATL), CStringT (MFC), CStringData, System::String, System.Text::StringBuilder, String^, CComBSTR (ATL), CComBSTR (MFC), char *, wchar_t *, tchar_t *, WCHAR *, TCHAR *, _TCHAR *, _TSCHAR *, _TUCHAR *, OLESTR, OLECHAR *

    None of them have a correct implementation for Unicode. Yes, that’s right: even the “modern” String^ type of .NET is wrong, i.e.:

    http://msdn.microsoft.com/en-us/library/czx8s9ts.aspx


    String^ Replace (
    wchar_t oldChar,
    wchar_t newChar
    )

    Well, but there are more Characters than wchar_t can address. And it’s even worse:

    http://msdn.microsoft.com/en-us/library/system.char.aspx

    Char Structure

    Represents a character as a UTF-16 code unit.

    That means, characters which don’t fit in a UTF-16 code unit cannot be used. A code unit is defined as the minimal bit combination that can represent a unit of encoded text.

    In Metro there is now some hope at last:

    There are HSTRING, HSTRING_BUFFER, HString, Platform::String. And there is:

    HRESULT WINAPI WindowsReplaceString(
    _In_ HSTRING string,
    _In_ HSTRING stringReplaced,
    _In_ HSTRING stringReplaceWith,
    _Out_ HSTRING *newString
    );

    which at last could work (I didn’t test extensively yet). But this is only in Metro. How do I handle that?

    I’m using the ICU library on any platform. And when ready, I’m converting for API calls.

    • “1) You cannot store an Unicode character encoded in UTF-16 in 16bit.”
      Who said you could? Windows “Unicode” used to be UCS-2 with a 16-bit wchar_t. They’re still stuck with that wchar_t, but now it’s UTF-16, and that’s what the multibyte->wide functions on Windows output. Which is another level of insanity, but that’s just how everything on Windows expects to get its strings. In this case, it doesn’t matter much, because all the actual control characters in the text file format the original code tried to read are at code points that end up as a single UTF-16 code unit – in fact, they’re all ASCII. And since UTF-16 surrogate pairs, like UTF-8 encodings, keep the ranges for self-encoding characters and those for escaped characters disjoint, the parsing works just fine.

      “2) You cannot store an Unicode code point encoded in UTF-16 in 16bit.”
      As said, where did I claim otherwise?

      “3) The string search code is just wrong.”
      By the property quoted above, in this case it is completely fine and well-defined. This is parsing a configuration file in a particular format with a grammar that recognizes only a certain number of control characters, all of which are low ASCII. Yes, I do not bother parsing surrogate pairs into code points, but nor do I need to – they’re just passed through verbatim in any case.

      “4) The search and replace functions therefore are all wrong. And even worse, I cannot find the correct identification of whitespace in Unicode in this article at all”
      I am not trying to parse free-form text or do collation. This is a parser for a text file format – namely, the text file format that is (supposed to be) recognized by the previous version of the code. That code only recognizes ‘\t’ (U+0009), ‘\n’ (U+000a) and ‘ ‘ (U+0020) as white space. This is a machine-readable format that has a grammar, and the whole point is that I need to be reading *the same data files* before and after. In the source data, the three given characters count as white space, so that’s what I have to use. Which Unicode characters have the “White_Space” property is completely immaterial here.

      As for the rest, in my own projects, I’m down to exclusively using UTF-8 for everything at this point, and converting to UTF-16 when talking to Windows APIs as necessary. But in this case, I don’t get to choose; I was working on an existing codebase on Windows that uses std::wstring with 16-bit wchar_t. Short of rewriting every piece of code in this project that deals with strings, there’s not much I can do to fix the situation. So I changed the file loading code in a manner that is consistent with the way the rest of the code expects, so that existing data continues to work, no matter how ill-defined it may happen to be.
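
      The argument in this reply can be sketched in a few lines. This is an illustration only, not the code from the project: the names `is_format_white` and `split_tokens` are hypothetical, and `std::u16string` stands in for `std::wstring` with a 16-bit `wchar_t` so that the sketch behaves the same on any platform. Because surrogate code units (0xD800 to 0xDFFF) can never compare equal to tab, newline or space, they are simply copied through:

      ```cpp
      #include <string>
      #include <vector>

      // White space exactly as the existing file format defines it:
      // '\t' (U+0009), '\n' (U+000A) and ' ' (U+0020), nothing else.
      static bool is_format_white(char16_t ch)
      {
          return ch == u'\t' || ch == u'\n' || ch == u' ';
      }

      // Hypothetical tokenizer working directly on UTF-16 code units.
      // Surrogate pairs are never decoded; they pass through verbatim.
      std::vector<std::u16string> split_tokens(const std::u16string& text)
      {
          std::vector<std::u16string> tokens;
          std::u16string current;
          for (char16_t ch : text) {
              if (is_format_white(ch)) {
                  if (!current.empty()) {
                      tokens.push_back(current);
                      current.clear();
                  }
              } else {
                  current.push_back(ch);
              }
          }
          if (!current.empty())
              tokens.push_back(current);
          return tokens;
      }
      ```

      A token containing a character outside the Basic Multilingual Plane comes back with both of its code units intact, even though the loop never looks at code points.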

  10. Mark Ransom

    The problems with wstring are best dealt with by avoiding them altogether.

    http://www.utf8everywhere.org/

    Of course that’s not practical for many existing projects but it’s something to strive for.
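
    As a rough sketch of the "UTF-8 internally, convert at the API boundary" approach mentioned in the last two comments: the function below is hypothetical and hand-written for illustration, and it assumes well-formed UTF-8 with no validation at all. On Windows the usual route is `MultiByteToWideChar` with `CP_UTF8` rather than code like this.

    ```cpp
    #include <cstddef>
    #include <string>

    // Hypothetical boundary helper: decode UTF-8 and re-encode as UTF-16.
    // Assumes the input is well-formed UTF-8; it performs no validation.
    std::u16string utf8_to_utf16(const std::string& in)
    {
        std::u16string out;
        std::size_t i = 0;
        while (i < in.size()) {
            unsigned char lead = static_cast<unsigned char>(in[i++]);
            char32_t cp;
            int extra; // number of continuation bytes that follow the lead byte
            if (lead < 0x80)                { cp = lead;        extra = 0; }
            else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
            else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
            else                            { cp = lead & 0x07; extra = 3; }
            for (; extra > 0 && i < in.size(); --extra)
                cp = (cp << 6) | (static_cast<unsigned char>(in[i++]) & 0x3F);

            if (cp < 0x10000) {
                out.push_back(static_cast<char16_t>(cp));
            } else {
                // Code points above U+FFFF become a surrogate pair.
                cp -= 0x10000;
                out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
                out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
            }
        }
        return out;
    }
    ```

    The rest of the program can then keep `std::string` everywhere and only produce 16-bit strings right before a call that requires them.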

  11. Nice article. Very well written. Your dedication to detail is admirable. What profiler are you using, if you don’t mind me asking?

  12. I had to compile and deploy ICU so I could get Qt5 with it. It’s quite a big library. I wasn’t expecting Unicode to be such a big problem, but it looks like it is (lots of languages, scripts, etc. – lots of runes/code points, so to speak).

    But ICU also comes with a lot of other things – like text layout, dealing with fonts, calendars, etc. It’s a bit overkill, and not a small library to use.

    • Yes, ICU is really complete – and correct. That’s the reason why I’m using it. You don’t need to link parts you don’t want, of course. Why did I end up there? Well, I found nothing else which was complete and correct. Not kidding. Yes, I was searching the internet ;-)

      • dom0

        libunistring is a much smaller alternative to ICU, but it (of course) doesn’t support as many encodings et cetera as ICU. Still, for, say, 99% of all applications libunistring should suffice.

  13. That’s what programming has become in modern times: you invoke APIs, which are opaque to you and themselves invoke older APIs, themselves opaque to the programmers who used them, and so on… In the end, all gets released under the pressure of urgency.

    My rule is: for functionalities that are intensively used, in the middle of innermost loops, do not trust anybody but yourself.

  14. Thank God I use Java :-)

  15. ?? glad i stuck to c.

  16. opwernby

    static bool iswhite(int ch)
    {
    return ch < _L(' ');
    }

    …because you pretty much don't want any other control characters in there either.

  17. opwernby

    Sorry – that should’ve said “<=”

    • As I’ve said in some other replies, that’s not a good idea. This is a structured text format that gets parsed, and it loads data that I don’t control. As a general rule, when working on parsers, you shouldn’t ever change the accepted input language without good reason.
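
      To show what "changing the accepted input language" means here, this small sketch compares the two predicates. The function names are made up for the comparison; the three-character definition is the one the original code is described as using earlier in this thread, and the shortcut is the suggestion above with the "<=" correction applied:

      ```cpp
      #include <vector>

      // White space as the existing format defines it: tab, newline, space.
      static bool iswhite_original(int ch)
      {
          return ch == '\t' || ch == '\n' || ch == ' ';
      }

      // The suggested shortcut: everything up to and including space.
      static bool iswhite_shortcut(int ch)
      {
          return ch <= ' ';
      }

      // Lists the non-negative character values that the shortcut treats as
      // white space although the original format does not.
      std::vector<int> newly_accepted()
      {
          std::vector<int> result;
          for (int ch = 0; ch <= ' '; ++ch)
              if (iswhite_shortcut(ch) && !iswhite_original(ch))
                  result.push_back(ch);
          return result;
      }
      ```

      Every value in that list, including NUL and carriage return, is a character that would be skipped by the new parser but treated as token content by the old one, so existing data files could be read differently.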

  18. ChadF

    And some non-native English-speaking people wonder why all software/source code doesn’t just use Unicode everywhere. ;)

    Yes.. if your program is expected to be used for more than just data of your own [human] language, then make sure you have good text support (or use a [programming] language that natively supports it). But, in general, for my own needs, good ‘ol ASCII (well, 8-bit raw) is good enough for me! =)

  19. Very good info. Lucky me I found your website by chance (stumbleupon). I have saved it as a favorite for later!

  20. When I see a post about string programming, I immediately search for emoji. No results… You’re yet to experience a new level of hell.

    Seriously though, great post.

    • Emoji aren’t a string processing problem. They’re a text rendering problem, which is a different beast entirely.

      • Well, no. Not only do some emoji not fit into a single UTF-16 code unit and require a surrogate pair, some emoji don’t even fit into a single Unicode code point and are represented as 2 (TWO) different code points, one of which can be perfectly valid on its own. So, when I had the interesting experience of fixing the string processing in C# (which uses UTF-16 by default), I ended up writing my own string container class to deal with all the insanity.
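
        The layering described in this reply (code units, code points, and what a user sees as one character) can be sketched as follows. The helper `count_code_points` is hypothetical and assumes well-formed UTF-16; the two sample sequences, a regional-indicator flag and a digit followed by U+20E3 (combining enclosing keycap), are chosen here as examples and are not taken from the comment:

        ```cpp
        #include <cstddef>
        #include <string>

        // Counts code points in a UTF-16 string, assuming well-formed input.
        // Low (trailing) surrogates 0xDC00..0xDFFF belong to the code point
        // started by the preceding high surrogate, so they are not counted.
        std::size_t count_code_points(const std::u16string& s)
        {
            std::size_t n = 0;
            for (char16_t ch : s) {
                if (ch < 0xDC00 || ch > 0xDFFF)
                    ++n;
            }
            return n;
        }
        ```

        For the flag sequence U+1F1E9 U+1F1EA, `size()` reports 4 code units and this function reports 2 code points, yet it is typically displayed as a single symbol. For "1" followed by U+20E3 both counts are 2, and the leading "1" is a perfectly valid character on its own. Grouping code points into user-perceived characters needs the Unicode segmentation rules, which this sketch does not attempt.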

