Unicode is a real bear to work with. I found that out when I wrote the Unicode translator front end. I originally wanted it to translate a file of any size the usual way: break the input into manageable blocks and process each block. Breaking up the stream turned out to be way more complicated than I expected.
You Can't Just Break the Stream Anywhere
It's a bear to find a place in the stream where you can break. In UTF-16 you can't just break at any arbitrary 16-bit word boundary. There are "Surrogate Pairs", two 16-bit words (a high surrogate in the range D800-DBFF followed by a low surrogate in the range DC00-DFFF) that together encode a single character, and you don't want to break a pair in half. And that's just the beginning.
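As a rough illustration of the surrogate pair problem, here is a minimal Python sketch that takes a list of 16-bit code units and a proposed break position, and backs up by one unit if the break would land in the middle of a pair. The function names and the sample data are made up for this example; this is not the code from the translator front end.

```python
def is_high_surrogate(unit):
    # High (leading) surrogates occupy D800..DBFF.
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit):
    # Low (trailing) surrogates occupy DC00..DFFF.
    return 0xDC00 <= unit <= 0xDFFF


def safe_utf16_break(units, index):
    """Return a break position <= index that does not split a surrogate pair.

    A break at `index` means units[:index] goes in one block and
    units[index:] in the next.
    """
    if (0 < index < len(units)
            and is_high_surrogate(units[index - 1])
            and is_low_surrogate(units[index])):
        return index - 1
    return index


# 'a', then U+1D11E (musical G clef) as the pair D834 DD1E, then 'b'.
sample_units = [0x0061, 0xD834, 0xDD1E, 0x0062]
```

With this sample, a proposed break at position 2 falls between the two halves of the pair, so the function moves it back to position 1. Breaks at positions 1 and 3 are left alone.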
There are also Composite Characters (Unicode calls the add-on pieces combining characters), where you have a base character followed by one or more combining marks which go together with it. E.g. you could have the letter A, followed by a combining tilde to put on top of the A. And that can be followed by an unlimited number of additional fancy things to add to the letter A.
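A small Python sketch of the same idea for combining marks: if a proposed break would land just before a mark, move it left until it sits in front of the base character. It uses the standard `unicodedata.combining()` function, which returns a non-zero canonical combining class for marks such as the combining tilde. That test is a simplification I am assuming for the example. It does not cover everything the full Unicode grapheme rules treat as part of a cluster (joiners and variation selectors, for instance, have class 0). The function name is hypothetical.

```python
import unicodedata


def back_up_over_combining_marks(text, index):
    """Move a proposed break left so it does not separate a base
    character from the combining marks that follow it.

    Python strings are indexed by code point, so surrogate pairs are
    not a concern at this level.
    """
    while 0 < index < len(text) and unicodedata.combining(text[index]) != 0:
        index -= 1
    return index


# 'x', then 'A' + combining tilde (U+0303) + combining acute (U+0301), then 'y'.
sample_text = "xA\u0303\u0301y"

# The base letter and the tilde really are one character to the reader:
# NFC normalization folds them into the single code point U+00C3.
composed = unicodedata.normalize("NFC", "A\u0303")
```

For the sample string, proposed breaks at positions 2 and 3 both move back to position 1, just before the A. A break at position 4, before the y, is left where it is.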
Illegal Degenerate Streams
Then there's the possibility of an illegal sequence of 16-bit words, such as a surrogate with no partner, and other degenerate possibilities, such as nothing but an endless stream of composite additions with no base character to begin it all.
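To make the two failure cases concrete, here is a hedged Python sketch with two checks: one that reports unpaired surrogates in a list of 16-bit code units, and one that counts how many combining marks a string starts with before any base character shows up. Both function names are invented for the example, and the mark test again leans on `unicodedata.combining()` as a simplification.

```python
import unicodedata


def find_lone_surrogates(units):
    """Return the positions of 16-bit units that are unpaired surrogates."""
    bad = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed directly by a low one.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append(i)
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate that was not consumed as part of a pair.
            bad.append(i)
        i += 1
    return bad


def count_leading_marks(text):
    """Count combining marks at the start of text, before any base character."""
    count = 0
    for ch in text:
        if unicodedata.combining(ch) == 0:
            break
        count += 1
    return count
```

A caller could refuse the input, or substitute a replacement character, whenever `find_lone_surrogates` returns a non-empty list. What to do about a stream that opens with marks and no base is a policy decision this sketch does not make.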
Grapheme Break
To break up the stream, assuming the stream itself isn't an illegal degenerate sequence, you have to find a Grapheme Break (a boundary between two user-perceived characters). I started writing code to do that, pushed and pushed to get something that worked, and eventually just threw it all away. That's why the Unicode front end just does it all at once, and if the input file is too big, too bad.
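For a sense of what block splitting looks like in the simplest case, here is a Python sketch that cuts a string into blocks of at most `block_size` code points and pulls each cut back so a block does not begin with a combining mark. It is not the code that was thrown away, and it is not a real grapheme break finder: it only looks at combining marks via `unicodedata.combining()`, and it gives up and cuts anyway when a whole block is nothing but marks hanging off one base. The name `split_into_blocks` and the forced-cut policy are assumptions made for the example.

```python
import unicodedata


def split_into_blocks(text, block_size):
    """Yield consecutive pieces of text, each at most block_size code
    points long, avoiding cuts between a base character and its marks."""
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    start = 0
    while start < len(text):
        end = min(start + block_size, len(text))
        cut = end
        # Pull the cut left while the next block would start with a mark.
        while start < cut < len(text) and unicodedata.combining(text[cut]) != 0:
            cut -= 1
        if cut == start:
            # Degenerate input: no safe cut inside this block, so force
            # one at the size limit rather than loop forever.
            cut = end
        yield text[start:cut]
        start = cut


blocks = list(split_into_blocks("abA\u0303cd", 3))
```

With a block size of 3, the sample string comes out as three blocks: "ab", then the A with its tilde followed by "c", then "d". The first block is cut short so the A and its tilde stay together. Joining the blocks always gives back the original string.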
Unicode Core Package
Someone else must have made a code package that handles all these unicode problems. There are numerous editors in the free Linux world which support unicode to some degree.
Last September I finally found such a package. I think it is free even for commercial use, but you should check that yourself:
The package is ICU4J version 4.0 released July 2, 2008.
http://icu-project.org/
See ICU User Guide:
http://download.icu-project.org/files/i ... rguide.zip
(J is the Java version, which I think in the long run will be less troublesome than the corresponding C version.)
UTF-8 or UTF-16?
Most Linux programs use the UTF-8 format for Unicode. Windows uses UTF-16. I'm leaning towards UTF-8 as having the advantage. For mostly-ASCII text it saves memory internally, which could be a big performance boost: fewer page faults and better speed, even with the extra conversion from UTF-8 to UTF-16 whenever you want to call a Windows system routine. Overall it could be much faster. [I'm not sure if it also gets rid of the problem of having a stream riddled with NULL bytes. I'll have to recheck that.
03/27/2009: Yes, there are no NULL bytes in UTF-8 (with the single exception of the NULL character itself). UTF-8 byte values 0 - 127 are still the standard ASCII characters.] Choosing between UTF-8 and UTF-16 is probably going to be the biggest design decision to make, and it will affect the long-term future of Multi-Edit.
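The size and NULL byte points are easy to check with Python's built-in codecs. This sketch encodes a few sample strings both ways and counts bytes and zero bytes. The sample strings and the helper name `encoding_stats` are my own choices for illustration.

```python
def encoding_stats(text):
    """Return (utf8_size, utf8_zero_bytes, utf16_size, utf16_zero_bytes).

    UTF-16 is encoded little-endian without a byte order mark so the
    sizes reflect the text alone.
    """
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")
    return (len(utf8), utf8.count(0), len(utf16), utf16.count(0))


ascii_stats = encoding_stats("Hello")        # plain ASCII text
nul_stats = encoding_stats("a\x00b")         # contains the NULL character itself
cjk_stats = encoding_stats("\u4e2d\u6587")   # two CJK ideographs
```

For "Hello", UTF-8 takes 5 bytes with no zero bytes, while UTF-16 takes 10 bytes, 5 of them zero. The only way to get a zero byte into UTF-8 is to encode the NULL character itself, as the second sample shows. The third sample is the other side of the trade: each of those ideographs takes 3 bytes in UTF-8 but 2 in UTF-16, so the memory saving depends on the text being mostly ASCII.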