I remember a big thread discussing the challenges of implementing Unicode support several years ago.
http://forums.multiedit.com/viewtopic.php?f=4&t=1717
Part of the problem is that the developer who was discussing Unicode support seemed to be a bit of a perfectionist, which I understand entirely, and didn't want to implement a half-assed hack, which some other editors have done.
IIRC, ME uses a fixed-column model in which characters are directly and quickly accessible by column position, e.g. the 175th character. With Unicode, a single displayed character can be a base code point followed by any number of combining marks, so it has no fixed size in bytes and a simple array of quickly-addressable characters is not possible. My initial suggestion was to store each character as a struct that contains the basic 32-bit Unicode code point itself plus a pointer to a linked list of Combining Diacritical Marks (CDMs), if there are any. That would be quite wasteful, as the vast majority of characters would not have any modifiers. An alternative would be for each line to have a pointer to a linked list of the character positions that have CDMs, e.g. if the 17th and 193rd characters have CDMs, then the list would contain 17 plus a pointer to the list of CDMs for character 17, followed by 193 plus a pointer to character 193's CDMs. This list would need to be updated any time a character was inserted or deleted in a line that had characters with CDMs. So it's a classic space/time tradeoff. They've probably settled on a solution by now.
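
To make the second idea a bit more concrete, here is a rough Python sketch of it. This is purely my own illustration and has nothing to do with ME's actual code: the `Line` class and its methods are made-up names, and I've used a dict keyed by column in place of the linked list, just to keep it short. It also treats any code point in Unicode category Mn or Me as a combining mark, which is a simplification of the real grapheme rules.

```python
import unicodedata


class Line:
    """One line under a fixed-column model (hypothetical sketch)."""

    def __init__(self, text=""):
        self.base = []   # one base code point per column, O(1) indexing
        self.marks = {}  # sparse side table: column -> list of combining marks
        for ch in text:
            # Simplification: Mn/Me code points attach to the previous column.
            is_mark = unicodedata.category(ch) in ("Mn", "Me")
            if is_mark and self.base:
                self.marks.setdefault(len(self.base) - 1, []).append(ch)
            else:
                self.base.append(ch)

    def char_at(self, col):
        """Return the base character at col plus any marks attached to it."""
        return self.base[col] + "".join(self.marks.get(col, ()))

    def _shift(self, start, delta):
        # The maintenance cost: renumber every side-table entry at or
        # after the edit point whenever the line changes.
        self.marks = {
            (c + delta if c >= start else c): m for c, m in self.marks.items()
        }

    def insert(self, col, ch):
        """Insert a base character at col."""
        self.base.insert(col, ch)
        self._shift(col, 1)

    def delete(self, col):
        """Delete the character at col together with its marks."""
        del self.base[col]
        self.marks.pop(col, None)
        self._shift(col + 1, -1)

    def text(self):
        return "".join(self.char_at(c) for c in range(len(self.base)))


line = Line("e\u0301a")  # 'e' + COMBINING ACUTE ACCENT, then 'a'
line.insert(0, "x")      # the mark follows 'e' from column 0 to column 1
```

Lines with no CDMs pay nothing beyond an empty table, and column lookup stays a plain array index. The price is the renumbering in `_shift` on every insert or delete in a line that does have CDMs, which is the time half of the tradeoff.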