Although it still needs many more tests, I have successfully implemented a library that can understand and manipulate strings in the following Unicode encodings:
- UTF-8;
- UTF-16LE;
- UTF-16BE;
- UTF-32LE;
- UTF-32BE.
By implementing the encodings above, I will automatically be able to deal with other, simpler standards, such as:
- Latin-1;
- UCS-2;
- UCS-4.
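To illustrate why these simpler standards come almost for free, here is a minimal Python sketch. It uses Python's built-in codecs rather than my library, and the helper names (`latin1_code_points`, `ucs2_be_code_points`, `ucs4_be_code_points`) are hypothetical, made up for this example. It assumes big-endian input with no byte order mark.

```python
# Illustrative sketch only: helper names are hypothetical, not the library's API.

def latin1_code_points(data):
    # Latin-1 stores one character per byte, and the byte value is the code point.
    return list(data)

def ucs2_be_code_points(data):
    # UCS-2 uses one big-endian 16-bit unit per character (no surrogate pairs).
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

def ucs4_be_code_points(data):
    # UCS-4 uses one big-endian 32-bit unit per character, like UTF-32BE.
    return [int.from_bytes(data[i:i + 4], "big") for i in range(0, len(data), 4)]

text = "caf\u00e9"                      # "café": every character is below U+0100
expected = [ord(ch) for ch in text]     # the code points Python itself reports

# The same code points come out of all three fixed-width layouts.
assert latin1_code_points(text.encode("latin-1")) == expected
assert ucs2_be_code_points(text.encode("utf-16-be")) == expected
assert ucs4_be_code_points(text.encode("utf-32-be")) == expected
```

For text that stays inside the Basic Multilingual Plane, the UTF-16 bytes are exactly what UCS-2 would contain, which is why a UTF-16 decoder can read UCS-2 data without changes.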
But what does it all mean? It means I will understand the binary encoding of Unicode strings, or, in other words, I will know which code points a string represents. Code points are unique numbers that identify characters or symbols.
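As a rough sketch of what "knowing the code points" involves, the Python function below turns UTF-8 bytes into a list of code points. It is not the code from my library: the name `utf8_code_points` is hypothetical, and validation is deliberately light (it assumes reasonably well-formed input and does not reject overlong forms or encoded surrogates).

```python
def utf8_code_points(data):
    """Decode UTF-8 bytes into a list of code points (minimal sketch)."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte tells us how many continuation bytes follow.
        if lead < 0x80:                 # 0xxxxxxx: single byte (ASCII)
            value, extra = lead, 0
        elif lead >> 5 == 0b110:        # 110xxxxx: one continuation byte
            value, extra = lead & 0x1F, 1
        elif lead >> 4 == 0b1110:       # 1110xxxx: two continuation bytes
            value, extra = lead & 0x0F, 2
        elif lead >> 3 == 0b11110:      # 11110xxx: three continuation bytes
            value, extra = lead & 0x07, 3
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + extra >= len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        for j in range(1, extra + 1):
            cont = data[i + j]
            if cont >> 6 != 0b10:       # continuation bytes look like 10xxxxxx
                raise ValueError(f"invalid continuation byte at offset {i + j}")
            value = (value << 6) | (cont & 0x3F)  # append the 6 payload bits
        code_points.append(value)
        i += extra + 1
    return code_points


# "A", "é", "€" and an emoji take 1, 2, 3 and 4 bytes respectively.
sample = "A\u00e9\u20ac\U0001F600".encode("utf-8")
print([hex(cp) for cp in utf8_code_points(sample)])
# prints ['0x41', '0xe9', '0x20ac', '0x1f600']
```

The result is just a list of numbers: it says which characters the string contains, but nothing about how they are drawn.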
That's pretty nice, but it's only half the way. Once I have the code points of a string, I know for sure what it means, but I don't know what the corresponding characters or symbols look like. A complete solution is still a long, hard way off.