How to deal with unicode in strings?
#1
I'm trying to write macros that deal with Unicode characters, and I noticed that the Unicode is lost very quickly:
Code:
str s.getclip;
paste s; 
str b=s;b=b.left(b 1);
paste b;

The first paste is fine, but b hasn't survived. Do I need a different declaration to enable Unicode for b, or what is the problem?
#2
QM strings are encoded as UTF-8, which means characters have variable length: some are 1 byte, others 2, 3 or (rarely) 4 bytes.
When you know the byte length of a character, simply use it in code.

Macro Macro3017
Code:
str s="ąbc" ;;first character is 2 bytes
str bad.left(s 1) ;;gets half of character
out bad
str good.left(s 2)
out good

In other cases you usually use find, findrx or a similar function to locate a substring, and it gives the correct result because the offsets it returns fall on character boundaries.
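To make the byte counting concrete, here is a small sketch in Python (not QM, purely as an illustration). It reads the length of each UTF-8 character from its first byte, so a "left" operation can count characters without cutting one in half. The function names utf8_char_len and utf8_left are made up for this example and are not part of QM.

```python
# Illustration only: Python, not QM. All names here are hypothetical.

def utf8_char_len(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 character occupies, judged from its first byte."""
    if lead_byte < 0x80:
        return 1                      # 0xxxxxxx: ASCII
    if lead_byte >> 5 == 0b110:
        return 2                      # 110xxxxx
    if lead_byte >> 4 == 0b1110:
        return 3                      # 1110xxxx
    if lead_byte >> 3 == 0b11110:
        return 4                      # 11110xxx
    raise ValueError("not a UTF-8 lead byte")


def utf8_left(data: bytes, nchars: int) -> bytes:
    """Return the first nchars characters of UTF-8 data without splitting any."""
    pos = 0
    for _ in range(nchars):
        if pos >= len(data):
            break                     # fewer than nchars characters available
        pos += utf8_char_len(data[pos])
    return data[:pos]


s = "ąbc".encode("utf-8")             # 4 bytes: 'ą' takes 2, 'b' and 'c' take 1 each
half = s[:1]                          # cuts 'ą' in half, like left(s 1) in the macro above
whole = utf8_left(s, 1)               # the complete 2-byte character
print(len(s), half, whole.decode("utf-8"))
```

The sketch assumes the input is valid UTF-8 and that the walk starts on a character boundary.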
#3
A couple more options.
#1: You can use this member function:

Member function str.getU2
Code:
function$ $sinp from nc

;Unicode version of the "get" function; with from=0 it also works like "left".

;Error if from invalid.
;If nc < 0 or too big, gets all right part.

str s.unicode(sinp) ;;convert to UTF-16
from*2; nc*2
if(from<0 or from>s.len) end ERR_BADARG
if(nc<0 or from+nc>s.len) nc=s.len-from
this.ansi(s+from -1 nc/2)
ret this
For your example, you would use it like this:
Function UnicodeTest1
Code:
str s.getclip;
paste s
str b.getU2(s 0 1);; gets first character
paste b

#ret;;for testing place cursor on line below and run

#2: You can use paste with format fields:
Function UnicodeTest2
Code:
str s.getclip;
paste s;
paste("%#.1s" s);; paste first character of string

#ret;;for testing place cursor on line below and run
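For readers who do not use QM, the idea behind getU2 can be sketched in Python: convert the text to UTF-16, slice by 16-bit units (2 bytes each), and convert back. This is only an approximation of the approach, not a translation of the QM function. The name get_u2 and its error handling are assumptions made for this example.

```python
# Illustration only: Python, not QM. get_u2 is a hypothetical analogue of getU2.

def get_u2(text: str, start: int, count: int) -> str:
    """Return count characters from start, counted in UTF-16 code units."""
    units = text.encode("utf-16-le")       # 2 bytes per UTF-16 code unit, no BOM
    total = len(units) // 2
    if start < 0 or start > total:
        raise ValueError("start is out of range")
    if count < 0 or start + count > total:
        count = total - start              # negative or too big: take the whole right part
    chunk = units[start * 2:(start + count) * 2]
    return chunk.decode("utf-16-le")


print(get_u2("ąbc", 0, 1))                 # first character
print(get_u2("ąbc", 1, -1))                # everything after the first character
```

Because the slice is counted in 16-bit units, a character outside the Basic Multilingual Plane counts as two units. Slicing between its two halves makes the decode step raise an error in this sketch.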
#4
Thanks Kevin, very helpful!

Now, just for my understanding: your code seems to assume that the string consists of UTF-16 characters. What if it were UTF-32? I guess then it wouldn't work. Is there no way to determine the number of bytes a character uses?
#5
As in UTF-8, characters in UTF-16 have variable length: some consist of two 16-bit units (a surrogate pair) instead of one. But they are very rare, used for ancient scripts etc. When working with ordinary text, it is safe to ignore them.
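As a rough illustration in Python (not QM), the following sketch prints how many units the same character needs in each encoding. The helper names utf16_units and is_high_surrogate are made up for this example. A UTF-16 unit in the range 0xD800 to 0xDBFF is the first half of a two-unit character, and UTF-32 always uses 4 bytes per character.

```python
# Illustration only: Python, not QM. Helper names are hypothetical.

def utf16_units(ch: str) -> int:
    """Number of 16-bit code units one character needs in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


def is_high_surrogate(unit: int) -> bool:
    """True if a 16-bit unit is the first half of a surrogate pair."""
    return 0xD800 <= unit <= 0xDBFF


# U+10348 (GOTHIC LETTER HWAIR) lies outside the Basic Multilingual Plane.
for ch in ("a", "ą", "€", "\U00010348"):
    utf8_bytes = len(ch.encode("utf-8"))
    utf32_bytes = len(ch.encode("utf-32-le"))
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_bytes} byte(s), "
          f"UTF-16 {utf16_units(ch)} unit(s), UTF-32 {utf32_bytes} bytes")
```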

