now that strings can contain UTF-8 codepoints and null chars, the b_dict
api has been enhanced to accept keys as b_strings as well as regular c-strings.
keys are now stored as b_strings internally, to allow a wider range of
keys to be used.
rather than always interpreting a b_hashmap_key as a buffer to be hashed,
b_hashmap can now be told to consider the value of the key_data pointer itself
as the key, treating it as a buffer of size sizeof(void*).
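the two key modes described above can be sketched roughly as follows. this is a minimal illustration, not the real b_hashmap code: the example_key struct, its key_is_pointer flag, and the FNV-1a hash are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* hypothetical key descriptor; NOT the real b_hashmap_key layout */
struct example_key {
    const void *key_data;
    size_t key_size;
    int key_is_pointer; /* non-zero: the pointer value itself is the key */
};

/* FNV-1a over a byte buffer (chosen here only because it is short) */
static uint64_t example_hash_bytes(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

static uint64_t example_hash_key(const struct example_key *key)
{
    if (key->key_is_pointer) {
        /* hash the bytes of the pointer variable: a buffer of sizeof(void*) */
        return example_hash_bytes(&key->key_data, sizeof key->key_data);
    }
    /* default: hash the buffer that key_data points to */
    return example_hash_bytes(key->key_data, key->key_size);
}
```

in buffer mode, two separate buffers with the same contents hash identically. in pointer mode, only the address matters, so the pointed-to memory is never read.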
b_string now uses UTF-8 internally, and can correctly manipulate strings
that contain non-ASCII and multi-byte codepoints.
b_string now tracks the length of a string in both bytes and Unicode codepoints.
string insertion functions have been updated to correctly handle strings with
multi-byte codepoints, so the index parameter of each function now refers to codepoints
rather than bytes. inserting single-byte chars into a string with no multi-byte codepoints
is still optimised to use array indexing and memmove.
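as a rough sketch of why both lengths are tracked: a codepoint index has to be translated to a byte offset before inserting, and that translation collapses to plain indexing when the two lengths are equal. the function names below are hypothetical and the input is assumed to be valid UTF-8; this is not the b_string implementation.

```c
#include <stddef.h>

/* a byte starts a codepoint unless it is a continuation byte (10xxxxxx) */
static int example_is_lead_byte(unsigned char c)
{
    return (c & 0xC0) != 0x80;
}

/* count the codepoints in a buffer of len bytes */
static size_t example_codepoint_count(const char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (example_is_lead_byte((unsigned char)s[i])) {
            n++;
        }
    }
    return n;
}

/* translate a codepoint index into a byte offset.
 * returns len if the index is at or past the end of the string. */
static size_t example_codepoint_to_byte(const char *s, size_t len,
                                        size_t n_codepoints, size_t index)
{
    if (n_codepoints == len) {
        /* fast path: no multi-byte codepoints, so index == byte offset */
        return index < len ? index : len;
    }
    size_t seen = 0;
    for (size_t i = 0; i < len; i++) {
        if (example_is_lead_byte((unsigned char)s[i])) {
            if (seen == index) {
                return i;
            }
            seen++;
        }
    }
    return len;
}
```

once the byte offset is known, the insertion itself can be done with memmove exactly as in the single-byte case.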
a b_string_iterator has been added to simplify iterating through a UTF-8 string, without
having to use a charAt()-style interface that would incur performance penalties.
strings can now also contain null bytes.
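a minimal sketch of the iterator idea, assuming a hypothetical example_utf8_iter struct rather than the real b_string_iterator: the iterator remembers its byte offset, so each step decodes one codepoint in place instead of rescanning from the start as a charAt()-style lookup would. it is bounded by an explicit length, so null bytes are treated as ordinary codepoints.

```c
#include <stddef.h>
#include <stdint.h>

/* hypothetical iterator state; NOT the real b_string_iterator */
struct example_utf8_iter {
    const unsigned char *data;
    size_t len; /* total length in bytes */
    size_t pos; /* byte offset of the next codepoint */
};

/* stores the next codepoint in *out and returns 1.
 * returns 0 at the end of the string or on malformed input.
 * note: overlong encodings and surrogates are not rejected here. */
static int example_utf8_next(struct example_utf8_iter *it, uint32_t *out)
{
    if (it->pos >= it->len) {
        return 0;
    }
    unsigned char c = it->data[it->pos];
    size_t n;
    uint32_t cp;
    if (c < 0x80)                { n = 1; cp = c; }
    else if ((c & 0xE0) == 0xC0) { n = 2; cp = c & 0x1F; }
    else if ((c & 0xF0) == 0xE0) { n = 3; cp = c & 0x0F; }
    else if ((c & 0xF8) == 0xF0) { n = 4; cp = c & 0x07; }
    else return 0; /* stray continuation byte or invalid lead byte */

    if (n > it->len - it->pos) {
        return 0; /* sequence is truncated */
    }
    for (size_t i = 1; i < n; i++) {
        unsigned char cc = it->data[it->pos + i];
        if ((cc & 0xC0) != 0x80) {
            return 0;
        }
        cp = (cp << 6) | (cc & 0x3F);
    }
    it->pos += n;
    *out = cp;
    return 1;
}
```

walking a whole string this way is linear in its byte length, whereas calling an index-based accessor for every codepoint would be quadratic for strings with multi-byte codepoints.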
new functions include:
- b_string_tokenise: a b_iterator interface for iterating through tokens
in a string. similar to strtok except that:
* it is re-entrant, and uses no global state.
* it supports delimiters that are longer than one character and/or contain
multi-byte UTF-8 codepoints.
* it doesn't modify the string that is being iterated over.
* it correctly handles strings with multi-byte UTF-8 codepoints and null chars.
- b_string_compare: for comparing strings. this must be used rather than strcmp,
as b_strings can now contain null chars.
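the properties listed for the two functions above can be illustrated with a small sketch. none of this is the real b_string API: example_compare and example_next_token are hypothetical stand-ins that work on (pointer, length) pairs, and skipping empty tokens is an assumption borrowed from strtok's behaviour.

```c
#include <stddef.h>
#include <string.h>

/* length-aware comparison: unlike strcmp, it does not stop at a null byte */
static int example_compare(const char *a, size_t a_len,
                           const char *b, size_t b_len)
{
    size_t n = a_len < b_len ? a_len : b_len;
    int r = n ? memcmp(a, b, n) : 0;
    if (r != 0) {
        return r;
    }
    return (a_len > b_len) - (a_len < b_len);
}

/* all state lives in the caller's struct, so this is re-entrant */
struct example_tokeniser {
    const char *src;   size_t src_len;
    const char *delim; size_t delim_len; /* may be several bytes long */
    size_t pos; /* byte offset where the next scan starts */
    int done;
};

/* yields the next non-empty token as a pointer + length into src.
 * src is never modified. returns 0 when there are no more tokens. */
static int example_next_token(struct example_tokeniser *t,
                              const char **tok, size_t *tok_len)
{
    while (!t->done) {
        size_t start = t->pos, end = start;
        /* advance until the delimiter matches or no longer fits */
        while (t->delim_len > 0 && end + t->delim_len <= t->src_len
               && memcmp(t->src + end, t->delim, t->delim_len) != 0) {
            end++;
        }
        if (t->delim_len == 0 || end + t->delim_len > t->src_len) {
            end = t->src_len; /* no delimiter left: rest is the last token */
            t->done = 1;
        } else {
            t->pos = end + t->delim_len;
        }
        if (end > start) {
            *tok = t->src + start;
            *tok_len = end - start;
            return 1;
        }
        /* empty token between adjacent delimiters: keep scanning */
    }
    return 0;
}
```

matching the delimiter byte-by-byte is safe for valid UTF-8 because the encoding is self-synchronising: a delimiter that starts with a lead byte can only match at a codepoint boundary.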
the cursor can only be moved during uncompressed i/o. any read/write operations are performed directly on the underlying endpoint with no buffering, and don't count towards the transacted byte statistics.
the cursor can only be moved once, after which its position must be restored.