[Lvlug] UTF Embedding of Eight-Bit Chars
The Artist Formerly Known as Fingolfin
fingolfin at thelinuxlink.net
Wed Jun 23 21:24:52 EDT 2004
On Wed, 23 Jun 2004, Ricardo SIGNES wrote:
> * Chris Hever <fingolfin at thelinuxlink.net> [2004-06-23T21:10:23]
> > Ok. How does it discern 7-bit and multi-byte Unicode characters within
> > any given string/file/whathaveyou? That is, for something like below:
> Easy peasy. Know what a state machine is? That's how most
> parsers/tokenizers work.
> Start out having seen no byte. Look at the next byte. If it's 7-bit,
> that's the character. If it's one of the accepted "first-half" bytes
> for a multibyte Unicode character, it tells you how many more bytes to
> read. It reads that many bytes, composes one character, and then goes
> back to having seen no byte.
> (I don't know what happens if it gets an 8-bit character that isn't an
> accepted UTF-8 multibyte introduction. It's probably an error state or
That's what I was figuring on. So I'm going to assume that, since 7-bit
characters by definition never have their high bit set, all Unicode
'first-half' bytes do. Or something...
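
For what it's worth, here is a rough Python sketch of the state machine
described above, assuming the standard UTF-8 bit layout: 7-bit bytes look
like 0xxxxxxx, "first-half" (lead) bytes start with 11, and the bytes that
follow a lead byte start with 10. So the high bit is set on every byte of a
multibyte character, not only the first. The names expected_length and
decode_utf8 are made up for illustration, and the sketch skips some checks a
real decoder needs (overlong three- and four-byte forms, surrogates).

```python
def expected_length(first_byte):
    """Total bytes in the sequence introduced by first_byte; 0 if it cannot start one."""
    if first_byte < 0x80:             # 0xxxxxxx: plain 7-bit character
        return 1
    if 0xC2 <= first_byte <= 0xDF:    # 110xxxxx: one more byte follows
        return 2
    if 0xE0 <= first_byte <= 0xEF:    # 1110xxxx: two more bytes follow
        return 3
    if 0xF0 <= first_byte <= 0xF4:    # 11110xxx: three more bytes follow
        return 4
    return 0                          # 10xxxxxx or otherwise not a valid lead byte


def decode_utf8(data):
    """Decode bytes to str with an explicit 'seen no byte' / 'need N more' state."""
    chars = []
    need = 0    # continuation bytes still expected; 0 means "seen no byte"
    code = 0    # code point being assembled
    for b in data:
        if need == 0:
            n = expected_length(b)
            if n == 0:
                # The "error state": a high-bit byte that is not a valid lead.
                raise ValueError("invalid lead byte 0x%02X" % b)
            if n == 1:
                chars.append(chr(b))
            else:
                need = n - 1
                code = b & (0xFF >> (n + 1))   # keep only the payload bits
        else:
            if b & 0xC0 != 0x80:
                raise ValueError("expected continuation byte, got 0x%02X" % b)
            code = (code << 6) | (b & 0x3F)    # each continuation carries 6 bits
            need -= 1
            if need == 0:
                chars.append(chr(code))
    if need:
        raise ValueError("input ended in the middle of a character")
    return "".join(chars)
```

Something like decode_utf8("caf\u00e9".encode("utf-8")) should hand back the
original string, while a stray byte such as 0x80 lands in the error branch.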
"...Jews everywhere were showing signs of disturbance, were gathering
together, and giving evidence of great hostility to the Romans, partly
by secret and partly by overt acts."
-- Cassius Dio, Roman History 69.12.1-14.3