[Lvlug] UTF Embedding of Eight-Bit Chars

The Artist Formerly Known as Fingolfin fingolfin at thelinuxlink.net
Wed Jun 23 21:24:52 EDT 2004


On Wed, 23 Jun 2004, Ricardo SIGNES wrote:

> * Chris Hever <fingolfin at thelinuxlink.net> [2004-06-23T21:10:23]
> >
> > Ok. How does it discern 7-bit and multi-byte Unicode characters within
> > any given string/file/whathaveyou? That is, for something like below:
>
> Easy peasy.  Know what a state machine is?  That's how most
> parsers/tokenizers work.
>
> Start out having seen no byte.  Look at the next byte.  If it's 7-bit,
> that's the character.  If it's one of the accepted "first-half" bytes
> for a multibyte Unicode character, it tells you how many more bytes to
> read.  It reads that many bytes, composes one character, and then goes
> back to having seen no byte.
>
> (I don't know what happens if it gets an 8-bit character that isn't an
> accepted UTF-8 multibyte introduction.  It's probably an error state or
> something.)

That's what I was figuring on. So I'm going to assume that, since 7-bit
characters never have their high bit set, all of the Unicode 'first-half'
bytes do. Or something...
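
For what it's worth, here is a rough sketch of the state machine described
above, so the high-bit idea can be checked by hand. It assumes the standard
UTF-8 bit patterns: 0xxxxxxx for a 7-bit character, 110xxxxx / 1110xxxx /
11110xxx for the "first-half" byte of a 2-, 3-, or 4-byte character, and
10xxxxxx for every byte that follows. The name decode_utf8 is made up for
this example, and treating an unexpected 8-bit byte as an error is a choice
made here, not something settled in this thread. It also skips the stricter
checks a real decoder does (overlong forms, surrogates, range limits).

```python
def decode_utf8(data: bytes) -> list:
    """Return the code points in data, raising ValueError on bad input.

    Hypothetical helper for illustration only; not a full validator.
    """
    code_points = []
    remaining = 0  # continuation bytes still expected; 0 = "seen no byte"
    value = 0      # bits collected so far for the current character

    for byte in data:
        if remaining == 0:
            if byte < 0x80:               # 0xxxxxxx: plain 7-bit character
                code_points.append(byte)
            elif byte >> 5 == 0b110:      # 110xxxxx: one more byte follows
                remaining, value = 1, byte & 0x1F
            elif byte >> 4 == 0b1110:     # 1110xxxx: two more bytes follow
                remaining, value = 2, byte & 0x0F
            elif byte >> 3 == 0b11110:    # 11110xxx: three more bytes follow
                remaining, value = 3, byte & 0x07
            else:
                # High bit set, but not a valid introduction byte.
                raise ValueError("unexpected byte 0x%02x" % byte)
        else:
            if byte >> 6 != 0b10:         # must look like 10xxxxxx
                raise ValueError("bad continuation byte 0x%02x" % byte)
            value = (value << 6) | (byte & 0x3F)
            remaining -= 1
            if remaining == 0:            # character complete, reset state
                code_points.append(value)

    if remaining:
        raise ValueError("input ended in the middle of a character")
    return code_points
```

With this sketch, decode_utf8("é".encode("utf-8")) gives [233], and a lone
b"\x80" raises ValueError, since 10xxxxxx is only valid after a first-half
byte. So the high bit is set on the first-half bytes and on the bytes that
follow them; it's the bits after it that tell the two apart.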

-- 
"...Jews everywhere were showing signs of disturbance, were gathering
together, and giving evidence of great hostility to the Romans, partly
by secret and partly by overt acts."

  -- Cassius Dio, Roman History 69.12.1-14.3

