Operative temperature: Difference between revisions

From formulasearchengine
en>Gamewizard71
 
en>Addbot
m Bot: Migrating 2 interwiki links, now provided by Wikidata on d:q3517831
{{redirect|BOCU}}
'''Binary Ordered Compression for Unicode''' ('''BOCU''') is a [[MIME]]-compatible Unicode compression scheme. BOCU-1 combines the wide applicability of [[UTF-8]] with the compactness of the [[Standard Compression Scheme for Unicode]] (SCSU). This [[Unicode]] [[character encoding|encoding]] is designed to be useful for compressing short strings, and it maintains code point order. BOCU-1 is specified in a Unicode Technical Note.<ref>{{cite web |url=http://www.unicode.org/notes/tn6/#Introduction |title=UTN #6: BOCU-1|date=2006-02-04 |author=Markus Scherer, [[Mark Davis (Unicode)|Mark Davis]] |accessdate=2008-05-18}}</ref>
By comparison, SCSU was adopted as the standard Unicode compression scheme, with a byte/code point ratio similar to that of language-specific [[code page]]s. SCSU has not been widely adopted, however, because it is not suitable for MIME “text” media types: for example, it cannot be used directly in emails and similar protocols. SCSU also requires a complicated encoder design for good performance. For larger amounts of Unicode text, [[ZIP file format|zip]], [[bzip2]], and other industry-standard algorithms usually compress more efficiently.<ref>{{cite web |url=http://unicode.org/notes/tn14 |title=UTN #14: A survey of Unicode compression |date=2004-01-30 |first=Doug |last=Ewell |accessdate=2008-06-13 |format=PDF }}</ref>


Both SCSU<ref>[http://www.iana.org/assignments/charset-reg/SCSU IANA registration record for SCSU]</ref> and BOCU-1<ref>[http://www.iana.org/assignments/charset-reg/BOCU-1 IANA registration record for BOCU-1]</ref> are [[Internet Assigned Numbers Authority|IANA]] registered charsets.


== Details ==
 
All numbers in this section are [[hexadecimal]], and all ranges are inclusive.
 
Code points from <code>U+0000</code> to <code>U+0020</code> are encoded in BOCU-1 as the corresponding byte value. All other code points (that is, <code>U+0021</code> through <code>U+D7FF</code> and <code>U+E000</code> through <code>U+10FFFF</code>) are encoded as a difference between the code point and a normalized version of the most recently encoded code point that was not an ASCII space (<code>U+0020</code>). The initial state is <code>U+0040</code>. The normalization mapping is as follows:
 
{| class="wikitable"
! style="width: auto;" | Code range
! style="width: auto;" | Normalized code point
! style="width: auto;" | Notes
|-
| <code>U+3040</code> to <code>U+309F</code>
| <code>U+3070</code>
| [[Hiragana]]
|-
| <code>U+4E00</code> to <code>U+9FA5</code>
| <code>U+7711</code>
| [[Unihan]]
|-
| <code>U+AC00</code> to <code>U+D7A3</code>
| <code>U+C1D1</code>
| [[Hangul]]
|-
| <code>U+0020</code>
! <small>encoder state kept as is</small>
| Space
|-
| <code>U+''hhhh''00</code> to <code>U+''hhhh''7F</code><br /><small>(excluding ranges above)</small>
| <code>U+''hhhh''40</code>
| middle<br />of 128
|-
| <code>U+''hhhh''80</code> to <code>U+''hhhh''FF</code><br /><small>(excluding ranges above)</small>
| <code>U+''hhhh''C0</code>
| middle<br />of 128
|}
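
The normalization table can be written as a small function. The following is a minimal sketch rather than the reference implementation; the name <code>normalize</code> is an assumption made for illustration, and the caller is assumed to skip the call for <code>U+0020</code> so that the state is kept as is.

```python
def normalize(c: int) -> int:
    """Return the encoder state that follows code point c (c must not be U+0020)."""
    if 0x3040 <= c <= 0x309F:      # Hiragana block
        return 0x3070
    if 0x4E00 <= c <= 0x9FA5:      # Unihan
        return 0x7711
    if 0xAC00 <= c <= 0xD7A3:      # Hangul
        return 0xC1D1
    # All other code points: middle of the aligned block of 128
    return (c & ~0x7F) + 0x40


print(hex(normalize(0x0041)))  # 0x40
print(hex(normalize(0x00E9)))  # 0xc0
print(hex(normalize(0x3042)))  # 0x3070
```

Keeping the state in the middle of a block means that the next character of the same small script is usually within the single-byte difference range.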
 
The difference between the current code point and the normalized previous code point is encoded as follows:
 
{| class="wikitable"
! style="width: auto;" | Difference range
! style="width: auto;" | Byte sequence range<br><small>(see below)</small>
|-
| <code>-10FF9F</code> to <code>-2DD0D</code>
| <code>21</code> <code>F0</code> <code>58</code> <code>D9</code> to <code>21</code> <code>FF</code> <code>FF</code> <code>FF</code>
|-
| <code>-2DD0C</code> to <code>-2912</code>
| <code>22</code> <code>01</code> <code>01</code> to <code>24</code> <code>FF</code> <code>FF</code>
|-
| <code>-2911</code> to <code>-41</code>
| <code>25</code> <code>01</code> to <code>4F</code> <code>FF</code>
|-
| <code>-40</code> to <code>3F</code>
| <code>50</code> to <code>CF</code>
|-
| <code>40</code> to <code>2910</code>
| <code>D0</code> <code>01</code> to <code>FA</code> <code>FF</code>
|-
| <code>2911</code> to <code>2DD0B</code>
| <code>FB</code> <code>01</code> <code>01</code> to <code>FD</code> <code>FF</code> <code>FF</code>
|-
| <code>2DD0C</code> to <code>10FFBF</code>
| <code>FE</code> <code>01</code> <code>01</code> <code>01</code> to <code>FE</code> <code>19</code> <code>B4</code> <code>54</code>
|}
 
Each byte range is [[lexicographical order|lexicographically ordered]] with the following thirteen byte values excluded: <code>00 07 08 09 0A 0B 0C 0D 0E 0F 1A 1B 20</code>. For example, the byte sequence <code>FC 06 FF</code>, coding for a difference of <code>1156B</code>, is immediately followed by the byte sequence <code>FC 10 01</code>, coding for a difference of <code>1156C</code>.
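
The table and the excluded byte values above can be combined into a short routine that maps a difference to its byte sequence. This is a sketch under stated assumptions: the names <code>EXCLUDED</code>, <code>TRAIL_BYTES</code>, <code>RANGES</code> and <code>encode_diff</code> are hypothetical, and the lowest bound of the first range is derived by counting back <math>243^3</math> sequences from <code>21 FF FF FF</code>.

```python
# The thirteen byte values that never appear in multi-byte sequences
EXCLUDED = {0x00, *range(0x07, 0x10), 0x1A, 0x1B, 0x20}
# The 243 remaining byte values, in ascending order, used as base-243 digits
TRAIL_BYTES = [b for b in range(256) if b not in EXCLUDED]

# (lowest difference, highest difference, first lead byte, number of trail bytes)
RANGES = [
    (-0x2DD0D - 243**3 + 1, -0x2DD0D, 0x21, 3),
    (-0x2DD0C, -0x2912, 0x22, 2),
    (-0x2911, -0x41, 0x25, 1),
    (-0x40, 0x3F, 0x50, 0),
    (0x40, 0x2910, 0xD0, 1),
    (0x2911, 0x2DD0B, 0xFB, 2),
    (0x2DD0C, 0x10FFBF, 0xFE, 3),
]


def encode_diff(diff: int) -> bytes:
    """Encode a difference in the range -10FF9F..10FFBF as 1 to 4 bytes."""
    if not -0x10FF9F <= diff <= 0x10FFBF:
        raise ValueError("difference out of range")
    for lo, hi, lead, n_trail in RANGES:
        if lo <= diff <= hi:
            offset = diff - lo
            trail = []
            for _ in range(n_trail):
                # Peel off base-243 digits, least significant first
                offset, digit = divmod(offset, 243)
                trail.append(TRAIL_BYTES[digit])
            # What remains of the offset selects the lead byte
            return bytes([lead + offset] + trail[::-1])
    raise AssertionError("unreachable: the ranges cover every valid difference")


print(encode_diff(0x1156B).hex())  # fc06ff
print(encode_diff(0x1156C).hex())  # fc1001
```

The two printed values reproduce the example in the text: the trail byte after <code>06</code> is <code>10</code>, because <code>07</code> through <code>0F</code> are excluded.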
 
Any ASCII input from <code>U+0000</code> to <code>U+007F</code>, excluding the space <code>U+0020</code>, resets the encoder state to <code>U+0040</code>. Because the line-end code points <code>U+000D</code> and <code>U+000A</code> fall in this range and are encoded ''as is'' (<code>0D 0A</code>), the encoder is in a known state at the beginning of each line. The corruption of a single byte therefore affects at most one line. By comparison, the corruption of a single byte in [[UTF-8]] affects at most one code point, whereas in [[Standard Compression Scheme for Unicode|SCSU]] it can affect the entire document.
 
BOCU-1 offers similar robustness for input texts that contain none of these code points by means of the special reset byte <code>0xFF</code>. When a decoder encounters this octet, it resets its state to <code>U+0040</code>, as it does at a line end. The BOCU-1 specification does not recommend the use of <code>0xFF</code> reset bytes, because it conflicts with other BOCU-1 design goals, notably the ''binary order''.
 
The optional use of a signature [[Byte-order mark|<code>U+FEFF</code>]] at the beginning of a BOCU-1 encoded text, i.e. the BOCU-1 byte sequence <code>FB EE 28</code>, changes the state from the initial <code>U+0040</code> to <code>U+FEC0</code>, the normalized value that the mapping above gives for <code>U+FEFF</code>. In other words, the signature cannot simply be stripped as in most other Unicode encoding schemes. Adding a reset byte after the signature (<code>FB EE 28 FF</code>) could avoid this effect, but the BOCU-1 specification does not recommend this practice.
 
In theory, [[UTF-1]] and [[UTF-8]] could encode the original 31-bit [[Universal Character Set|UCS-4]] set up to <code>7FFFFFFF</code>. BOCU-1 and [[UTF-16]] can encode the modern [[Unicode]] set from <code>U+0000</code> to <code>U+10FFFF</code>. Excluding the thirteen ''protected'' code points, which are only ever encoded as single octets, BOCU-1 can use <math>256 - 13 = 243</math> octet values in multi-byte encodings. A multi-byte encoding needs at most four bytes, consisting of a lead byte and one to three trail bytes. The trail bytes encode the remainder of the difference in base 243 ("[[modulo operation|modulo]] 243"), while the lead byte determines the number of trail bytes and an initial difference. Note that the reset byte <code>0xFF</code> is not ''protected'' and can occur as a trail byte.
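
The rules in this section can be put together into a compact encoder. The sketch below is illustrative only and is not the implementation from the Unicode Technical Note: all names (<code>bocu1_encode</code>, <code>normalize</code>, <code>encode_diff</code>, <code>TRAIL_BYTES</code>, <code>RANGES</code>) are assumptions, the input is assumed to contain no surrogate code points, and the optional <code>0xFF</code> reset byte is not emitted.

```python
EXCLUDED = {0x00, *range(0x07, 0x10), 0x1A, 0x1B, 0x20}
TRAIL_BYTES = [b for b in range(256) if b not in EXCLUDED]  # 243 digits

# (lowest difference, first lead byte, number of trail bytes), ascending
RANGES = [
    (-0x2DD0D - 243**3 + 1, 0x21, 3),
    (-0x2DD0C, 0x22, 2),
    (-0x2911, 0x25, 1),
    (-0x40, 0x50, 0),
    (0x40, 0xD0, 1),
    (0x2911, 0xFB, 2),
    (0x2DD0C, 0xFE, 3),
]


def normalize(c: int) -> int:
    """Normalized state after encoding code point c (never called for U+0020)."""
    if 0x3040 <= c <= 0x309F:
        return 0x3070
    if 0x4E00 <= c <= 0x9FA5:
        return 0x7711
    if 0xAC00 <= c <= 0xD7A3:
        return 0xC1D1
    return (c & ~0x7F) + 0x40


def encode_diff(diff: int) -> bytes:
    """Encode one difference as a lead byte plus base-243 trail bytes."""
    # Pick the last range whose lowest difference does not exceed diff
    lo, lead, n_trail = [r for r in RANGES if r[0] <= diff][-1]
    offset = diff - lo
    trail = []
    for _ in range(n_trail):
        offset, digit = divmod(offset, 243)
        trail.append(TRAIL_BYTES[digit])
    return bytes([lead + offset] + trail[::-1])


def bocu1_encode(text: str) -> bytes:
    state = 0x40                      # initial state
    out = bytearray()
    for ch in text:
        c = ord(ch)
        if c <= 0x20:
            out.append(c)             # U+0000..U+0020 are encoded as is
            if c != 0x20:
                state = 0x40          # controls reset; space keeps the state
        else:
            out += encode_diff(c - state)
            state = normalize(c)
    return bytes(out)


print(bocu1_encode("\ufeff").hex())   # fbee28
print(bocu1_encode("a b\r\n").hex())  # b120b20d0a
```

The first call reproduces the signature byte sequence <code>FB EE 28</code>. The second shows that the space and the line end pass through as single unchanged octets, while the letters are encoded as small differences from <code>U+0040</code>.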
 
== Patent ==
 
The general BOCU algorithm is covered by [[United States patent law|United States Patent]] #6,737,994, which also mentions the specific BOCU-1 implementation.<ref>{{cite web |url=http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.htm&r=1&f=G&l=50&s1=6737994.PN.&OS=PN/6737994&RS=PN/6737994 |title=United States Patent #6,737,994, “Binary-ordered compression for unicode” |date=2004-05-18 |author=[[Mark Davis (Unicode)|Davis]], et al. |accessdate=2008-11-16}}</ref> [[IBM]], which employed both of the inventors of BOCU-1 at the time it was created, states in the Unicode Technical Note that implementers of a "fully compliant version of BOCU-1" must contact IBM to request a royalty-free license.<ref>{{cite web |url=http://www.unicode.org/notes/tn6/#Intellectual_Property |title=UTN #6: BOCU-1|date=2006-02-04 |author=Markus Scherer, [[Mark Davis (Unicode)|Mark Davis]] |accessdate=2008-11-16}}</ref> BOCU-1 is the only Unicode compression scheme described on the Unicode Web site that is known to be encumbered with [[intellectual property]] restrictions.
 
By contrast, IBM also filed for a patent on [[UTF-EBCDIC]], but it chose in that case to make the documentation and [[Character_encoding#Modern_encoding_model|encoding scheme]] “freely available to anyone concerned towards making the transformation format as part of the UCS standards,” instead of requiring implementers to request a license.<ref>{{cite web |url=http://www.unicode.org/reports/tr16/#Bibliography |title=UTR #16: UTF-EBCDIC|date=2002-04-16 |author=V.S. Umamaheswaran |accessdate=2008-11-16}}</ref>
 
== References ==
{{reflist}}
 
== See also ==
* [[UTF-1]], which contains a comparison of the UTF-1, [[UTF-8]], and BOCU-1 designs
* [[International Components for Unicode]], a library that can convert between BOCU-1 and other Unicode encodings
 
{{Unicode navigation}}
{{character encoding}}
 
{{DEFAULTSORT:Binary Ordered Compression For Unicode}}
[[Category:Data compression]]
[[Category:Unicode Transformation Formats]]

Revision as of 01:57, 14 March 2013
