
Unicode



1. Unicode

In computing, Unicode provides an international standard which has the goal of providing the means to encode the text of every document people want to store on computers. This includes all scripts in active use today, many scripts known only by scholars, and symbols which do not strictly represent scripts, like mathematical, linguistic and APL symbols.

Establishing Unicode involves an ambitious project to replace existing character sets, many of them limited in size and problematic in multilingual environments. Despite technical problems and limitations, Unicode has become the most complete character set and one of the largest, and seems set to serve as the dominant encoding scheme in the internationalization of software and in multilingual environments. Many recent technologies, such as XML, the Java programming language as well as several operating systems, have adopted Unicode as an underlying scheme to represent text.
'''from wikipedia.org'''

3. thread

One of the things I have been interested in lately.
Most Linux applications today are either developed on top of UTF-8 or are being ported to it, and newly created packages use UTF-8. Desktop environments such as GNOME and KDE still support the older EUC-based character sets, but only on the surface; internally they convert everything to UTF-8 for processing. In the end, the switch to UTF-8 is only a matter of time; it is where things are heading.
Anyone developing programs for multilingual platforms needs to understand this part. - eternalbleu

4. Related posts

UNICODE :

http://www.unicode.org/standard/translations/korean.html

What is Unicode?
No matter what the platform,
no matter what the program,
no matter what the language,
Unicode provides a unique number for every character.


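UCS-2 :

A standard that defines most of the commonly used characters.
It is called UCS-2 because it covers a 2-byte range (U+0000 to U+FFFF).
Its 16-bit form matches UTF-16 within that range; UTF-16 additionally uses surrogate pairs (two 16-bit units) to reach characters beyond U+FFFF.
UTF-16LE and UTF-16BE are the same encoding; Little Endian and Big Endian differ only in byte order.
Running iconv --list prints far more names than you need,
but for characters in the 2-byte range you can treat UTF-16LE and UCS-2LE as the same thing, and likewise the two BE variants.
Plain UTF-16 is the same encoding with a BOM header in front that announces the byte order; on little-endian machines the rest typically matches UTF-16LE.
UCS-2, by contrast, is normally written without a header.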
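
The byte-order difference is easy to see by encoding one character each way. Below is a minimal Python sketch (Python is used here only for illustration; the original post does not use it). It relies on the standard `utf-16-le`, `utf-16-be`, and `utf-16` codecs, and the sample character is an arbitrary choice.

```python
import codecs

ch = "가"  # U+AC00

le = ch.encode("utf-16-le")     # low byte first:  00 ac
be = ch.encode("utf-16-be")     # high byte first: ac 00
with_bom = ch.encode("utf-16")  # BOM first, then the machine's native byte order

print(le.hex(), be.hex(), with_bom.hex())

# The BOM is U+FEFF written in whichever byte order the encoder picked.
bom = with_bom[:2]
print(bom == codecs.BOM_UTF16_LE or bom == codecs.BOM_UTF16_BE)
```

Which byte order follows the BOM depends on the machine running the code, so the third value is best read together with its first two bytes.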

UCS-4 :

An extension of UCS-2 to 4 bytes.
The low 2 bytes are fully compatible with UCS-2.
That is, 0xFFFF in UCS-2 is the same code as 0x0000FFFF in UCS-4.
Everything above applies equally if you swap in the name UTF-32.
You can see these code values in a web browser:
in JavaScript, codePointAt() returns the code as a decimal number (indexOf() only returns a position, and charCodeAt() returns a single 16-bit UTF-16 unit).
Values up to decimal 65535 are fully compatible with UCS-2.
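
To make the 65535 boundary concrete, here is a small Python sketch that prints a character's code in decimal and hex along with its UTF-32 and UTF-16 bytes. The helper name `describe` and the two sample characters are illustrative assumptions, not part of the original post.

```python
def describe(ch: str) -> dict:
    """Summarize one character's code point and its fixed-width encodings."""
    cp = ord(ch)  # the code point as an integer
    return {
        "decimal": cp,
        "hex": f"U+{cp:04X}",
        "utf32_be": ch.encode("utf-32-be").hex(),  # always 4 bytes
        "utf16_be": ch.encode("utf-16-be").hex(),  # 2 bytes, or 4 for a surrogate pair
        "fits_ucs2": cp <= 0xFFFF,
    }

ga = describe("가")             # inside the 2-byte range
emoji = describe("\U0001F600")  # outside it, needs 5 hex digits

print(ga)
print(emoji)
```

For a character inside the 2-byte range, the UTF-32 bytes are just the UTF-16 bytes with two zero bytes in front; outside that range UTF-16 has to switch to a surrogate pair.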

UTF-8 :

UCS-2 and UCS-4 are clearly wasteful for English text.
ASCII alone is enough there, yet every character carries bytes that are never used.
For that reason, and because it is easier to process as a string than UTF-7, UTF-8 is the most widely used encoding.
It is a variable-length encoding: 1 to 4 bytes per character.
It can be converted to and from UCS-2 and UCS-4 by calculation alone, with no lookup table.
Korean lies within the UCS-2 range, so each Korean character fits in 3 bytes.
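
The "calculation alone" conversion can be sketched in a few lines of Python. The function name `utf8_encode` is an assumption made for this example; in practice Python's built-in `str.encode("utf-8")` does the same job, and the sketch is only meant to show the bit arithmetic.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single code point as UTF-8 using only shifts and masks."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:     # 1 byte: 0xxxxxxx (plain ASCII)
        return bytes([cp])
    if cp < 0x800:    # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:  # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])

print(utf8_encode(0xAC00).hex())  # the Korean syllable U+AC00
print(utf8_encode(ord("A")).hex())
```

Every code point in the 2-byte range lands in one of the first three branches, which is why Korean text never needs more than 3 bytes per character.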

UTF-7 :

Created for email and other channels that must be expressed in ASCII only.
Each unit occupies 8 bits, but only 7 of them are used.
It is variable-length, like UTF-8,
but not every ASCII value means the same thing as in real ASCII (for example, '+' starts an encoded run), so working with it directly is rather awkward.
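
A short Python sketch shows both properties at once: the output stays within 7 bits, and an ordinary ASCII character such as '+' no longer stands for itself. It uses Python's built-in `utf-7` codec; the sample string is an arbitrary choice.

```python
text = "A가+"
encoded = text.encode("utf-7")

print(encoded)                          # ASCII-only bytes
print(all(b < 0x80 for b in encoded))   # every byte fits in 7 bits
print(encoded.decode("utf-7") == text)  # decodes back to the original text

# A literal '+' has to be escaped, because '+' opens an encoded run.
print("+".encode("utf-7"))
```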


BOM (Byte Order Mark) :

Because there are several Unicode encodings, a header like this is sometimes placed at the start of the data to tell them apart.
Editors such as EmEditor, UltraEdit, and Vim recognize it.
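
How an editor might recognize the header can be sketched as a simple prefix check. This is only an illustration under stated assumptions: the function name `sniff_bom` is made up for this example, and real editors combine such a check with other heuristics. The BOM constants come from Python's standard `codecs` module.

```python
import codecs

# Check longer BOMs first: the UTF-32LE BOM begins with the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) when no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

raw = codecs.BOM_UTF16_LE + "가".encode("utf-16-le")
name, skip = sniff_bom(raw)
print(name, raw[skip:].decode(name))
```

Data without a BOM yields `(None, 0)`, in which case the encoding has to be known from somewhere else.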


Code charts

http://www.unicode.org/charts/

These show the code range of each script and the characters defined in it.
Leading zeros are not written (the values are not zero-filled), so treat values of up to 4 hex digits as UCS-2
and values of 5 digits or more as UCS-4.

		
resy	I had been wishing for a Unicode tutorial written in Korean..
Even a write-up like this is badly needed. So many people still assume that Unicode = UTF-16 (or some other single encoding)...

Somebody really ought to step up and take this on... sigh... ^^:	07/13 2:23:12
		
resy	To add to this...
Think of UCS as a table of code values. UTF is an encoding method (that is, a definition of how each code is laid out as a sequence of bytes), while UCS is the table of predefined character codes. For example, the character '가' is U+AC00 in Unicode, and in UCS-2 it sits at table position 0xAC00. Encoded as UTF-8, it becomes the three bytes 0xEA 0xB0 0x80.

Nobody ever explained it to me this way, so I used to wander around confused, wondering whether UCS2 = UTF16?? I am not sure this explanation is exactly right. If anything is wrong, please point it out... ^^;

Nowhere really gives a clear explanation of the difference between a character set and an encoding, either. I eventually figured it out on my own over time.. but apart from foreign-language material, what is available in Korea is, well, -_-;

Come to think of it, someone once posted about what national code tables and encodings mean, but nothing about locale seems to have been posted since... 	07/13 5:19:40
		
utf	The stated purpose of UTF-8 is a bit off. Its original purpose was not to drop the unused leading byte when the text is ASCII only. During its design, the ASCII range was made to map through unchanged for compatibility with ASCII. The actual motivation is that UCS characters are 2 or 4 bytes each, and when you string them together a null (0x00) can turn up in the middle. For example, '가' is 0xac00, and that null byte makes string handling troublesome. So an encoding that avoids null bytes was developed.	07/13 23:22:49
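
The null-byte point in the comment above can be checked directly. The following Python sketch compares the UTF-16 and UTF-8 bytes of two sample characters; the helper name `contains_nul` and the samples are assumptions made for this illustration.

```python
def contains_nul(data: bytes) -> bool:
    """True when the byte string contains a 0x00 byte."""
    return b"\x00" in data

for s in ["A", "가"]:
    utf16 = s.encode("utf-16-be")
    utf8 = s.encode("utf-8")
    print(repr(s), utf16.hex(), contains_nul(utf16), utf8.hex(), contains_nul(utf8))
```

In UTF-8 a 0x00 byte appears only when the text itself contains the character U+0000, so C-style string functions that stop at the first null keep working.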
		
resy	As the previous poster says, UTF-8 also matters for safe transmission over the Internet, since it contains no null bytes. HTTP is a typical example: its headers are text-based and cannot contain null characters.

The history of UTF-8's development can be found here:
http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt 	07/13 23:58:19
		
정태영	The problem utf describes was already solved by utf7 alone :)

Telling people whose culture gets by on ASCII characters to adopt utf16 or utf32 would not go over well anyway.. I think it makes more sense to see it as a transitional encoding meant to reduce the confusion of moving from encodings such as euc to unicode...

last modified 2021-02-07 05:28:20