1. Unicode ¶

In computing, Unicode provides an international standard which has the goal of providing the means to encode the text of every document people want to store on computers. This includes all scripts in active use today, many scripts known only by scholars, and symbols which do not strictly represent scripts, like mathematical, linguistic and APL symbols.

Establishing Unicode involves an ambitious project to replace existing character sets, many of them limited in size and problematic in multilingual environments. Despite technical problems and limitations, Unicode has become the most complete character set and one of the largest, and seems set to serve as the dominant encoding scheme in the internationalization of software and in multilingual environments. Many recent technologies, such as XML, the Java programming language as well as several operating systems, have adopted Unicode as an underlying scheme to represent text.
'''from wikipedia.org'''

2. document ¶

official consortium : http://www.unicode.org
introduction : http://www.unicode.org/standard/translations/korean.html
specification : http://www.unicode.org/versions/Unicode4.1.0/
http://pluu.pe.kr/pukiwiki/

3. thread ¶

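One of the things I am interested in these days.
Most Linux libraries are now developed on the basis of UTF-8, and everything newly written is UTF-8 throughout. Window managers such as GNOME and KDE still handle documents based on the legacy EUC encodings, but only on the surface; internally they convert to UTF-8. In the end, converting documents to UTF-8 is replacing the existing ones.
Anyone developing programs for a multilingual platform really ought to know this. - eternalbleu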

4. Related articles ¶

UNICODE :

http://www.unicode.org/standard/translations/korean.html

What is Unicode?
No matter what the platform,
no matter what the program,
no matter what the language,
Unicode provides a unique number for every character.


UCS-2 :

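A standard that defines most of the characters in common use.
It is called UCS-2 because it covers a 2-byte range.
Counted in bits rather than bytes, the matching encoding is UTF-16. (Strictly, the two coincide only up to U+FFFF; UTF-16 additionally uses surrogate pairs for code points beyond that range.)
UTF-16LE and UTF-16BE are the same specification; only the byte order (little endian vs. big endian) differs.
Running iconv --list shows a long list of encodings,
but UTF-16LE and UCS-2LE can be treated as the same thing, and likewise the two BE variants.
Plain UTF-16, as typically produced on a little-endian machine, is the same as UTF-16LE except that a BOM header is prepended.
UCS-2 has no such header.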
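
The byte-order difference can be checked with a short Python sketch. This is only an illustration: the helper name strip_utf16_bom is made up for this example, and the behavior of the plain "utf-16" codec (BOM plus native byte order) is that of Python's codec, which may differ from what a particular iconv build does.

```python
import codecs

text = "가"  # U+AC00

le = text.encode("utf-16-le")  # low byte first:  00 AC
be = text.encode("utf-16-be")  # high byte first: AC 00

# Python's plain "utf-16" codec prepends a BOM and then writes the
# text in the machine's native byte order.
with_bom = text.encode("utf-16")


def strip_utf16_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-16 BOM, if one is present."""
    for bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        if data.startswith(bom):
            return data[len(bom):]
    return data


print(le.hex(), be.hex(), with_bom.hex())
```

Apart from the BOM, the "utf-16" output is byte-for-byte identical to one of the two explicit variants.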

UCS-4 :

An extension of UCS-2.
The trailing 2 bytes are used exactly as in UCS-2.
That is, 0xFFFF in UCS-2 is the same code as 0x0000FFFF in UCS-4.
Swap in the name UTF-32 and everything above still applies.
It is reportedly what browsers generally use internally.
In JavaScript, charCodeAt() returns a character's code as a decimal number (indexOf() returns a position, not a code), and codePointAt() returns the full code point.
Since the result is a plain decimal number, values up to 65535 are the same as in UCS-2.
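
The zero-extension described above can be sketched in Python. The function name ucs2_to_ucs4_be is hypothetical, and big-endian output is an arbitrary choice made for readability.

```python
import struct


def ucs2_to_ucs4_be(code: int) -> bytes:
    """Zero-extend a 16-bit UCS-2 value to a 4-byte big-endian UCS-4 value."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("not a 16-bit value")
    return struct.pack(">I", code)  # ">I" = big-endian unsigned 32-bit


ga = "가"                                # U+AC00
print(ord(ga))                           # the code point as a decimal number
print(ucs2_to_ucs4_be(0xFFFF).hex())     # two zero bytes, then ffff
print(ga.encode("utf-32-be").hex())      # two zero bytes, then ac00
```

For any character in the 2-byte range, the UTF-32 bytes are just the UCS-2 value with two zero bytes in front.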

UTF-8 :

UCS-2 and UCS-4 are clearly wasteful in English-speaking regions.
ASCII alone is enough there, yet every character takes two or four times as many bytes.
Partly for that reason, and because it is easier to process as a string than UTF-7, UTF-8 is the most widely used encoding.
It is a variable-length encoding.
It can be converted to and from UCS-2 and UCS-4 with simple bit operations.
Korean lies within the UCS-2 range, so it can be represented in at most 3 bytes.
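
The "simple operations" mentioned above amount to shifting and masking. The following Python sketch encodes one code point by hand; utf8_encode is a name chosen for this example, and in practice the built-in str.encode("utf-8") does the same job.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 using bit operations."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:        # 1 byte: 0xxxxxxx (plain ASCII)
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([       # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(utf8_encode(0xAC00).hex())  # '가' takes three bytes
```

Every code point up to U+FFFF, which includes all Hangul syllables, falls in one of the first three branches.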

UTF-7 :

It was created for email and similar channels that must be expressed in ASCII only.
Each character is allotted 8 bits, but only 7 of them are actually used.
It shares UTF-8's characteristics,
but because not every ASCII value means the same thing as in real ASCII, it is rather hard to do anything with it directly.
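
A small Python sketch of both points, using Python's built-in "utf-7" codec (the sample string is arbitrary): every output byte stays below 0x80, yet a literal "+" can no longer be written as itself because it introduces an escaped run.

```python
text = "Hi 가+"

encoded = text.encode("utf-7")
decoded = encoded.decode("utf-7")

# Every byte is 7-bit clean, so the result survives ASCII-only channels.
seven_bit_clean = all(byte < 0x80 for byte in encoded)

# "+" starts an escaped run in UTF-7, so a literal plus sign has to be
# escaped as well: this ASCII byte does not simply mean itself.
plus_encoded = "+".encode("utf-7")

print(encoded, seven_bit_clean, plus_encoded)
```

This is why searching or slicing UTF-7 data as if it were plain ASCII is unreliable.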


BOM (Byte Order Mark) :

Because there are many kinds of Unicode encoding, a header like this is sometimes prepended to tell them apart.
Editors such as EmEditor, UltraEdit and Vim support it.
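
As a rough illustration of how a tool might use the header, here is a minimal BOM sniffer in Python. The function name sniff_bom and the returned labels are assumptions made for this sketch; it is not how any of the editors above are known to work.

```python
import codecs

# UTF-32 marks are checked first: the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


sample = codecs.BOM_UTF16_LE + "가".encode("utf-16-le")
encoding, skip = sniff_bom(sample)
print(encoding, sample[skip:].decode(encoding))
```

Data without a BOM yields (None, 0), in which case the encoding has to be known some other way.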


Code charts

http://www.unicode.org/charts/

You can see the code range for each country's script and the characters defined in it.
Leading zeros are not written (the values are not zero-filled to full width), so treat codes of up to 4 digits as UCS-2
and codes of 5 digits or more as UCS-4.
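
The digit-count rule of thumb can be expressed in a few lines of Python. Both function names are invented for this sketch.

```python
def chart_notation(ch: str) -> str:
    """Format a character the way the code charts label it, e.g. U+AC00."""
    return f"U+{ord(ch):04X}"


def fits_in_ucs2(ch: str) -> bool:
    """True when the code needs at most 4 hex digits (the 2-byte range)."""
    return ord(ch) <= 0xFFFF


for ch in ("A", "가", chr(0x1F600)):
    print(chart_notation(ch), fits_in_ucs2(ch))
```

Codes are padded to at least 4 hex digits but never further, so a 5-digit label always means the character lies outside the 2-byte range.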

		
resy	I had been wishing there were a Unicode summary written in Korean..
I am glad that at least this much exists. There still seem to be many people who think Unicode = UTF-16 (or some other encoding)...

It looks like someone voted this down... why... ^^:	07/13 2:23:12
		
resy	As a supplement...
Think of UCS as a table of code values. UTF is a method of encoding (that is, the question of how information expressed as bytes should be interpreted), while UCS is the predefined code of each character written down as a table. For example, the character '가' corresponds to U+AC00 in Unicode, and UCS-2 defines it at table position 0xAC00. Encoded as UTF-8, it becomes 0xEAB080.

Nobody explained it to me this way, so at first I was confused and lost, wondering whether UCS-2 = UTF-16??, and I am not sure this explanation is accurate either. If anything is wrong, I hope someone will point it out... ^^;

There was really nowhere that properly taught character sets and encodings. I eventually worked it out on my own as time went by.. in any case, apart from foreign material, what is available domestically is -_-;

Come to think of it, I believe someone once posted about what country codes and encodings mean, and that post seems to have covered locale as well...	07/13 5:19:40
		
utf	The purpose of UTF-8 is described ambiguously here. Its original purpose was not to save the extra bytes that are needlessly spent when only ASCII characters are used. During development it was simply arranged that ASCII characters keep their values unchanged when converted. The purpose itself is this: the UCS matrix consists of 2 or 4 bytes, and when those are laid out one after another as a string, a null (0x00) can end up in the middle. For example, '가' is 0xAC00, and that null byte makes string handling awkward. So an encoding technique that can eliminate null bytes was developed.	07/13 23:22:49
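
The null-byte problem raised in this comment is easy to reproduce. The Python sketch below uses a made-up helper, c_strlen, that mimics how a C-style string routine stops at the first 0x00 byte.

```python
def c_strlen(buf: bytes) -> int:
    """Length as seen by a C-style routine that stops at the first 0x00."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end


text = "A가"

utf16 = text.encode("utf-16-be")  # 00 41 AC 00: contains null bytes
utf8 = text.encode("utf-8")       # 41 EA B0 80: no null bytes

print(utf16.hex(), c_strlen(utf16))  # cut off before the first character
print(utf8.hex(), c_strlen(utf8))    # whole string is visible
```

ASCII characters keep their single-byte values in UTF-8, and no byte of a multi-byte sequence is ever 0x00.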
		
resy	As the poster above says, UTF-8 is also meaningful where data is sent as text, because no null characters get in. HTTP is the typical example of such a protocol, and null characters cannot appear in the transmitted data.

The history of how UTF-8 was developed can be read at the link below.
http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt 	07/13 23:58:19
		
[name illegible]	The part that utf brings up is a problem that UTF-7 alone had already solved :)

Telling people in cultures where ASCII characters are enough to use UTF-16 or UTF-32 would not get any traction.. I think it is more meaningful to see it as a transitional encoding meant to reduce the confusion of moving from encodings such as EUC to Unicode...

5. References ¶

http://www.joelonsoftware.com/articles/Unicode.html
