Unicode — Postmypost

Unicode

Nikiforov Alexander
Friend of clients

What is Unicode?

Unicode is a character encoding standard that encompasses characters from almost all languages in the world. This standard allows computers to process textual information and display it correctly on the screen. All information in a computer is stored and processed in binary format, that is, in the form of sequences of zeros and ones. To convert such binary sequences into characters understandable to users, special encodings have been developed that establish rules by which each character—be it a letter, a number, or even a musical note—receives a unique numeric code.

For a computer to display a character correctly, it must know which binary code corresponds to it. For example, in ASCII the binary sequence 0100 0001 corresponds to the Latin letter A. A single byte, however, can distinguish only 256 values, far too few for all the world's writing systems, which is why Unicode operates on a different principle. Each character is assigned a code point: a unique numeric value written in the form U+XXXX. The prefix U+ indicates Unicode, while XXXX is the hexadecimal value of the code point, four to six digits long.

In the hexadecimal system, 16 symbols are used: the digits 0 to 9 and the letters A to F, which represent the values 10 to 15. For example, the English letter A corresponds to the code point U+0041, and the word HELLO corresponds to the code points U+0048, U+0045, U+004C, U+004C, U+004F. Each code point is then converted into a binary format understandable to the computer and stored in its memory. Emojis are encoded in Unicode in the same way.
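The relationship between a character, its code point, and its binary form can be checked directly. The following is a minimal Python sketch; the helper names `code_points` and `to_binary` are illustrative and not part of any standard.

```python
def code_points(text):
    """Return the U+XXXX notation for every character in text."""
    # ord() gives the code point as an integer; 04X pads it to at least 4 hex digits.
    return [f"U+{ord(ch):04X}" for ch in text]


def to_binary(ch):
    """Return the code point of ch as a binary string, at least 8 bits wide."""
    return format(ord(ch), "08b")


print(code_points("HELLO"))  # ['U+0048', 'U+0045', 'U+004C', 'U+004C', 'U+004F']
print(to_binary("A"))        # 01000001
print(chr(0x41))             # A
print(code_points("😀"))     # ['U+1F600']
```

Uppercase and lowercase letters have different code points, so "Hello" would give U+0048, U+0065, U+006C, U+006C, U+006F instead. The emoji shows that code points are not limited to four hexadecimal digits.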

Why is Unicode necessary?

Initially, each language had its own encodings, many of which were incompatible with one another. This led to the problem of "garbled text," where strange symbols appeared on the screen instead of readable text. For example, if a girl named Masha from Russia sent an e-mail with the word Привет to her friend in Armenia, he might instead receive ??????. The original text was lost because the sender's and the receiver's computers used different encodings.
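Both kinds of damage are easy to reproduce. The sketch below assumes, purely for illustration, that the sender used Windows-1251 and the receiver expected KOI8-R or plain ASCII; the text above does not say which encodings were actually involved.

```python
text = "Привет"

# The sender's computer stores the text as Windows-1251 bytes.
sent = text.encode("cp1251")

# The receiver wrongly assumes KOI8-R, so every byte maps to a different letter.
garbled = sent.decode("koi8_r")

# A system limited to ASCII replaces each unsupported character with "?".
lost = text.encode("ascii", errors="replace").decode("ascii")

print(garbled)  # оПХБЕР
print(lost)     # ??????
```

In the first case the bytes survive and the text could still be recovered by decoding with the right encoding. In the second case the question marks replace the original characters, so the information is gone.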

Unicode was created to solve this problem by providing a unified method of representing characters and simplifying work with multilingual text. With Unicode, you can send an email or post text on a website even in Klingon, a constructed language developed by linguist Marc Okrand for the Star Trek universe, as long as it is written in its usual Latin transcription (the Klingon pIqaD script itself is not part of the standard). Recipients will be able to see it in its original readable form. As of today, Unicode includes approximately 150,000 characters, which is sufficient to cover almost all writing systems.

Other encodings: what existed before Unicode

Before the advent of Unicode, there were many different encodings, each designed for a specific language. The most well-known are ASCII, KOI8-R, and Windows-1251.

  • ASCII: A character encoding table of 128 characters (codes 0 to 127), including Latin letters, digits, punctuation marks, and control characters. ASCII does not support Cyrillic, so attempts to encode Russian text produced a string of question marks and the original text was lost.
  • KOI8-R: An 8-bit encoding developed to represent Cyrillic in addition to the Latin alphabet. KOI8-R is compatible with ASCII and has 256 code positions, which leaves room for Russian letters.
  • Windows-1251: Another 256-character encoding that supports Cyrillic. However, the same character is encoded differently in different encodings. For example, the character Г is stored as the byte E7 in KOI8-R and as C3 in Windows-1251.
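The difference between the three encodings can be seen by asking each one for the byte it assigns to the same character. This is a small Python sketch using the standard codec names `koi8_r` and `cp1251`; the helper `byte_for` is a hypothetical name introduced here.

```python
def byte_for(char, encoding):
    """Return the bytes that `encoding` uses for `char`, as a hex string."""
    return char.encode(encoding).hex()


letter = "Г"  # CYRILLIC CAPITAL LETTER GHE, code point U+0413

print(byte_for(letter, "koi8_r"))  # e7
print(byte_for(letter, "cp1251"))  # c3

# Both Cyrillic encodings keep the ASCII range unchanged.
print(byte_for("A", "koi8_r"), byte_for("A", "cp1251"))  # 41 41

# ASCII itself has no code for the letter at all.
try:
    letter.encode("ascii")
except UnicodeEncodeError as error:
    print("ASCII cannot encode:", error.object)
```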

The incompatibility showed up whenever text encoded with one encoding was decoded using another. Older encodings were also severely limited in size: ASCII has 128 characters, while KOI8-R and Windows-1251 have 256 each. For many less widely used languages, encodings simply did not exist. Unicode, by contrast, offers a universal solution: a single set of code points for the characters of nearly all writing systems, together with the rules for encoding them.
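To illustrate the size limitation, the sketch below tests which encodings can represent one message that mixes Latin, Cyrillic, and Armenian. UTF-8 is not discussed in the text above; it is used here as an assumption, as one common way of storing Unicode code points as bytes. The helper `can_encode` is a hypothetical name.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


message = "Hello, Привет, Բարև"

for encoding in ("ascii", "koi8_r", "cp1251", "utf-8"):
    print(encoding, can_encode(message, encoding))

# With a Unicode encoding the text survives a full round trip unchanged.
restored = message.encode("utf-8").decode("utf-8")
print(restored == message)  # True
```

Only the Unicode encoding accepts the whole message; each legacy encoding fails on at least one of the scripts.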