For Developers‎ > ‎Design Documents‎ > ‎

IDN in Google Chrome

Background

Back in the day, hostnames could only consist of the letters A to Z, digits, and a few other characters. Internationalized Domain Names (IDNs) were devised to support arbitrary Unicode characters in hostnames in a backward-compatible way. This works by having user agents transform a hostname containing Unicode characters beyond ASCII to one fitting the traditional mold, which can then be sent on to DNS servers. For example, http://öbb.at is transformed to http://xn--bb-eka.at. The transformed form is called ASCII Compatible Encoding (ACE) made up of the four character prefix ( xn-- ) and the punycode representation of Unicode characters.

Ideally, user agents could always display the Unicode version of a hostname. However, different characters from different languages can look very similar, and this can make phishing attacks possible. For example, the Latin "a" looks a lot like the Cyrillic "а", so someone could register http://ebаy.com (http://xn--eby-7cd.com/), which would easily be mistaken for http://ebay.com. This is called a homograph attack.

In a perfect world, domain registrars would not allow such nefarious domain names to be registered. Some TLD registrars do exactly that, mostly by restricting the characters allowed, but many do not. For some TLDs that are meant to be international, this would be nontrivial to do (e.g., .com).

As a result, all browsers try to protect against homograph attacks by displaying punycode instead of the original IDN if the hostname does not fulfill certain properties. They try to do this in a way that allows IDN to be shown for valid hostnames, but protects against phishing.

Google Chrome's IDN policy

Starting with Google Chrome 51, whether or not to show hostnames in Unicode is determined independently of the language settings (the Accept-Language list). Its algorithm is similar to what Firefox does. ( the changelist description that implemented the new policy.)

Google Chrome decides if it should show Unicode or punycode for each domain label (component) of a hostname separately. To decide if a component should be shown in Unicode, Google Chrome uses the following algorithm:
(This is implemented by IDNToUnicodeOneComponent and IsIDNComponentSafe() in components/url_formatter/url_formatter.cc)

Consequences / Examples

[The old content here was completely inaccurate and has been removed.  TODO: add examples of the above]

Behavior of other browsers

IE

IE displays URLs in IDN form if every component contains only characters of one of the languages configured in "Languages" on the "General" tab of "Internet Options", similar to what Google Chrome does.

Firefox

Firefox uses a script mixing detection algorithm based on the "Moderately Restrictive" profile of Unicode Technical Report 39. Domains of any single script, any single script + Latin, or a small whitelist of other combinations are displayed as Unicode; everything else is Punycode.

Opera

Like Firefox, Opera has a whitelist of TLDs and shows IDN only for these whitelisted TLDs.

Safari

Safari has a whitelist of scripts that do not contain confusable characters, and only shows the IDN form for whitelisted scripts. The whitelist does not include Cyrillic and Greek (they are confusable with Latin characters), so Safari will always show punycode for Russian and Greek URLs.


Comments