
IDN in Google Chrome

Background

Many years ago, domains could only consist of the Latin letters A to Z, digits, and a few other characters. Internationalized Domain Names (IDNs) were created to better support non-Latin alphabets for web users around the globe.

Different characters from different (or even the same!) languages can look very similar. We’ve seen reports of proof-of-concept attacks. These are called homograph attacks. For example, the Latin "a" looks a lot like the Cyrillic "а", so someone could register http://ebаy.com (using Cyrillic "а"), which could be confused for http://ebay.com. This is a limitation of how URLs are displayed in browsers in general, not a specific bug in Chrome.
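As a small illustration of why the two hostnames are different even though they look alike, the following Python sketch compares them character by character. The helper name differing_characters is hypothetical and exists only for this example.

```python
import unicodedata


def differing_characters(a: str, b: str):
    """Return (index, name_in_a, name_in_b) for each position where a and b differ."""
    return [
        (i, unicodedata.name(x), unicodedata.name(y))
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    ]


latin = "ebay"
spoof = "eb\u0430y"  # third character is U+0430, the Cyrillic "a"

# The strings render almost identically but are not equal.
print(latin == spoof)
for index, left, right in differing_characters(latin, spoof):
    print(index, left, "vs", right)
```

The comparison shows that only one code point differs, which is all a homograph attack needs.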

In a perfect world, domain registrars would not allow these confusable domain names to be registered. Some TLD registrars do exactly that, mostly by restricting the characters allowed, but many do not. As a result, all browsers try to protect against homograph attacks by displaying punycode (looks like "xn-- ...") instead of the original IDN, according to an IDN policy.

This is a challenging problem space. Chrome has a global user base of billions of people, many of whom are not viewing URLs with Latin letters. We want to prevent confusion, while ensuring that users across languages have a great experience in Chrome. Displaying either punycode or a visible security warning on too wide a set of URLs would hurt web usability for people around the world.

Chrome and other browsers try to balance these needs by implementing IDN policies in a way that allows IDN to be shown for valid domains, but protects against confusable homograph attacks.

Google Safe Browsing continues to help protect over two billion devices every day by showing warnings to users when they attempt to navigate to dangerous or deceptive sites or download dangerous files. Password managers (like Google Smart Lock) continue to remember which domain each saved login belongs to, and won’t automatically fill a password into any domain other than exactly the correct one.

How IDN works

IDNs were devised to support arbitrary Unicode characters in hostnames in a backward-compatible way. This works by having user agents transform a hostname containing Unicode characters beyond ASCII to one fitting the traditional mold, which can then be sent on to DNS servers. For example, http://öbb.at is transformed to http://xn--bb-eka.at. The transformed form is called ASCII Compatible Encoding (ACE). Each transformed label is made up of the four-character prefix (xn--) followed by the punycode representation of the label's Unicode characters.
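The transformation can be tried out with Python's built-in "idna" and "punycode" codecs. This is only a sketch: the standard-library codec implements the older IDNA 2003 rules, which may differ from what a browser does for some inputs, and the helper names to_ace and to_unicode are hypothetical.

```python
def to_ace(hostname: str) -> str:
    """Convert a Unicode hostname to its ASCII Compatible Encoding."""
    return hostname.encode("idna").decode("ascii")


def to_unicode(hostname: str) -> str:
    """Convert an ACE hostname back to its Unicode form."""
    return hostname.encode("ascii").decode("idna")


unicode_host = "\u00f6bb.at"  # "öbb.at"
ace_host = to_ace(unicode_host)
print(ace_host)              # the form sent to DNS servers
print(to_unicode(ace_host))  # round-trips back to the Unicode form

# The part after the "xn--" prefix is the punycode encoding of the label.
print("\u00f6bb".encode("punycode"))
```

Only labels containing non-ASCII characters receive the xn-- prefix; the "at" label passes through unchanged.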

April 2017 update

Specific instances of IDN homograph attacks have been reported to Chrome, and we continually update our IDN policy to protect against these attacks. One specific instance of this general issue was reported to Chrome security on January 20. It was marked as a medium-severity bug. A fix landed on March 23. The researcher whose report led to the fix was awarded $2,000 under Chrome's Vulnerability Reward Program. That fix will be released in Chrome 58, which has a stable release around the end of April.

This fix is an attempt to balance the needs of our international user base with protection against confusable homograph attacks. The fix uses punycode for domain names that are made entirely of Latin lookalike Cyrillic letters when the top-level domain is not an internationalized domain name, meaning that the check only applies to top-level domains like "com", "net", and "uk". We’re working on additional fixes, for example, for confusables within one script set -- “l” (lowercase L) could be confused with “ı” (small dotless i character). We will keep this article updated with our current IDN policy below.
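The rule described above can be sketched in Python as follows. This is a minimal illustration, not Chrome's implementation: the function name should_show_punycode and the lookalike set are assumptions made for this example, and the set is only a small subset of Cyrillic letters that resemble Latin ones.

```python
# Illustrative subset only (an assumption for this sketch, not Chrome's list).
LATIN_LOOKALIKE_CYRILLIC = {
    "\u0430",  # looks like Latin a
    "\u0441",  # looks like Latin c
    "\u0435",  # looks like Latin e
    "\u043e",  # looks like Latin o
    "\u0440",  # looks like Latin p
    "\u0445",  # looks like Latin x
    "\u0443",  # looks like Latin y
    "\u0455",  # looks like Latin s
    "\u0456",  # looks like Latin i
    "\u0458",  # looks like Latin j
    "\u04cf",  # looks like Latin l
}


def is_idn_tld(tld: str) -> bool:
    """A TLD counts as internationalized if it is non-ASCII or already in ACE form."""
    return any(ord(ch) > 127 for ch in tld) or tld.lower().startswith("xn--")


def should_show_punycode(label: str, tld: str) -> bool:
    """True if the label consists entirely of Latin-lookalike Cyrillic letters
    and the TLD is not an internationalized domain name."""
    if is_idn_tld(tld):
        return False
    return bool(label) and all(ch in LATIN_LOOKALIKE_CYRILLIC for ch in label)


spoof = "\u0430\u0440\u0440\u04cf\u0435"         # resembles "apple"
genuine = "\u044f\u043d\u0434\u0435\u043a\u0441"  # contains non-lookalike letters
print(should_show_punycode(spoof, "com"))
print(should_show_punycode(genuine, "com"))
print(should_show_punycode(spoof, "\u0440\u0444"))
```

Restricting the check to non-internationalized TLDs keeps ordinary Cyrillic domains under Cyrillic TLDs displayed in Unicode.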

Google Chrome's IDN policy

Starting with Google Chrome 51, whether or not to show hostnames in Unicode is determined independently of the language settings (the Accept-Language list). The algorithm is similar to what Firefox does. (See the changelist description that implemented the new policy.)

Google Chrome decides if it should show Unicode or punycode for each domain label (component) of a hostname separately. To decide if a component should be shown in Unicode, Google Chrome uses the following algorithm:
(This is implemented by IDNToUnicodeOneComponent and IsIDNComponentSafe() in components/url_formatter/url_formatter.cc)

Consequences / Examples

[The old content here was completely inaccurate and has been removed.  TODO: add examples of the above]

Behavior of other browsers

IE

IE displays URLs in IDN form if every component contains only characters of one of the languages configured in "Languages" on the "General" tab of "Internet Options", similar to what Google Chrome did before version 51.

Firefox

Firefox uses a script mixing detection algorithm based on the "Moderately Restrictive" profile of Unicode Technical Report 39. Domains of any single script, any single script + Latin, or a small whitelist of other combinations are displayed as Unicode; everything else is Punycode.

Opera

Like older versions of Firefox, Opera has a whitelist of TLDs and shows IDN only for these whitelisted TLDs.

Safari

Safari has a whitelist of scripts that do not contain confusable characters, and only shows the IDN form for whitelisted scripts. The whitelist does not include Cyrillic and Greek (they are confusable with Latin characters), so Safari will always show punycode for Russian and Greek URLs.

