IDN in Google Chrome

Background

Many years ago, domains could only consist of the Latin letters A to Z, digits, and a few other characters. Internationalized Domain Names (IDNs) were created to better support non-Latin alphabets for web users around the globe.

Different characters from different (or even the same!) languages can look very similar. We’ve seen reports of proof-of-concept attacks. These are called homograph attacks. For example, the Latin "a" looks a lot like the Cyrillic "а", so someone could register http://ebаy.com (using Cyrillic "а"), which could be confused for http://ebay.com. This is a limitation of how URLs are displayed in browsers in general, not a specific bug in Chrome.
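To make the example concrete, the short Python sketch below compares the two hostnames character by character. It only illustrates that the strings differ at the code point level even though they render almost identically; the helper name differing_characters is made up for this illustration and is not part of Chrome.

```python
import unicodedata

genuine = "ebay.com"
# Same visible text, but the third character is U+0430 CYRILLIC SMALL LETTER A.
spoofed = "eb\u0430y.com"


def differing_characters(a, b):
    """Return (char, name, char, name) tuples for positions where a and b differ."""
    return [
        (x, unicodedata.name(x), y, unicodedata.name(y))
        for x, y in zip(a, b)
        if x != y
    ]


if __name__ == "__main__":
    print(genuine == spoofed)
    for x, x_name, y, y_name in differing_characters(genuine, spoofed):
        print(f"{x!r} ({x_name}) vs {y!r} ({y_name})")
```

The two strings compare as unequal, and the only difference reported is the Latin letter versus its Cyrillic lookalike.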

In a perfect world, domain registrars would not allow these confusable domain names to be registered. Some TLD registrars do exactly that, mostly by restricting the characters allowed, but many do not. As a result, all browsers try to protect against homograph attacks by displaying punycode (looks like "xn-- ...") instead of the original IDN, according to an IDN policy.

This is a challenging problem space. Chrome has a global user base of billions of people, many of whom are not viewing URLs with Latin letters. We want to prevent confusion, while ensuring that users across languages have a great experience in Chrome. Displaying either punycode or a visible security warning on too wide a set of URLs would hurt web usability for people around the world.

Chrome and other browsers try to balance these needs by implementing IDN policies in a way that allows IDN to be shown for valid domains, but protects against confusable homograph attacks.

Google Safe Browsing continues to help protect over two billion devices every day by showing warnings to users when they attempt to navigate to dangerous or deceptive sites or download dangerous files. Password managers (like Google Smart Lock) continue to remember which domain each saved login belongs to, and won’t automatically fill a password into a domain that does not match exactly.

How IDN works

IDNs were devised to support arbitrary Unicode characters in hostnames in a backward-compatible way. This works by having user agents transform a hostname containing Unicode characters beyond ASCII to one fitting the traditional mold, which can then be sent on to DNS servers. For example, http://öbb.at is transformed to http://xn--bb-eka.at. The transformed form is called the ASCII Compatible Encoding (ACE). It consists of the four-character prefix (xn--) followed by the punycode representation of the Unicode characters.
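The transformation can be sketched with Python's built-in punycode codec. This is a simplified illustration, not what Chrome does: it skips the UTS 46 mapping and validation steps (case folding, normalization, disallowed characters), and the function names are chosen for this example only.

```python
ACE_PREFIX = "xn--"


def label_to_ace(label):
    """Encode one hostname label to its ACE form; ASCII labels pass through."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def label_to_unicode(label):
    """Decode one ACE label back to Unicode; other labels pass through."""
    if label.lower().startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


def host_to_ace(host):
    """Apply the encoding to each dot-separated label independently."""
    return ".".join(label_to_ace(label) for label in host.split("."))


def host_to_unicode(host):
    """Apply the decoding to each dot-separated label independently."""
    return ".".join(label_to_unicode(label) for label in host.split("."))


if __name__ == "__main__":
    print(host_to_ace("\u00f6bb.at"))        # the öbb.at example from the text
    print(host_to_unicode("xn--bb-eka.at"))
```

Each label is handled on its own, so only the labels that contain non-ASCII characters receive the xn-- prefix.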

Google Chrome's IDN policy

Starting with Google Chrome 51, whether or not to show hostnames in Unicode is determined independently of the language settings (the Accept-Language list). Its algorithm is similar to what Firefox does. (See the changelist description that implemented the new policy.)

Google Chrome decides if it should show Unicode or punycode for each domain label (component) of a hostname separately. To decide if a component should be shown in Unicode, Google Chrome uses the following algorithm:
  • Convert each component stored in the ACE to Unicode per UTS 46 transitional processing (ToUnicode).
  • If there is an error in ToUnicode conversion (e.g. contains disallowed characters, starts with a combining mark, or violates BiDi rules), punycode is displayed.
  • If any character is outside the union of the following sets, punycode is displayed.
  • If the component contains either U+0338 or U+2027, punycode is displayed.
  • If the component uses characters drawn from multiple scripts, it is subject to a script mixing check based on the "Moderately Restrictive" profile of UTS 39 (starting with Chrome 63, it will be "Highly Restrictive") with an additional restriction on Latin. If the check fails, the component is shown in punycode.
    • Latin, Cyrillic or Greek characters cannot be mixed with each other
    • Up to Chrome 62, Latin characters in the ASCII range can be mixed with characters in another script as long as it's not Greek nor Cyrillic.
    • Starting with Chrome 63, Latin characters in the ASCII range can be mixed ONLY with Chinese (Han, Bopomofo), Japanese (Kanji, Katakana, Hiragana), or Korean (Hangul, Hanja). 
    • Han (CJK Ideographs) can be mixed with Bopomofo 
    • Han can be mixed with Hiragana and Katakana
    • Han can be mixed with Korean Hangul 
  • If two or more numbering systems (e.g. European digits + Bengali digits) are mixed, punycode is shown.
  • If there are any invisible characters (e.g. a sequence of the same combining mark or a sequence of Kana combining marks), punycode is shown. 
  • Test the label for mixed script confusable per UTS 39. If mixed script confusable is detected, show punycode.
  • If a hostname belongs to a non-IDN TLD (top-level domain) such as 'com', 'net', or 'uk' and all the letters in a given label belong to a set of Cyrillic letters that look like Latin letters (e.g. Cyrillic Small Letter IE - е), show punycode.
  • If the label matches a dangerous pattern, punycode is shown.
  • If the end of a hostname is identical to one of top 10k domains after removing diacritic marks and mapping each character to its spoofing skeleton (e.g. www.googlé.com with 'é' in place of 'e'), punycode is shown.  
  • Otherwise, Unicode is shown. 
(This is implemented by IDNToUnicodeOneComponent and IsIDNComponentSafe() in components/url_formatter/url_formatter.cc and IDNSpoofChecker class in components/url_formatter/idn_spoof_checker.cc )
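A few of the checks above can be approximated in Python as a rough sketch. It is not Chrome's implementation: scripts are guessed from the first word of each Unicode character name (the standard library does not expose the Script property), the Cyrillic lookalike set is a small illustrative subset, the top-domain list is a one-entry placeholder, and the spoofing-skeleton mapping is omitted. All names in the code are hypothetical.

```python
import unicodedata

# Script combinations allowed by this sketch of the Chrome 63+ rules.
ALLOWED_MIXES = [
    {"LATIN", "CJK", "BOPOMOFO"},
    {"LATIN", "CJK", "HIRAGANA", "KATAKANA"},
    {"LATIN", "CJK", "HANGUL"},
]
# Illustrative subset of Cyrillic letters that resemble Latin ones (assumption).
CYRILLIC_LATIN_LOOKALIKES = set(
    "\u0430\u0441\u0435\u043e\u0440\u0445\u0443\u0455\u0456\u0458"
)
# Placeholder for a list of top domains (assumption).
TOP_DOMAINS = {"google.com"}


def rough_script(ch):
    """Approximate a character's script by the first word of its Unicode name."""
    return unicodedata.name(ch, "UNKNOWN").split(" ")[0]


def letters_of(label):
    return [ch for ch in label if unicodedata.category(ch).startswith("L")]


def passes_script_mixing(label):
    scripts = {rough_script(ch) for ch in letters_of(label)}
    if len(scripts) <= 1:
        return True
    # Only Latin letters in the ASCII range may be mixed with another script.
    if any(rough_script(ch) == "LATIN" and not ch.isascii() for ch in label):
        return False
    return any(scripts <= allowed for allowed in ALLOWED_MIXES)


def mixes_digit_systems(label):
    # Each decimal digit is identified by the code point of its system's zero.
    zeros = {
        ord(ch) - unicodedata.decimal(ch)
        for ch in label
        if unicodedata.category(ch) == "Nd"
    }
    return len(zeros) > 1


def is_all_cyrillic_lookalike(label, tld):
    letters = letters_of(label)
    return (
        tld.isascii()  # non-IDN TLD such as "com"
        and bool(letters)
        and all(ch in CYRILLIC_LATIN_LOOKALIKES for ch in letters)
    )


def strip_diacritics(text):
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def resembles_top_domain(hostname):
    stripped = strip_diacritics(hostname)
    return stripped != hostname and any(
        stripped == d or stripped.endswith("." + d) for d in TOP_DOMAINS
    )


def should_show_punycode(label, tld):
    return (
        not passes_script_mixing(label)
        or mixes_digit_systems(label)
        or is_all_cyrillic_lookalike(label, tld)
    )
```

In this sketch, a label such as "еbay" with a Cyrillic "е" fails the script mixing check, an all-lookalike Cyrillic label is flagged under "com" but not under "рф", and "www.googlé.com" is reported as resembling a top domain.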

April 2017 update

Specific instances of IDN homograph attacks have been reported to Chrome, and we continually update our IDN policy to protect against these attacks. One specific instance of this general issue was reported to Chrome security on Jan 20, 2017. It was triaged as a medium-severity bug, and a fix landed in Chrome 58 on March 23. The researcher whose report led to the fix was awarded $2,000 under Chrome's Vulnerability Reward Program.

This fix is an attempt to balance the needs of our international userbase while protecting against confusable homograph attacks. The fix shows punycode for domain names that are made entirely of Latin-lookalike Cyrillic letters when the top-level domain is not an internationalized domain name. The check therefore applies to top-level domains like "com", "net", and "uk", but not to IDN TLDs like рф. We’re working on additional fixes, for example for confusables within one script, such as “l” (lowercase L), which could be confused with “ı” (small dotless i character). We will keep this article updated with our current IDN policy above.

Consequences / Examples

[The old content here was completely inaccurate and has been removed.  TODO: add examples of the above]

Behavior of other browsers

IE

IE displays URLs in IDN form if every component contains only characters of one of the languages configured in "Languages" on the "General" tab of "Internet Options", similar to how Google Chrome worked prior to version 51.

Firefox

Firefox uses a script mixing detection algorithm based on the "Moderately Restrictive" profile of Unicode Technical Standard 39 (UTS 39). Domains of any single script, any single script + Latin, or a small whitelist of other combinations are displayed as Unicode; everything else is shown as punycode.

Opera

Like Firefox, Opera has a whitelist of TLDs and shows IDN only for these whitelisted TLDs.

Safari

Safari has a whitelist of scripts that do not contain confusable characters, and only shows the IDN form for whitelisted scripts. The whitelist does not include Cyrillic and Greek (they are confusable with Latin characters), so Safari will always show punycode for Russian and Greek URLs.

