Saturday, August 14, 2010

Google never removed Oracle from its index

Some folks have been reporting strange behavior from Google after the lawsuit filed by Oracle over Android: it supposedly removed the oracle.com pages, and all the pages that talk about Oracle, from its search index. Even the Wikipedia page on the Delphic oracle.
I initially retweeted the news, and shortly afterwards explained that it was a trick.
It would have been a low blow, really. I don't think it's even possible to remove such a large set of results from all of Google's datacenters in a short time frame.

What really happened
Someone made up this query:
http://www.google.com/search?q=оrаcІе
Initially the result page was empty (Your search - ... - did not match any documents). Then people began tweeting and sharing the query, and Google started showing the pages that linked to it as the only results.


So how did they do it?
At first I thought someone used a capital i (I) to substitute the L of Oracle, but Google is smart and would perform a case-insensitive search in this case:
http://www.google.com/search?q=oracIe

Nevertheless, the difference between capital i and lowercase L is not so visible in Google's font.
But, if you try to paste the link or save the page and go over it with hexedit, you'll notice this:
http://www.google.com/search?q=%D0%BEr%D0%B0c%D0%86%D0%B5
This is a clear sign that someone has inserted non-ASCII characters in the query.
According to the Unicode character table, the UTF-8 bytes decode, in sequence, to:
CYRILLIC SMALL LETTER O
LATIN SMALL LETTER R
CYRILLIC SMALL LETTER A
LATIN SMALL LETTER C
CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I 
CYRILLIC SMALL LETTER IE
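
The lookup above can be reproduced with a few lines of Python instead of a hex editor and a character table. This is only a sketch: the helper name describe is mine, and it assumes the query is UTF-8 percent-encoded exactly as in the URL shown earlier.

```python
from urllib.parse import unquote
import unicodedata

# The percent-encoded query string from the URL above.
encoded = "%D0%BEr%D0%B0c%D0%86%D0%B5"

# unquote() decodes percent-escapes as UTF-8 by default.
query = unquote(encoded)


def describe(text):
    """Return (code point, official Unicode name) for every character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for code_point, name in describe(query):
    print(code_point, name)
```

Only the r and the c come out as LATIN letters; the other four are Cyrillic.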
This combination of characters is very unlikely to be found in actual documents. In fact, at first it did not produce any results. Furthermore, in Google's font of choice, Arial, the difference between these letters and their Latin counterparts (if there is any at all) is again not visible to the naked eye. It makes sense for a font to reuse glyphs that are actually the same in ordinary printed text.
And finally, the forgery replaces the majority of the Latin letters, because replacing only one or two would lead to a "Did you mean: oracle" notice.

Mystery solved
So UTF-8 struck again, and some of us were fooled by an ingenious, well-forged Google query. Technically this is called a homograph attack.
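
A crude way to spot this kind of forgery is to check whether a single word mixes letters from more than one script. The sketch below is my own heuristic, not something Google or any browser is known to use: it takes the first word of each character's Unicode name (LATIN, CYRILLIC, ...) as the script, and the function names are made up for the example.

```python
import unicodedata


def scripts_used(word):
    """Collect the script of every letter, taken from its Unicode name."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER O" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def looks_like_homograph(word):
    """Flag words whose letters come from more than one script."""
    return len(scripts_used(word)) > 1


genuine = "oracle"
forged = "\u043er\u0430c\u0406\u0435"  # the query from this post

print(genuine, scripts_used(genuine), looks_like_homograph(genuine))
print(forged, scripts_used(forged), looks_like_homograph(forged))
```

A word written entirely in Cyrillic look-alikes would pass this check, so it is only a first filter.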
The potential of UTF-8 as a dangerous means of fooling users is great - imagine if non-Latin URLs become a reality. Fortunately, ICANN and the major browser vendors have been working on a solution, but we as web developers should be aware of the problem too.
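
For domain names, one defense is to show the ASCII (Punycode) form of a suspicious label instead of its Unicode rendering. The sketch below uses Python's built-in punycode codec on the forged word from this post; it is an illustration only, since real IDNA processing also normalizes and lowercases the label first, and the variable names are mine.

```python
# The forged word from this post, written with explicit escapes
# so the Cyrillic letters cannot be mistaken for Latin ones.
forged = "\u043er\u0430c\u0406\u0435"

# Raw Punycode keeps the ASCII letters and encodes the rest after a "-".
# Internationalized domain labels carry the "xn--" prefix.
ascii_form = "xn--" + forged.encode("punycode").decode("ascii")
print(ascii_form)

# Decoding the part after the prefix gives the original string back.
round_trip = ascii_form[4:].encode("ascii").decode("punycode")
print(round_trip == forged)
```

Seen in this form, the forged word no longer resembles "oracle" at all.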

14 comments:

Jim Harvie said...

Thanks for explaining what happened in the code. What interests me is why there were only accusations pointed at Google, and no one saying Oracle removed themselves.

Jeff Faria said...

Well-(and promptly!)-done. Not to be a stickler, but did you mean "Mystery" (not "Mistery") solved? Considering the subject of the post, you probably want to go after every single character with a fine tooth comb...

Garrett said...

Thanks for helping to clear this up.

If you feel like a laugh (or a facepalm, as the case may be), you might want to see Gene continue to defend the position that Google really did change their search results. This man just doesn't know when to stop digging.

http://www.ipwatchdog.com/2010/08/13/google-briefly-punishes-oracle-by-removal-from-google-search/

Matt said...

@Jim:

Probably because saying that Oracle *requested* themselves removed from the index is far more outrageous than the alternative.

Giorgio said...

@Garrett: Gene Quinn is arrogant and dishonest - what you could expect from a patent attorney. He probably created the case to make some publicity for himself. But when you type the URLs of the results included in his screenshot, you get to pages (like http://dvlprs.com/link/2483939) that LINK TO THE FORGED query, with the exact Cyrillic characters shown here. This is exactly why only those pages were found; if you try now, you'll find this page too.
@Jim, Matt: Oracle is a valid English word. So Oracle removing itself would not explain why the Wikipedia pages about Greek oracles disappeared too, along with all the web pages containing the word "oracle".
@Mister Snitch: thank you for finding the typo, I always confuse that and "holiday".

Giorgio said...

Since Gene Quinn tried to discredit me and the other debunkers, here's a follow-up:
http://giorgiosironi.blogspot.com/2010/08/public-response-to-gene-quinn-on-google.html

Carey Tews said...

I also think you probably meant that UTF-8 "strikes" or "struck" again, rather than "striked". Sorry :-/

Giorgio said...

Thanks. As you may know I'm not a native English speaker, and I had little time for getting this post out. I spent more of it checking the UTF-8 character table than proofreading. :)

Doc said...

Good catch, Giorgio! Gene obviously never learned when to pack his kit and leave via the back door!

TomW said...

"imagine if non-latin URLs become a reality"
What do you mean? They already have. See http://www.bbc.co.uk/news/10100108

Giorgio said...

I did some research and noticed they are now available, but the support is still not very good. There was a case study a while ago about faking paypal.com with Cyrillic characters, which would still be caught by browsers (you would see something like www.--xn---.com).

Anonymous said...

I would like to exchange links with your site giorgiosironi.blogspot.com
Is this possible?

Giorgio said...

It's considered a bad practice by us web developers. Not interested, thanks.

Anonymous said...

Good evening

Can I link to this post please?
