How A 1200-Year-Old Hacking Technique Can Already Crack Tomorrow's Encrypted Vaults

In the ninth century, Baghdad was not the violent epicentre of a conflict between Western and Eastern ideologies it would become once Bush and Blair sent the troops to Iraq. It was, rather, the centre of the civilized world and home to a renowned scientific hub, Bayt al-Hikma, known in English as the House of Wisdom, a research academy that gathered manuscripts from around the world, primarily from Greece, Persia and India.

In some cases, where the languages were dead or unknown, or the texts were enciphered, Arab scholars working at the centre had to decipher them, developing cryptanalytic techniques in the process. This was, according to the ‘Origins of Cryptology: The Arab Contributions’ in Cryptologia, where the first documented "attacks", as we call them today, were carried out on encrypted texts.

Born in Baghdad, al-Kindi, known as ‘The Philosopher of the Arabs’, was the first director of the Library and Translation Academy at the House of Wisdom. He is credited as author of the oldest surviving document on cryptology, A Manuscript on Deciphering Cryptographic Messages. And in that manuscript was the kernel of an idea that is still taught in crypto classrooms today: the use of letter frequency statistics to unlock ciphers.

The concept is, by today’s standards, simple. One looks at the most used characters in a ciphertext and makes educated guesses as to what that letter would be. In English, for instance, the most commonly used letter is ‘E’. So where the letter 'A' is substituted for 'E' in a basic ciphertext, 'A' will likely appear more times than any other character. Using informed guesses based on the statistics at hand, it’s a case of working out what has been substituted where until the original text reveals itself.

Such frequency attacks sparked the imagination of Microsoft security researcher Seny Kamara during his undergraduate years at Purdue University, Indiana. More than 10 years on, and after some extensive training in cryptoland, he’s found the same techniques are, worryingly, working on "state-of-the-art" cryptographic tools. In a paper released today, Kamara, along with co-researchers Muhammad Naveed from the University of Illinois and Charles Wright from Portland State University, have shown how they can grab revealing medical information from hospital databases when they're protected by some of the most advanced encryption available.

They looked at searchable encrypted databases (EDBs), where operations can be performed on encrypted data without ever needing to use or share a key for decryption. With current systems, anyone who wants to analyse encrypted data has to unlock it first, leaving information open to hackers. But that's not the case with EDB, which was something of a holy grail when it became a possibility in the last decade. With no key available to the cloud providers or the internet companies, it should be close to impossible for them, or hackers sitting on their systems, to get to the information.

There are various proposed methods for doing this, the most attractive one being fully homomorphic encryption, where one can carry out any mathematical task on any piece of encrypted data without ever revealing the hidden text. The problem with FHE, as proposed by IBM's Craig Gentry in 2009, is that it sucks up an extraordinary amount of compute power. It’s yet to be determined whether homomorphic can put to practical use.

There is a via media, however, in the form of CryptDB, created by MIT scientists in 2011. The software keeps it simple, allowing basic functions to operate on SQL databases, the most common form of database on the planet used by scores of websites and business systems.

CryptDB is, as Shrek might put it, like an onion. It combines old forms of encryption, some of which allow certain kinds of calculation to take place on them. By covering the database in different layers of encryption, CryptDB allows the user to peel away at the “onion” to get to the layers of encryption that allow those mathematical operations (the ones that make up search or analysis algorithms, for instance). Once that operation is performed, put the peel back on and the database is thoroughly wrapped up in its multi-layered protective shield again. At all points the data remains scrambled.

Such was the hype around CryptDB, it has informed the design of some major initiatives, including Google's Encrypted BigQuery and SAP's SEEED database services. But, as was admitted to at the time of CryptDB’s birth, when the layers are peeled away and queries made, data is leaked. That’s where al-Kindi’s ideas and Kamara's hacks, carried out from a simple Apple MacBook, come into play.

Kamara and his fellow researchers started by downloading real patient data from 200 U.S. hospitals, taken from the the National Inpatient Sample (NIS) database of the Healthcare Cost and Utilization Project (HCUP), which anyone can acquire by paying out. They then set about determining if their “inference attacks”, where leaked data is combined with publicly available information to infer the plain text, would work against the kinds of encryption used by CryptDB.

In particular, they targeted two types of the “property preserving elements” of CryptoDB. Those properties are necessary to allow for operations on data whilst keeping it scrambled, but subsequently leak vital information. First was the order preserving encryption (OPE) scheme, which encrypts a set of messages in such a way that their ciphertexts reveal the order of the messages. Second was the deterministic encryption (DTE) scheme, which reveals whether scrambled data types are equal or not.

Taking al-Kindi’s 1200-year-old technique and using it on DTE-protected columns in the database, it was possible to look at the scrambled versions of the medical information to see what blobs of encrypted data occurred most often. They then compared the statistics with those of two freely-available auxiliary datasets - the 2004 version of HCUP that included a different collection of information from mostly different hospitals in the 2009 version, and the Texas Inpatient Public Use Data File (PUDF), provided by the Texas Department of State Health Services. The comparison allowed them to make informed guesses at what lay behind the ciphertexts for a vast quantity of information.

The results were alarming. Their guesses uncovered the correct mortality risk and patient death attributes for 100 per cent of patients in at least 99 per cent of the 200 largest hospitals included in the HCUP data. They recovered the disease severity for 100 per cent of the patients for at least 51 per cent of the same hospitals.

They didn’t stop with al-Kindi’s method, though. Another simple attack revealed a lot of sensitive medical information. The “sorting” method takes advantage of the order left by OPE. Kamara explains the basic technique: “If I give you column encrypted using this order preserving stuff and suppose you have a range of one to 10 and all numbers are encrypted and appear once in this column, all that’s needed is to take a column and sort. Then I know the ciphertext that is the smallest corresponds to number one. It’s just kind of there.”

Putting that “really dumb” method into practice on the medical data proved devastatingly effective, recovering the admission month and mortality risk of 100 per cent of patients for at least 90 per cent of the 200 largest hospitals.

The researchers also created two of their own, somewhat more complex attacks. The results were similar, deciphering large portions of the databases. To learn more about those, read the full paper linked here.

Hackers hoping to exploit these techniques would first have to gain access to the server on which the database was held. They would then have to wait for queries to be made on the vault to get to the right layer and successfully expose the information.

Where they want names, the hackers could use de-anonymization attacks, again correlating auxiliary datasets to expose whose details they had uncovered. Kamara, Naveed and Wright didn’t go as far as de-anonymising the owners of the data, however, in part because the rules on accessing the HCUP information barred them from doing that, but also because they believed they'd proven their point: CryptDB and crypto of its ilk is not as secure as had been assumed.

“CryptDB seems like a magic bullet, but what was missed with all the excitement about it was that there are some real trade offs here,” said Kamara. “People who are into crypto know this and feel uneasy about it in some sense, but nobody has explicitly shown how much leakage there was.”

Professor Ross Anderson, a crypto luminary from the University of Cambridge Computer Laboratory, believes the research proves what he always thought: such schemes aren’t worth using at all. "Hopefully nobody will be dumb enough to rely on such schemes to protect real data. The bad news is that dozens of cryptography researchers have spent years of their lives building this stuff."

One of the creators of CryptDB, Raluca Ada Popa, said she did not believe the findings proved CryptDB weak as the flawed pieces of the software were not designed to handle sensitive information. She said OPE encryption should be used for "high-entropy values" where the order does not reveal much and that CryptDB was still a worthy way to protect information. "This is how the CryptDB paper says it should be used. Saying that CryptDB is broken when used for something it was not designed for is straight incorrect - and thousands of security systems would not work in this case," said Popa, who has set up her own startup, Prevail, based on a follow-up to the technology, Mylar, though it doesn't focus on databases, but web applications.

"Prevail and none of CryptDB followers are affected because they either use the order encryption scheme in a correct way (for the right types of data), or do not use it. Everyone I was in touch with that used CryptDB was careful about the use of OPE."

She said that database administrators can specify which data fields are sensitive, and CryptDB ensures those fields are encrypted with strong schemes, not the weaker ones. Admins are warned the vulnerable modes are only suitable for data fields that have high entropy. Database administrators should therefore be careful when using certain kinds of information in CryptDB.

Nevertheless, the race is on to find the right solution for tinkering with encrypted data without needing keys. “We are closer and closer to good homomorphic systems but it is really too early to be sure about a specific system or implementation. The industrial pressure is very strong, too strong, and it is dangerous to predict an exact for a good system in such applications,” says professor Jean-Jacques Quisquater, from the Université Catholique de Louvain Crypto Group.

That Kamara and others are poking holes in modern systems should also comfort people rather than worry them. “It doesn’t mean that the quest for cloud based encryption schemes is pointless. Quite the opposite. But expect them to be tested by people like me who like to break them, and some of us have been a long time and haven’t forgotten those old tricks,” adds Professor Alan Woodward, crypto expert from the UK’s University of Surrey.

The mathematical descendants of al-Kindi will be guiding us, with any luck, to a more secure future.

This story was updated at 1.35pm ET to include the rebuttal from CryptDB co-creator Raluca Ada Popa. Kamara later blogged a rebuttal to that rebuttal.

More From Forbes

How A 1200-Year-Old Hacking Technique Can Already Crack Tomorrow's Encrypted Vaults