Your business might rely on the cloud or on Internet-enabled technologies, such as tablets, mobile phones, and smart devices (part of the Internet of Things), to deliver your services, analyze your data, and inform your business decisions. Ensuring that your organization’s and clients’ data remains private and secure is of the utmost importance. There are a number of tools you can use to protect the data you create and collect. These tools are known as “privacy-enhancing technologies,” or PETs.
Since the release of our last report on PETs, there have been several significant technical developments in the field. In the coming months, we’ll focus some of our Tech-Know blogs on a few of the PETs that have emerged since that report, including:
- federated learning
- differential privacy
- homomorphic encryption
- secure multiparty computation
This post examines federated learning and differential privacy. These PETs are still being refined and developed for widespread use, and very few organizations have implemented them.
Our upcoming blog posts will offer businesses some background information about these new PETs and explain how they might improve data privacy. If you hope to implement these emerging PETs in your business, we recommend following their development at academic and industry events.
Federated Learning (aka Federated Analytics)
Many businesses have decided to automate some of their processes and services, often relying on techniques from the field of artificial intelligence. Machine learning is one of the more popular techniques for analyzing data and making decisions or predictions based on that data [Footnote 1]. Organizations have used machine learning for image and text recognition, among many other applications.
Machine learning models are usually trained using large amounts of data, which is often distributed across several data storage systems or devices. Those systems and devices can be owned by different people or organizations located in different jurisdictions, making the data difficult, if not impossible, to work with directly. Federated learning is a technique that can help your business train a model across its distributed data sources while preserving privacy.
In federated learning, the original data is never shared or moved. Rather, the data stays at its original location (i.e., its source). Each federated learning system analyzes the local data in its own way, but many systems follow steps similar to those described in the seminal paper on the concept. A centralized model or algorithm is created, and duplicate versions of that model are sent out to each distributed data source. Each copy trains on its local data source and sends back only the model updates it generates. Those updates are combined with the updates from the other data sources and integrated into the centralized model. This process repeats at a rate determined by the central system, constantly refining and improving the model.
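To make the process more concrete, here is a minimal sketch of federated averaging, one common way of carrying out these steps, using a toy linear model and simulated local datasets. The function names and data are illustrative only and are not drawn from any particular framework.

```python
# A minimal sketch of federated averaging with a toy linear model.
# The simulated devices and helper names are illustrative, not a real framework.
import numpy as np

rng = np.random.default_rng(0)

# Simulated "devices": each holds (features, labels) data that never leaves it.
local_datasets = [
    (rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(5)
]

def fit_locally(weights, features, labels, lr=0.01, epochs=5):
    """Train a copy of the central model on one device's local data."""
    w = weights.copy()
    for _ in range(epochs):
        gradient = features.T @ (features @ w - labels) / len(labels)
        w -= lr * gradient
    return w  # only the updated weights are sent back, never the raw data

def federated_round(global_weights):
    """One round: send the model out, train locally, average the updates."""
    local_weights = [
        fit_locally(global_weights, features, labels)
        for features, labels in local_datasets
    ]
    # Federated averaging: fold the local updates back into the central model.
    return np.mean(local_weights, axis=0)

global_weights = np.zeros(3)
for _ in range(10):
    global_weights = federated_round(global_weights)
```

In this sketch, each simulated device sees only its own records, and the central system sees only model weights, never the underlying data.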
Federated learning is considered privacy preserving for a few reasons. The original data is never shared, and the aggregated updates are usually difficult to reverse engineer. The exchanges between the local devices and the centralized model also typically involve strong encryption. Although these steps offer significant privacy protections, federated learning is not a perfect solution for every data analysis project.
One of the major challenges preventing widespread adoption of federated learning is the communication cost of regularly transferring model updates back to the centralized model [Footnote 2]. It can be extremely costly to transfer even small amounts of data between millions of devices. Further complicating this challenge is the diversity of data across devices. Not all data can be easily analyzed and synthesized into a central model. For example, a company that sells Android phones might struggle to synthesize data from its old and new devices because the operating system, or the underlying hardware, has changed significantly.
Researchers and organizations are working to address these challenges. For example, to reduce costs, they have lowered the frequency of transfers back to the centralized model. To handle diverse data types, they have created models that do not rely on receiving updates from all devices in all scenarios. Some organizations have also improved their federated learning systems by integrating additional techniques, such as fault tolerance and differential privacy.
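As a rough illustration of two of these mitigations, the variation below builds on the earlier sketch (reusing its illustrative fit_locally helper, local_datasets, and rng): it performs more local training between transfers, and it only waits to hear from a random subset of devices each round.

```python
# A variation on the earlier sketch, illustrating two mitigations:
# fewer (but larger) transfers, and tolerance of missing devices.
# Reuses fit_locally, local_datasets, and rng from the previous example.
def federated_round_cheaper(global_weights, participation=0.6):
    # Only a random subset of devices reports back in any given round,
    # so the model does not depend on hearing from every device.
    chosen = [data for data in local_datasets if rng.random() < participation]
    if not chosen:
        return global_weights  # no devices reported this round
    # More local epochs per round means fewer rounds, and fewer transfers, overall.
    local_weights = [
        fit_locally(global_weights, features, labels, epochs=20)
        for features, labels in chosen
    ]
    return np.mean(local_weights, axis=0)
```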
Differential Privacy
Let’s start with an example. Imagine your business wants to tailor some of its services based on the behaviours of its customers. You have collected some personal information about their shopping and service usage habits. You want to use this information to predict general trends in future visits and purchases so that you can hire the right number of staff, stock the appropriate goods, and use targeted advertisements to promote them. You don’t want your staff to be able to predict who specifically will be visiting, when they will visit, or what they will purchase. One privacy-preserving way to address this challenge is differential privacy.
Differential privacy offers organizations a formal method for preserving a measurable amount of privacy. It is a concept that emerged within the field of cryptography, and many of its terms and methods are rooted in advanced mathematics. At its core, differential privacy involves adding a mathematically calibrated amount of “noise,” or random fake data, to a dataset or to the results of queries on it. The noise is added using an equation that makes it very difficult, if not impossible, to tell whether any particular person or record was in the original dataset. Even outliers in the dataset are mathematically accounted for and obscured. This makes the dataset resistant to a number of privacy threats, including data linkage and reconstruction attacks [Footnote 3].
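As a rough sketch of the core idea, the example below adds Laplace-distributed noise to a simple counting query. The scale of the noise is set by the query’s sensitivity (how much one person can change the true answer) and a privacy parameter, conventionally called epsilon; the toy dataset and parameter values are illustrative only.

```python
# A minimal sketch of the Laplace mechanism, one standard way to achieve
# differential privacy for numeric queries. Values are illustrative only.
import numpy as np

rng = np.random.default_rng(42)

ages = np.array([34, 45, 29, 61, 38, 52, 47])  # toy dataset

def private_count_over_40(data, epsilon=0.5):
    """Return a noisy count of records with age over 40.

    Adding or removing one person changes a count by at most 1, so the
    query's sensitivity is 1 and the Laplace noise scale is sensitivity / epsilon.
    """
    true_count = int(np.sum(data > 40))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

print(private_count_over_40(ages))  # close to, but not exactly, the true count of 4
```

Smaller values of epsilon add more noise and give a stronger privacy guarantee, at the cost of less accurate results.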
Implementing a differential privacy mechanism well is not a simple process, nor is it a solution to all data privacy challenges. Every dataset is unique; the amount and type of noise that can be added to each dataset depend on what that dataset includes, as well as what its analysis is meant to reveal. For example, in 2014, researchers showed how harmful it could be to apply differential privacy to a dataset that was meant to inform medical treatments. For patients to maintain their privacy in the dataset, they would actually “be exposed to increased risk of stroke, bleeding events, and mortality” [Footnote 4]. Other studies have revealed similar results, and have inspired researchers to keep refining their approaches to implementing differential privacy.
Integrating differential privacy into a federated learning system can introduce additional levels of complexity. As a result, very few businesses have implemented both approaches together successfully. The few use cases that exist are mostly found at major technology companies, including Google, Apple, and Microsoft. However, both PETs offer very promising privacy-preserving approaches for businesses, and more examples will surely emerge from a diverse range of businesses in the coming years.
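As a highly simplified illustration of one way the two techniques can be combined, the sketch below clips each simulated device’s model update and adds random noise before the updates are averaged, so that no single device’s contribution stands out. The helper names, parameter values, and noise levels are arbitrary choices for illustration; a real deployment would require careful calibration and formal privacy accounting.

```python
# A highly simplified sketch of combining federated learning with
# differential privacy: each device's update is clipped and perturbed with
# noise before aggregation. Parameters are illustrative only.
import numpy as np

rng = np.random.default_rng(7)

def privatize_update(update, clip_norm=1.0, noise_scale=0.1):
    """Bound one device's influence, then mask it with random noise."""
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    return clipped + rng.normal(scale=noise_scale, size=update.shape)

# Example: noisy aggregation of three simulated device updates.
updates = [rng.normal(size=3) for _ in range(3)]
global_update = np.mean([privatize_update(u) for u in updates], axis=0)
```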
Key takeaways
- Federated learning can help businesses undertake privacy-preserving data analysis across multiple devices and data sources.
- Differential privacy is one of many tools that can significantly reduce the likelihood of data linkage and reconstruction attacks.
- There has been a lot of theoretical development in federated learning and differential privacy, but there are few use cases in businesses due to the complexity of these PETs. Watch for more use cases to emerge in the coming decade.