Investigation into the collection and use of de-identified mobility data in the course of the COVID-19 pandemic
Complaints under the Privacy Act
May 29, 2023
Description
The investigation examined whether mobility data collected and used by PHAC in its response to the pandemic contains personal information as defined under Section 3 of the Privacy Act (the Act). Specifically, it examined whether PHAC and its data providers have implemented de-identification techniques and safeguards against re-identification that are sufficient to reduce the risk of an individual being identified below the "serious possibility" threshold.
Takeaways
- Data de-identification and aggregation are two privacy-enhancing techniques that are useful for privacy protection if they reduce the risk of re-identification of individuals below acceptable thresholds.
- De-identification alone is generally insufficient to ensure data anonymization. It must be accompanied by additional safeguards against re-identification.
- Data aggregation must involve a sufficient number of individuals to reasonably reduce the risk of singling out individuals.
- Transparency about the purposes for which personal information is collected and used is crucial to maintaining trust between individuals and organizations that collect and use their data.
Report of Findings
Overview
The Office of the Privacy Commissioner of Canada received 12 complaints under the Privacy Act (the “Act”) against the Public Health Agency of Canada (“PHAC”) and Health Canada (“HC”) regarding the collection and use of Canadians’ mobility data, which comprises geolocation data collected over time and other associated information.
The complainants allege that PHAC secretly collected data on 33 million mobile devices during the COVID-19 pandemic, and that according to a request for proposal, published in December 2021, it planned to continue to collect Canadians’ mobility data over the ensuing five years.
PHAC reported that it has effectively relied on mobility data of just under 14 million Canadians to gain insightful information and meaningful analysis on the movement of populations in Canada, which has assisted in tracking the spread of the COVID-19 virus and for planning, assessing and adjusting the government’s response to the pandemic.
PHAC claimed that it relied only on de-identified and aggregated data, that it never collected or used any personally identifiable information, and that the Privacy Act therefore does not apply.
Through our investigation, as a necessary analytical condition, we first examined whether mobility data collected and used by PHAC in its response to the pandemic contains personal information as defined under Section 3 of the Act. More specifically, we assessed whether there was a serious possibility, in the circumstances, that an individual could be identified using the mobility data, procured by PHAC, alone or in combination with other available information. Our investigation did not assess whether or not PHAC’s data providers collected and used location data in compliance with privacy laws.
Following analyses of the representations received and review of information on this topic and the concept of identification, we have concluded that the combination of the de-identification measures and the safeguards against re-identification implemented by PHAC and its data providers has reduced the risk of identifying individuals below the “serious possibility” threshold. We therefore consider the complaints in this matter to be not well-founded.
Notwithstanding our investigation’s conclusion that PHAC did not contravene the Privacy Act with regard to the collection and use of mobility data in the course of the COVID-19 pandemic, we have made a number of recommendations to PHAC in particular, with instructive relevance to all organizations that produce, use or procure de-identified information in the course of their activities. We are encouraged that PHAC has accepted our recommendations.
Background
- On December 31, 2019, a novel coronavirus, COVID-19, was reported in Wuhan in the Chinese province of Hubei. COVID-19 is a very contagious virus that may cause severe and fatal respiratory illness. On March 11, 2020, the World Health Organization (“WHO”) declared COVID-19 a global pandemic.
- According to health experts, the COVID-19 virus spreads mainly via inhalation of infectious respiratory droplets, known as aerosols, that are released by infected people who are in proximity. PHAC officials determined that gaining “mobility insights” on population movements, interactions and gatherings would assist in understanding how the virus may spread and proliferate.
- Mobility insights are also useful in planning, monitoring, and refining/assessing the effectiveness of certain key measures that are implemented by health authorities to combat the pandemic (stay at home, quarantine, lockdowns, etc.). For example, the number of trips between cities is an indicator of how connected these cities are and therefore the likelihood that an outbreak in one will spread to the other.
- Mobility insights collected by PHAC were derived from data that PHAC indicated was de-identified and aggregated information about the movements of individuals over time (mobility data). This information was deduced from location data that is continuously produced by devices/equipment that are often in close physical proximity to their users. The most common examples of these devices/equipment are cell phones and other devices with data plans.
- PHAC, like certain of its international counterparts, collected mobility-data based insights in its response to the COVID-19 pandemic. To that end, it collected insights aggregated from two types of data streams:
- Mobile cell-tower/operator data, which comprises records created each time a mobile phone pings an operator’s cell tower. PHAC procured this type of data from the telecom operator TELUS and leveraged the data analytics expertise of the Communications Research Centre Canada (“CRC”), which processed TELUS data to generate mobility reports providing aggregated data and statistics to PHAC’s scientists for analysis.
- Mobile geolocation data, which is information about the geographic location transmitted by a mobile application installed on a mobile device, using the device’s built-in GPS capabilities. PHAC acquired this category of data from a private company named BlueDot, which in turn procured it from two data providers: Pelmorex and Veraset. During the first months of the pandemic, Health Canada set up the initial contract with BlueDot to assist PHAC, which undertook the actual collection of data. The contract was subsequently transferred to PHAC and Health Canada was not involved in any other manner in this project.
- TELUS informed the OPC of TELUS’ “Data for Good” program and CRC informed our office about PHAC’s intent to use data in a de-identified and aggregated form in Canada’s response to the pandemic. OPC offered the services of its Business Advisory and Government Advisory directorates to review the technical means used to de-identify data and provide advice. CRC and TELUS did not follow up on OPC’s offer.
- PHAC’s Privacy Management Division (“PMD”) also conducted a privacy analysis on September 22, 2020, in order to identify any potential privacy risks associated with the use of TELUS mobility data and the publication of the derived insights. It concluded that the data that PHAC would be receiving from TELUS is not about identifiable individuals because of the de-identification and aggregation processes it would undergo and that therefore the Privacy Act does not apply. It subsequently entered into a contract with TELUS and BlueDot to procure de-identified aggregated mobility data that it used to derive insights on the movement of Canadians.
- On December 17, 2021, a few months after the contract with TELUS had expired, PHAC published a Request For ProposalFootnote 1 (“RFP”) to acquire mobile operator data in order to continue deriving mobility insights from this category of data. Following the publication of the RFPFootnote 2 on a public procurement website, media articles raised privacy concerns relating to the use of mobility data, and the OPC subsequently received 12 complaints.
- On January 13, 2022, the Standing Committee on Access to Information, Privacy and Ethics (“ETHI”) adopted a motion to undertake a study on the collection and use of mobility data by the Government of Canada. The ETHI Committee issued a corresponding report in May 2022 and recommended therein, amongst other things, that government agencies be transparent when they harness the potential of big data in their activities and that federal privacy laws be modernized to adequately address the use of de-identified and aggregated data.
Analysis
Issue: PHAC did not collect personal information as defined under the Act
- Section 3 of the Privacy Act defines personal information as information “about” an “identifiable” individual.
- In Gordon v. Canada (Health), 2008 FC 258, the Federal Court decided that information will be about an identifiable individual where there is a serious possibility that an individual could be identified through the use of that information, alone or in combination with other available information.Footnote 3
- Therefore, we considered the degree to which data collected by PHAC can be linked to identifiable individuals either directly or indirectly, through inference and/or in association with other data sources. For the reasons described below, we concluded that due to the de-identification of the data, and the suite of protections used in this case to reduce the risk of re-identification, there is no serious possibility that the information collected by PHAC, and CRC on its behalf, can identify any individual.
- To that end, we examined each data stream separately and for each one, we analyzed both: (i) data that CRC, on PHAC’s behalf, was able to access in the data providers’ systems and (ii) data CRC and PHAC were able to download and store in their own systems. This segmentation was required because the degree of de-identification and safeguards against re-identification are different in each data stream and at each phase.
- In our investigative analysis, we relied on: (i) representations received from PHAC, (ii) information and guidance on anonymization and (iii) the work of the ETHI Standing Committee. We also sought and received representations from TELUS, BlueDot and CRC as relevant third parties to the investigation.
De-identification and residual risk of re-identification
- De-identification may encompass a residual risk of re-identification of individuals that depends on many factors. Such factors are either: (i) intrinsic to the data itself and the de-identification techniques; or (ii) external and depend on sub-factors including:
- the availability of additional data that can be cross-checked with the de-identified data;
- who has access to the dataset and for what purposes, their motivation to re-identify data and their knowledge that a specific individual’s information is included in the dataset; and
- the expertise and the resources used in the re-identification process.
- In fact, multiple studies and research have succeeded in re-identifying data sets that were publicly released in de-identified format. This includes the Netflix studyFootnote 4 and the AOL data releaseFootnote 5.
- In the Netflix study, researchers demonstrated that an adversary who has access to discrete information points about an individual can easily identify that individual’s records in the Netflix movie prize database, which contained subscribers’ movie ratings. In the AOL search release example, it was demonstrated that simply removing users’ identifiers may not be sufficient to properly anonymize data.
- Moreover, risk of re-identification is not a static consideration and may increase over time with the improvement of re-identification techniques and the availability of additional resources and data that may be linked to the de-identified dataset.
- For the foregoing reasons, it remains a complex exercise to definitively quantify the risk of re-identification. Several examples in the literature propose calculation methods which are not deterministic but rather rely on probabilistic calculations, based on assumptions about several factors and the type of re-identification cyber-attack.
The Privacy Act does not include specific provisions on de-identified or anonymized data
- The Privacy Act does not expressly address de-identified or anonymized data. Its provisions all apply equally to the collection, use and disclosure, by federal institutions subject to the Act, of any information that meets the test of being “personal information”. Therefore, the first issue, in this case, is to determine whether PHAC (including CRC acting on its behalf), collected any information that meets this test for being ‘personal information’ described above. If it does, then it would be necessary to consider whether the collection and any subsequent use or disclosure were compliant with the provisions of the Privacy Act. If it does not meet the test for ‘personal information’ then the Privacy Act does not apply.
- As a note, our office has called for legislative change to bring a more nuanced approach to the handling and governance of de-identified information to respect both the potentially privacy protective nature of using de-identified data, and the inherent risks of re-identification. Given the importance and instructive value, we have explored the related issues in more depth in the “Other” section of this report.
Does access to data at TELUS’ system constitute ‘collection’ under the Act?
- As a preliminary matter we considered what constitutes ‘collection’ for the purposes of the Act in this case. As noted in the background section, CRC, on behalf of PHAC, had access to view certain individual-level data on TELUS’ system, but could only download aggregated data.
- It is clear that when a copy of information is saved in the institution’s information management systems (i.e. in emails in an employee’s inbox, its document management system, in hard copy, etc.) the information has been ‘collected’. Similarly, if information is saved on another platform, but in an account under the control of an employee of an institution acting in a professional capacity, it is clearly ‘collected’ by that institution. In a situation where an institution’s employee, in the course of work, sees (or hears) information but does not retain a physical or virtual copy, it may be less clear if it is ‘collected’ by the institution. In the present case, PHAC and CRC officials accessed data from TELUS, but also recorded aggregated information resulting from their queries.
- Even if information is seen but not collected, there may nevertheless be a subsequent ‘use’ of that information for the purposes of the Privacy Act. This can happen where information is simply reviewed, for instance where an individual’s ID is visually inspected to ensure they are 18 before allowing access to a space or, in this case, where individual-level data is reviewed to design appropriate parameters for downloadable aggregated information. In other words, in our view, the information within TELUS’ systems reviewed by CRC on PHAC’s behalf for the purpose of designing aggregated downloads is not automatically out of scope of the Privacy Act.
- In order to determine whether this information, and the information subsequently downloaded in aggregate form, constitutes personal information we considered research, guidance and standards of practice with respect to de-identification and other protections against the identification of individuals. These sources included the Treasury Board Secretariat’s Privacy Implementation Notice 2020-03, other industry standards in the health field, and research specific to mobility data.
Determining Adequate Protection Against Risk of Re-identification
- For the purpose of this report, “De-identification” means a process whereby any personal identifiers, such as names, phone numbers or device IDs in a mobility data context, are stripped from the data about a specific individual (often replaced with a randomly assigned identifier).
- In our view, based on current research, for mobility data, de-identification alone is insufficient to render data ‘non-personal’ and outside the scope of the Privacy Act. ‘Mobility Data’ represents data that reveals the geographic locations where a person or device has been at multiple points in time. Depending on the circumstances, such data can be used to infer information about a device user, such as their home or place of work. This could in turn be compared to other readily available information to link the de-identified data to an identifiable individual and then glean information about where else they have been. To illustrate how identifying such data can be, a studyFootnote 6 conducted on 1.5 million users of a mobile phone operator in a western country concluded that four spatio-temporal points are enough to uniquely identify 95% of the individuals because mobility traces are highly unique and consistent.
- We would therefore expect that for an organization to avoid collecting personal information in a mobility data context, it would need to ensure sufficient additional protections against re-identification are in place in addition to de-identification measures.
- There are a range of different types of protections against re-identification, and new techniques may also be developed in future. Two common types of protections, both of which were used in this case, are: (i) contractual and physical protections on access and use, and (ii) aggregation.
- Contractual and physical protections on access and use reduce the risk of re-identification of de-identified data by limiting the number of individuals/organizations who could have the opportunity to attempt re-identification, and by limiting the likelihood those individuals will attempt re-identification.
- Aggregation reduces the risk of re-identification by combining data about multiple individuals together so that any one individual’s own data is obscured.
- Generally speaking, in order for information to be considered ‘non-personal’ and therefore outside of the scope of the Privacy Act the following conditions would need to be met:
- where an institution has access to properly de-identified mobility data there would need to be robust contractual and physical protections in place to: (a) limit access to that data to a limited number of individuals, and (b) limit the purposes for which individuals are permitted to use the data (i.e. to not allow re-identification attempts). For (b) this should include, at a minimum, a contractual requirement to not attempt re-identification, and safeguard controls such as audit capability and monitoring of data access/use to guard against unauthorized re-identification attempts.
- where an institution has access to aggregated mobility data, (a) a sufficient number of individuals would be aggregated in each ‘cell’ to reasonably reduce the risk of extrapolating the data of a single individual (in accordance with current statistical guidelines or expert advice), and (b) access to the aggregated information would be controlled as above for access to de-identified data.
- Regarding recommended cell sizes, the Treasury Board Secretariat’s Privacy Implementation Notice 2020-03 (Protecting privacy when releasing information about a small number of individuals), states “there is no minimum cell size that is appropriate for all data releases, and Treasury Board of Canada Secretariat policies do not specify a mandatory minimum cell size. However, the following best practices may serve as a starting point for a case-by-case analysis: A minimum cell size of 10 is often cited as a best practice for public data releases of data that is less sensitive, while a minimum cell size of 20 is cited for more sensitive data”.
- The next sections will illustrate how PHAC and its data providers applied the foregoing safeguards in each data stream they relied on to derive mobility insights.
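The uniqueness finding cited earlier in this section (a handful of spatio-temporal points singling out one trace) can be illustrated with a minimal sketch. The data, the `(location label, hour)` encoding and the function names below are hypothetical and are not drawn from the investigation; the sketch only shows why removing direct identifiers does not, by itself, prevent a trace from being singled out.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical encoding of one spatio-temporal point: (coarse location label, hour of day).
Point = Tuple[str, int]


def matching_devices(traces: Dict[str, Set[Point]], known_points: Iterable[Point]) -> Set[str]:
    """Return the pseudonymous IDs whose trace contains every known point."""
    known = set(known_points)
    return {device for device, trace in traces.items() if known <= trace}


def is_singled_out(traces: Dict[str, Set[Point]], known_points: Iterable[Point]) -> bool:
    """True when the known points match exactly one trace in the dataset."""
    return len(matching_devices(traces, known_points)) == 1


# Toy de-identified dataset: random-looking IDs, no names or phone numbers.
traces = {
    "a1": {("cell_12", 8), ("cell_40", 9), ("cell_40", 17), ("cell_12", 19)},
    "b2": {("cell_12", 8), ("cell_33", 9), ("cell_33", 17), ("cell_12", 19)},
    "c3": {("cell_07", 8), ("cell_40", 9), ("cell_40", 18), ("cell_07", 20)},
}

# One known point (e.g. a home area in the morning) is shared by two devices.
one_point = [("cell_12", 8)]
# Adding a second point (e.g. a workplace area) leaves a single candidate.
two_points = [("cell_12", 8), ("cell_40", 9)]

print(matching_devices(traces, one_point))
print(is_singled_out(traces, two_points))
```

In this toy example an observer who knows only two approximate locations and times can isolate one pseudonymous trace, which is why the conditions above pair de-identification with access controls or aggregation.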
Data stream 1: Mobile cell-tower/operator data
- Interaction between cell phones and telecom cell towers is critical to the functioning of the telecom network and serving mobile users. Indeed, all cell phones regularly generate and transmit data to a nearby telecom cell tower when they connect to it or use the operator’s mobile services for sending or receiving calls, texting, browsing the internet, etc. The frequency of interaction depends on phone usage. Normally, a phone sends a message when it gets close to a new cell tower, when its connection status changes and when it needs to establish connections to access mobile services. Most phones will also send limited messages to the cell tower when they are stationary and idle.
- Consequently, telecom operators can collect and record information about their clients’ location and movement (i.e. SIM ID, timestamp and location of the tower serving the client) because cell towers have precise latitude and longitude coordinates that make it possible to infer the location and movement of the cell phones they are serving and interacting with.
- TELUS stated its appreciation for the potential value of mobility data, including in combatting a global health crisis such as the COVID-19 pandemic. In 2015, it began developing a data analytics platform, called TELUS Insights, designed to generate actionable intelligence from de-identified client mobility data.
- Following the outbreak of the COVID-19 pandemic, TELUS launched, in April 2020, a program named “Data for Good” that operates on the TELUS Privacy-by-Design certified insights platform. PHAC chose to leverage TELUS’ program, signing a contract with TELUS on February 10, 2021. It later signed a Memorandum of Understanding (“MOU”) with CRC on July 05, 2021, to capitalize on CRC’s expertise to conduct mobility analysis using location data. Both the contract and the MOU expired on October 8, 2021.
- TELUS and CRC advised our Office that data within the TELUS Insights Platform does not indicate precisely where an individual device may be located since it is derived using the location of the cell towers rather than the geographic location of the mobile devices. A movement is inferred to begin when a mobile device switches from one cell tower area to another and to end when the mobile device remains connected to the same tower for longer than 30 seconds. Thus, depending on cell tower coverage in the area, location data is estimated within an area whose diameter ranges from 100 m at the smallest to 70 km (in rural areas). That said, it is sometimes possible to determine a device’s location with more precision because the cell towers that serve data are different from those that serve voice. Since users often access both services, they can be located more precisely when in range of two or more cell towers.
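The movement rule described above (a movement begins at a tower switch and ends once the device stays on one tower for more than 30 seconds) can be sketched as follows. This is one possible reading of that description, not TELUS' actual algorithm; the record layout `(tower_id, first_seen, last_seen)`, the use of seconds as the time unit and the function name are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical record: (tower_id, first_seen_s, last_seen_s), ordered by time.
Connection = Tuple[str, int, int]


def infer_movements(connections: Sequence[Connection],
                    dwell_threshold_s: int = 30) -> List[List[str]]:
    """Group consecutive tower connections into movements.

    A movement starts when the device switches towers and ends at the first
    tower on which it stays longer than dwell_threshold_s. A movement still
    in progress when the records run out is not reported.
    """
    movements: List[List[str]] = []
    current: Optional[List[str]] = None
    previous_tower: Optional[str] = None

    for tower, first_seen, last_seen in connections:
        if previous_tower is not None and tower != previous_tower:
            if current is None:
                current = [previous_tower]  # origin of the new movement
            current.append(tower)
        if current is not None and last_seen - first_seen > dwell_threshold_s:
            movements.append(current)  # device has settled on this tower
            current = None
        previous_tower = tower

    return movements


connections = [
    ("A", 0, 600),      # long stay, no movement yet
    ("B", 610, 620),    # switch: movement starts at A
    ("C", 625, 640),    # still moving
    ("D", 650, 900),    # stays more than 30 s: movement ends at D
    ("D", 900, 1200),   # same tower, nothing new
    ("E", 1210, 1500),  # new movement from D, ends immediately at E
]
print(infer_movements(connections))
```

Because positions are tower identifiers rather than device coordinates, each step in a movement is only as precise as the tower's coverage area.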
Prior De-identification
- All direct identifiers (MSISDNFootnote 7, IMEIFootnote 8, IMSIFootnote 9) in TELUS Insights are removed or transformed by TELUS before third parties, including CRC on behalf of PHAC, access the data, so that the data within the Insights platform cannot be linked back to an individual. More specifically, each identifier is hashed more than once using SHA-256, a cryptographic hash function that transforms input data ("message"), regardless of its size, into a fixed-length output (256 bits), known as the "hash," "digest" or "digital fingerprint." It is considered a one-way function because it is computationally infeasible to recover the original data from the digest.
- Data accessible on TELUS’ platform relates to 9 million devices and consists of: hashed device identifiers, timestamps, device country and area code, network cell identifier (identifies the sector of the cellular tower that the device was connected to), time the device was first and last seen on the cell tower, the duration of connection to this cell tower, and the approximate geographic coordinates of the cell tower.
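The hashing step described in this section can be sketched with Python's standard `hashlib` module. The number of rounds, the text encoding and the absence of a salt or key are assumptions; the representations summarized above say only that each identifier is hashed more than once with SHA-256. The function name and the sample identifier are hypothetical.

```python
import hashlib


def hash_identifier(identifier: str, rounds: int = 2) -> str:
    """Apply SHA-256 repeatedly to an identifier and return a hex digest.

    rounds=2 is an illustrative default, not a documented parameter.
    """
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    digest = identifier.encode("utf-8")
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


# Hypothetical 15-digit identifier, not a real device or subscriber number.
pseudonym = hash_identifier("000000000000000")
print(len(pseudonym))  # 64 hex characters, i.e. 256 bits
```

The same input always yields the same pseudonym, which is what lets records from one device be grouped without exposing the identifier. It also means that an unsalted hash of an identifier drawn from a small, structured space could in principle be matched by hashing candidate values, which is consistent with the report's position that de-identification needs to be accompanied by further safeguards.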
Access control & data minimization
- TELUS advised that the de-identified data within the Insights Platform is robustly safeguarded with physical, administrative, and technical controls, including Virtual Private Cloud Service controls to ensure access is restricted to authorized users, regular vulnerability scans, and logging and monitoring of activity on the Insights Platform. TELUS also explained that the ingested de-identified mobility data in the Insights Platform is temporally spatialized by at least 15 minutes and that it reviews each request for access by a ‘Data for Good’ client, including the use case, to determine what “data views” will be made available to the authorized data scientists.
- In PHAC’s case, five CRC employees and two PHAC employees were authorized to access device-level data at the TELUS insights platform, using a two-factor authentication system.
The enclave modelFootnote 10
- In the enclave model, data may be kept in some kind of segregated enclave that restricts the export of the original data, and instead accepts queries from qualified researchers, runs the queries on the de-identified data, and responds with resultsFootnote 11.
- TELUS uses a similar model – information at the device-level, even though it is de-identified, cannot be copied outside the TELUS platform. Rather, based on mobility insights reports that PHAC is interested in, CRC runs the corresponding queries on the device-level de-identified data. The aggregated results generated by the queries are then stored in a table, hosted at TELUS’ cloud.
- Access and use of TELUS insights platform is guided and supervised. Thus, prior to approving the transfer of the aggregated data in the generated table to CRC’s cloud and subsequently to PHAC’s cloud, TELUS reviews the generated data to ensure that required safeguard levels against re-identification are met.
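The enclave workflow described in this section (queries run inside the platform, results held in a table, and a review before anything is exported) can be sketched as below. The class, method names and record layout are hypothetical; the default threshold of 20 devices is taken from the aggregation figures PHAC reported, and dropping undersized cells (rather than rejecting the whole export) is an assumption about how a review might behave.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical device-level record: (hashed_device_id, area_code, date).
Record = Tuple[str, str, str]
Cell = Tuple[str, str]  # (area_code, date)


class Enclave:
    """Holds device-level records and only releases reviewed aggregates."""

    def __init__(self, records: Iterable[Record], min_devices: int = 20) -> None:
        self._records = list(records)          # never returned to the caller
        self._min_devices = min_devices
        self._pending: Dict[Cell, int] = {}    # aggregated results awaiting review

    def run_count_query(self) -> None:
        """Count distinct devices per (area, date) and store the result internally."""
        devices = defaultdict(set)
        for device, area, day in self._records:
            devices[(area, day)].add(device)
        self._pending = {cell: len(ids) for cell, ids in devices.items()}

    def review_and_export(self) -> Dict[Cell, int]:
        """Release only the cells that meet the minimum device count."""
        return {cell: n for cell, n in self._pending.items() if n >= self._min_devices}


records = [(f"h{i}", "K1A", "2021-03-01") for i in range(25)]
records += [(f"h{i}", "K2B", "2021-03-01") for i in range(3)]
records.append(("h0", "K1A", "2021-03-01"))  # repeat sighting of one device

enclave = Enclave(records)
enclave.run_count_query()
print(enclave.review_and_export())
```

The design point is that the caller never receives device-level rows: the only output path is the reviewed table of counts.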
Aggregation
- TELUS’ review includes a determination that only aggregated counts are included. For example, data is sorted by Forward Sortation Area (FSA versus full postal code) or Census Canada Dissemination Area, and only aggregated counts relating to more than 20 devices are included. PHAC’s representations to our Office confirmed that results are aggregated geographically (at least at the census sub-division), temporally (at least over 24 hours) and for a minimum of 20 devices.
- Aggregated data imported by PHAC/CRC is subsequently used to calculate different mobility indicators that reflect population movement over a 24-hour period for different geographic areas (province/territory, health region, census metropolitan area, or census sub-division). Examples of mobility indicators include aggregated percentiles of maximum distance travelled, maximum distance travelled far from home, total distance travelled, percentage of time at or within fixed distance of home and percentages of devices that travelled between different geographic areas.
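One of the indicators listed above, percentiles of maximum distance travelled, can be sketched with the standard `statistics` module. The input layout (one maximum distance in kilometres per device for a given area and day), the choice of quartiles, the function name and the skipping of areas with fewer than 20 devices are assumptions for illustration; the source does not describe how CRC or PHAC computed these values.

```python
import statistics
from typing import Dict, List


def distance_percentiles(max_distance_km_by_area: Dict[str, List[float]],
                         min_devices: int = 20) -> Dict[str, Dict[str, float]]:
    """Return 25th/50th/75th percentiles per area, skipping small areas."""
    indicators: Dict[str, Dict[str, float]] = {}
    for area, distances in max_distance_km_by_area.items():
        if len(distances) < min_devices:
            continue  # too few devices to report an indicator for this area
        q1, median, q3 = statistics.quantiles(distances, n=4)
        indicators[area] = {"p25": q1, "p50": median, "p75": q3}
    return indicators


sample = {
    "region_a": [float(km) for km in range(1, 21)],  # 20 devices
    "region_b": [2.0, 3.5, 8.0, 1.2, 40.0],          # 5 devices, not reported
}
result = distance_percentiles(sample)
print(sorted(result))
```

Reporting a distribution summary per area, rather than per-device distances, keeps the indicator at the population level.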
Safeguards against re-identification in TELUS’ data stream
- TELUS, CRC and PHAC included multiple safeguards in this data stream to reduce the risk of re-identification of mobility data used by PHAC during the COVID-19 pandemic, namely:
- Prior De-identification: TELUS hashes all device IDs prior to populating the insights platform with its customers’ mobility data. Therefore, information that CRC accessed, on behalf of PHAC, in the TELUS insights platform did not contain any direct identifiers.
- Aggregation: CRC is restricted to importing only aggregated data from the TELUS insights platform, and no device-level data, even though it is de-identified, can be copied outside the TELUS platform. More specifically, data that CRC imported on behalf of PHAC was aggregated spatially, at least at the census sub-division level; temporally, over a period of at least 24 hours; and into cells that each contain at least 20 devices. This minimum cell size is consistent with TBS’ guidance on the subject. Further, it is above the minimum threshold (11) that was determined in the expert reportFootnote 12 in a caseFootnote 13 before the court that dealt with risk of re-identification.
- Release model: TELUS’ Data for Good is a non-public data release which limits the availability of the data set to a select number of identified recipients. As a condition of receiving the data, recipients must agree to terms and conditions regarding the privacy and security of the data set out in a data sharing agreement. The terms in this case required users at PHAC and CRC who had access to the data to not attempt to re-identify it.
Additionally, access to the TELUS insights platform is guided and supervised. TELUS reviewed requests to extract/download derived insights to ensure that privacy rules were met and to further mitigate any risk of re-identification prior to authorizing their export. Finally, only authorized and select employees from CRC and PHAC can access and use mobility data provided by TELUS, and TELUS established monitoring controls to review access activity and downloading logs to ensure compliance with policies and protocols.
- Contractual clauses: Both PHAC and TELUS included in the contract governing their commercial relationship binding provisions to use only de-identified information. Specifically, in the contract’s statement of work that PHAC addressed to TELUS, TELUS was required to provide PHAC with access to de-identified information that ensures data anonymization in order to generate aggregate indicators and insights on the mobility of individuals in Canada.
Similarly, TELUS’ data sharing terms stipulate that PHAC must not use the derived data for any purpose other than the one specified in the contract and that it may not correlate, associate, link or combine any of the derived data with other data sources, except as set out in an exhibit consented to in writing by TELUS. Further, these terms require PHAC to ensure that none of its representatives attempts to relate the derived data to any identifiable individual. TELUS confirmed that it allowed PHAC to correlate downloaded aggregated mobility data only with census data at the health region and FSA levels.
Data stream 2: Mobile geolocation data
- In addition to cell-tower-based data, PHAC relied on other sources to derive insights on the mobility of Canadians. Specifically, it used geolocation data, which is generally collected using mobile apps, GPS tracking Software Development Kits (SDKs), Bluetooth, geotagged social media posts, etc.
- Mobile applications offer a wide range of private sector services (weather, fitness, emails, maps, etc.). Certain apps collect the phone’s geolocation data in real time using the phone’s built-in GPS system – which is then available to app operator and could be disclosed by the operator to third parties.
- PHAC signed a contract with BlueDot to procure geolocation data between March 26, 2020 and March 20, 2022. BlueDot, in turn, acquired this category of data from two providers: (1) Pelmorex Corp, a Canadian weather information and media company that collects location data via its free mobile apps such as the Weather Network app, and (2) Veraset LLC, an American company that sells raw and processed movement data that it acquires from third parties, other aggregators, SDKs and direct app relationships.
- Under Canadian private sector privacy law, both the collection of phone geolocation information and any subsequent disclosures of personally identifiable data to third parties will generally require the user’s valid consent. Pelmorex directly collects geolocation data (timestamp of the collection, geolocation coordinates, and pseudonymized user ID) from its app users. According to BlueDot, Pelmorex obtains users’ consent for collection of this data (which it uses to deliver the app’s weather-related services) and only discloses aggregated mobility data to BlueDot, not any individual personal information.
Prior de-identification
- According to its privacy policy, Veraset collects information from third parties it describes as ‘trusted’. That said, it does not define or elaborate on how the ‘trusted’ third parties obtain users’ consent ahead of the collection.
- Veraset does provide individual-level data to BlueDot. However, it claims to first strip the individual-level data of direct personal identifiers, and its contract with BlueDot includes a requirement that BlueDot not attempt to re-identify the individuals.
Aggregation
- Once received by BlueDot, certain pre-aggregated data is sent directly to PHAC without further processing, whereas other data, mainly information at the device level, is aggregated by BlueDot spatially at the census tract or census sub-division (“CSD”) geographic unit and/or temporally over a 24-hour period before sending it to PHAC.
- Examples of the mobility insights that PHAC receives from BlueDot to estimate contact rates among Canadians include: (i) the number of devices at certain points of interest (parks, hospitals, retail stores, etc.), (ii) aggregated statistics on distance travelled around the primary location of devices, (iii) movement between geographic regions within Canada and (iv) traffic originating from the USA and other countries.
- BlueDot advised that its agreement with PHAC stipulates that data provided to the health agency will be aggregated, but without specifying any minimum cell size for the aggregated data (the minimum number of devices that should be in each indicator). It explained, nevertheless, that it decided in April 2020 to follow Statistics Canada’s precedent of excluding data based on fewer than 5 measurements and that on January 17, 2022, PHAC asked that indicators based on fewer than 20 devices be excluded.
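The minimum cell size rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the data layout, the point-of-interest labels, and the function name `suppress_small_cells` are hypothetical and are not drawn from BlueDot's or PHAC's systems. Only the two thresholds (5 and 20) come from the report.

```python
from typing import Dict, Optional


def suppress_small_cells(
    counts: Dict[str, int], min_cell_size: int
) -> Dict[str, Optional[int]]:
    """Withhold (set to None) every cell based on fewer than min_cell_size devices."""
    return {
        cell: (n if n >= min_cell_size else None)
        for cell, n in counts.items()
    }


# Hypothetical daily device counts per point of interest (illustrative values only).
daily_counts = {"park_A": 3, "hospital_B": 12, "store_C": 48}

# Threshold of 5, as BlueDot reported applying from April 2020.
early = suppress_small_cells(daily_counts, min_cell_size=5)

# Threshold of 20, as PHAC requested in January 2022.
later = suppress_small_cells(daily_counts, min_cell_size=20)

print(early)
print(later)
```

Under the higher threshold, the hypothetical hospital cell (12 devices) is withheld in addition to the park cell, which shows how raising the minimum cell size trades some analytic detail for a lower risk of singling out individuals.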
Access control
- Aggregated data from the BlueDot data stream is either uploaded directly to PHAC’s cloud or included in written reports that are sent by email to PHAC. Authorized individuals from PHAC can also access similar information via a “Mobility Dashboard” that was developed by BlueDot.
Safeguards against re-identification in BlueDot’s data stream
- According to BlueDot’s representations and data samples, the information flow from this data stream comprises several layers that increase the level of data de-identification and therefore reduce the associated risk of re-identification, namely:
- Prior de-identification: BlueDot did not receive any information that can be linked directly to identifiable individuals, as the data either (i) is pre-aggregated or (ii) includes only a hashed device ID when it is at the device level.
- Aggregation: PHAC receives from BlueDot only aggregated data and never data at the device level. More specifically, data provided to PHAC consisted of: (i) the number of devices that visited certain points of interest, (ii) mobility indicators (percentile of maximum and total travelled distance, percentage of time at and away from home) within census units, health regions and provinces, (iii) the number of devices that travelled between two Canadian health regions and (iv) the number of devices that arrived in Canada from a global epidemic hotspot. All of these statistics were calculated over a 24-hour period and for cells whose minimum size was 5 devices and subsequently 20 devices, as of January 18, 2022. As explained above, this minimum cell size is compliant with TBS’ guidance on the subject matter and above the minimum threshold (11) that was determined in the expert report in a case before the courts that dealt with risk of re-identification.
- Contractual clauses: The statement of work that BlueDot received from PHAC requires BlueDot to “anonymously” analyze data at its disposal to help the health agency address specific questions related to social distancing, self-isolation at home, movements to/from healthcare institutions across the country, in addition to other analytics related to dispersion of COVID-19. BlueDot has also specified that it is contractually forbidden from attempting to re-identify data it receives at the device level from Veraset.
- Release model: PHAC was not given access to raw data on BlueDot’s system. Instead, BlueDot prepared and uploaded aggregated datasets to PHAC’s cloud and provided it with a weekly/biweekly report. BlueDot’s mobility dashboard that PHAC could access contains the same mobility indicators that were shared with PHAC either via email or through an upload to the cloud. Also, only approved PHAC employees can access the dataset uploaded to its cloud system.
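The first two layers listed above (a hashed device ID, followed by aggregation over a 24-hour period and a geographic unit) can be sketched in Python as follows. This is a minimal illustration and not the providers' actual pipeline: the record layout, the region codes, and the function names are hypothetical, and the sketch applies SHA-256 directly to a made-up device ID purely for demonstration, since the report does not describe how the hash is computed. The toy example uses a minimum cell size of 2 so that the small sample produces output. The report's thresholds were 5 and later 20.

```python
import hashlib
from collections import defaultdict
from datetime import datetime


def hash_device_id(device_id: str) -> str:
    """Replace a device ID with its SHA-256 hex digest (pseudonymization only)."""
    return hashlib.sha256(device_id.encode("utf-8")).hexdigest()


def daily_device_counts(pings, min_cell_size=20):
    """Count distinct hashed devices per (region, day); drop cells below the minimum.

    pings: iterable of (device_id, iso_timestamp, region_code) tuples (assumed layout).
    """
    devices = defaultdict(set)
    for device_id, timestamp, region in pings:
        day = datetime.fromisoformat(timestamp).date().isoformat()
        devices[(region, day)].add(hash_device_id(device_id))
    # Only the counts leave this function; hashed IDs are never returned.
    return {
        cell: len(ids)
        for cell, ids in devices.items()
        if len(ids) >= min_cell_size
    }


# Hypothetical device-level records (illustrative values only).
pings = [
    ("device-1", "2022-02-01T08:15:00", "HR-01"),
    ("device-1", "2022-02-01T17:40:00", "HR-01"),  # same device, same day: counted once
    ("device-2", "2022-02-01T09:05:00", "HR-01"),
    ("device-3", "2022-02-01T12:30:00", "HR-02"),  # single-device cell: suppressed
]

counts = daily_device_counts(pings, min_cell_size=2)
print(counts)
```

In this sketch the hashed IDs still allow records from the same device to be linked, so the hashing step alone amounts to pseudonymization. It is the aggregation and suppression step that removes device-level rows from the output.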
Safeguards in both data streams that reduce the risk of identifying individuals below the “serious possibility” threshold
- De-identification is widely recognized, across the globe, including by our office, to be a potential tool to assist in protecting individuals’ privacy while realizing the benefits associated with big data. This was of particular relevance during the current pandemic, given the benefits of mobility insights to understand and curb the spread of COVID-19.
- To that end, in the case under this investigation, TELUS and Veraset relied on a robust algorithm (SHA-256) to hash direct identifiers and de-identify data at the device level. Furthermore, access to the granular device-level information by PHAC/CRC is either not allowed (BlueDot data stream) or supervised and controlled (TELUS data stream).
- In both data streams, information under PHAC’s and/or CRC’s control has been aggregated according to several criteria, temporally and/or spatially, with minimum cell sizes between 5 and 20, a minimum size that is accepted and recommended by many experts in Canada. This aggregation increased the degree of anonymity of the datasets under PHAC’s or CRC’s control.
- Further, only select employees from PHAC/CRC were authorized to access mobility data either at the device-level, on a ‘view only’ basis, or in aggregated form.
- Finally, PHAC and its epidemiologists in this specific project were looking for macro trends in population movement and were not engaged in contact tracing. Therefore, PHAC had no motivation to re-identify data, and this is expressly reflected in its contracts and the RFP related to this matter.
- As explained in detail in paragraph 32, in this case, in order for information to be considered ‘non-personal’ for the purposes of collection of properly de-identified data by a federal institution under the Privacy Act, the following conditions must be met: (i) robust contractual and physical protections are in place and (ii) acceptable data aggregation levels and access controls exist.
- In light of the above, the accepted practices in this field and the measures taken against the risk of PHAC identifying any individual, we find that there is not a serious possibility that the information PHAC collected could be used to identify any individual. Therefore, the complaints are not well-founded.
Other
- OPC is generally supportive of the use of anonymization and de-identification as a privacy enhancing technique to glean insights from data while reducing privacy risks. Indeed, in April 2020 our Office issued a Framework for the Government of Canada to Assess Privacy-Impactful Initiatives in Response to COVID-19. This framework encouraged organizations to use de-identified and aggregated data whenever possible, while cautioning about the existence of a residual risk of re-identification.
- However, de-identification is an active research field and no method has yet been found that eliminates re-identification risks, and the resulting privacy risks to individuals. This highlights the importance of privacy laws that are modernized to expressly deal with de-identified personal information.
- The Privacy Act treats personal information as a binary concept, and therefore does not fully capture the nuances associated with de-identified personal information. As a technique in information management, de-identification is often motivated by a desire to strike the right balance between, on one hand, preserving the utility of data derived from personal information, and reducing privacy risks associated with that data on the other.
- Consequently, the more we increase the utility of de-identified data, the more we add information deduced from personal information and the more we move away from anonymity and vice versa. Needless to say, removing all personal data elements from a dataset would render that data useless. However, even de-identified information presents at least some risk to privacy. Generally speaking, the more data elements that remain, the greater the risk to privacy. This nuance is not captured by the binary approach in the current legal framework, especially since the assessment of the risk associated with re-identification is not static, but rather can evolve over time.
International benchmarking
- Our benchmarking against privacy legislation in other jurisdictions regarding the use of de-identified information highlighted a certain heterogeneity with respect to the definition of de-identified information and to whether it is considered personal information subject to the provisions of national laws or, on the contrary, anonymized data that is outside the scope of the law. The benchmark exercise, although not exhaustive, did not identify any country that chose to include in its law provisions that are customized to, and specific to, this category of information.
- Regarding the use of mobility data to combat the pandemic, the benchmark illustrated that most countries have integrated this measure into their response to the COVID-19 pandemic.
- In the European Union, recital 26 of the GDPR states that pseudonymized data (Footnote 14) should be considered as personal data whereas anonymized data should not. Consequently, the principles enshrined in the European regulation apply to pseudonymized data, which can be assimilated to de-identified data at the device level mentioned above, and not to anonymized data. On another note, the opinion 05/2014 on Anonymization Techniques adopted on April 10, 2014 by the Article 29 Data Protection Working Party, the predecessor of the European Data Protection Board (“EDPB”), states that “Anonymisation constitutes a further processing of personal data; as such, it must satisfy the requirement of compatibility by having regard to the legal grounds and circumstances of the further processing”.
- In the United States, the Health Insurance Portability and Accountability Act (“HIPAA”) includes two methods: ‘expert determination’ and ‘safe harbor’, that covered entities under HIPAA can use to de-identify protected health information. Once de-identified, said information is no longer protected under HIPAA and can be used freely to glean valuable insights about population health. Similarly, the California Consumer Privacy Act (“CCPA”) does not restrict businesses from collecting, using, retaining, selling, or disclosing consumer information that has been de-identified or aggregated as it does not consider these categories of data as personal data. Conversely, pseudonymized consumer information is considered under the statute as personal data.
- In the United Kingdom, the Information Commissioner’s Office (“ICO”) is of the view that “Generalised location data trend analysis is helping to tackle the coronavirus crisis. Where this data is properly anonymised and aggregated, it does not fall under data protection law because no individual is identified”.
- In Australia, the Office of the Australian Information Commissioner (“OAIC”) published guidance on the subject matter that considers de-identification as a privacy-enhancing tool and advised that information that has undergone an appropriate and robust de-identification process is not personal information and is therefore not subject to the Australian Privacy Act. Said guidance added that: “whether information is personal or de-identified will depend on the context. Information will be de-identified where the risk of an individual being re-identified in the data is very low in the relevant release context (or data access environment). Put another way, information will be de-identified where there is no reasonable likelihood of re-identification occurring”.
- In its “ADVISORY GUIDELINES ON THE PERSONAL DATA PROTECTION ACT FOR SELECTED TOPICS”, the Personal Data Protection Commission (“PDPC”) in Singapore considers data that has been properly anonymized as no longer being personal data and therefore not subject to the provisions of the Singaporean Personal Data Protection Act. The PDPC also clarifies that it does not assimilate de-identified data to anonymized information.
- On another note, the use of mobility data to analyze human mobility dynamics to inform decision making in many topics, such as transportation and disease surveillance, is not novel. De-identified and aggregated mobility data have been used in the past to fight Ebola in Africa, Zika in Brazil, and swine flu in Mexico.
- Regarding the current pandemic, health authorities, researchers and NGOs in many foreign regions worldwide leveraged this type of information to shape their policies aiming to curb the spread of COVID-19. Examples include Argentina, Austria, Brazil, Chile, China, Colombia, Curaçao, Democratic Republic of Congo, Ecuador, European Union, Germany, Ghana, Greece, Haiti, Italy, Japan, New York, Spain, Sweden, Poland, Portugal, United Kingdom, etc.
Transparency
- Concerns were raised by Canadians regarding the lack of transparency in PHAC’s collection and use of mobility data. This raises the important question as to whether PHAC was sufficiently transparent to the public on its use of mobility data, notwithstanding the fact that it was de-identified and aggregated.
- The Privacy Act does not impose any transparency obligations on PHAC as it did not collect personal information as defined under the Act. Nevertheless, we understand Health Canada's position to be that the government took concrete actions to inform Canadians about PHAC’s use of mobility data. Two specific examples were cited: (1) the Prime Minister’s news release (Footnote 15) on March 23, 2020 that announced support to BlueDot and (2) the COVIDTrends webpage, which included an indicator about Canadians’ mobility change over a week. The COVIDTrends page was accessed by at least 1.7 million visitors.
- These measures were not sufficient to adequately inform Canadians about how their mobility data was being used. In fact, both measures required Canadians to proactively consult specific websites to inform themselves of the program(s). Further, the news release regarding the support to BlueDot did not mention that BlueDot would use and rely on Canadians’ mobility data to produce its disease analytics. In the future, we recommend that more efficient, targeted, and accessible communication channels be used in order to achieve better transparency. Examples of such channels include press releases or press conferences properly relayed via the media that explain how personal information would be used in government programs.
- OPC’s Commissioner at the time, Mr. Daniel Therrien, noted that “most Canadians whose data was used did not know their data was used” and opined that “both the government and the private sector, could have done more to inform users that their data was used for these purposes.” He added in a statement following the release of the ETHI report that “greater flexibility to use personal information for the public good, including public health purposes, should come with greater transparency and accountability”.
- By way of comparison, other projects that also used de-identified and aggregated data in support of COVID-19 research opted for different channels. For example, the partnership between TELUS and the Natural Sciences and Engineering Research Council of Canada (NSERC) was announced through a news release on TELUS’ website.
Conclusion
- We take this opportunity to highlight and remind public and private organizations that use and/or procure de-identified data of several principles that should be considered. Most of the following are consistent with recommendations of ETHI’s study report.
- There is a broad consensus that all de-identification techniques entail a residual risk to privacy that may increase over time, and that anonymization is an evolving area. With this in mind, organizations that produce or collect de-identified data should continually assess the appropriateness of any de-identification techniques and related safeguards against identification. They should employ de-identification as a privacy enhancing technique, not only as a means of achieving compliance with legislative requirements.
- Although we did not examine the issue in this Report, it is incumbent upon data-holders to ensure that any third parties from which they source data have themselves collected data and personal information in a manner that respects privacy law obligations of collection and informed consent. Organizations that procure de-identified data are therefore accountable and should conduct their due diligence beyond ensuring that data is anonymized and falls outside the scope of privacy laws.
- Public organizations, such as PHAC, should be transparent with regard to the use of de-identified information and make every effort to publicize such uses and to inform concerned individuals of their purposes, the sources of data, and the safeguards implemented to protect it and maintain its anonymity.
- We also recommend that the federal privacy laws be amended to include a clear legal framework that defines the different types of de-identified data and that specifies the rules that should govern the production, retention, use, disclosure, and collection of each type.
- Any new legal framework should consider the specificity of de-identified data and draw lessons from the limits of the current binary system, which can be either suboptimal with regard to privacy protection or a hurdle to realizing the public benefits of big data. The framework can, for instance, be tailored to the degree of anonymity of data and balance the benefits of using de-identified data against the residual risk to privacy. It would also ideally include clear rules on how to quantify residual risks to privacy and define acceptable thresholds to compare them with.
- PHAC accepted our recommendations and shared the following measures that it took or plans to take, in order to implement OPC’s recommendations:
- PHAC established an internal working group, that includes the PMD, to improve transparency with the public about what PHAC does with the data it collects;
- PMD has developed an internal tool to assess the risks of de-identified data and to determine whether it is within the threshold of “serious possibility” of re-identification. PMD provides advice to PHAC and HC programs and makes recommendations towards mitigating these risks and ensuring compliance with the Privacy Act and Treasury Board privacy policies;
- PHAC is working with partners to improve transparency and public trust, and to enhance privacy protections. This work includes publishing sample and open data sets and algorithms used to de-identify, anonymize and aggregate data, using and developing synthetic datasets, ensuring the capacity to test privacy guarantees; and enhancing in-house technical skill to efficiently execute this work.
- PHAC and HC are ensuring due diligence when entering into agreements with third parties from whom data is sourced by including appropriate privacy clauses and measures to safeguard both personal information and de-identified information, where appropriate.