Magazine #1 | Summer 2022

Automatic Oblivion

Responsible Data Management in Machine Learning

Having a say over one’s own data and being able to exercise one’s right to be forgotten is essential, especially given the increasingly pervasive use of AI systems in everyday life. As a recent case reported in the Netherlands demonstrates, the consequences can be devastating if this right isn’t enforced in time. A couple lost their baby through a miscarriage and were subsequently continuously exposed to baby product ads online, a traumatizing experience. They contacted the e-commerce company that was advertising the baby products, but were told that the company had no technical means of disabling the online recommendations.

Such online recommendations are based on automated decision-making (ADM) systems: AI applications increasingly used to automate decisions that have an impact on people’s lives. These systems are based on sample data from which prediction models are derived with Machine Learning techniques. Such ADM systems are used in domains as varied as credit and lending, medical diagnosis and hiring.

The risks and opportunities linked to their widespread use are garnering much attention from policy-makers, scientists and the media. An important subset of these risks arises from technical challenges in managing the data stored and processed by ADM systems. As the example above shows, privacy protection and the ability to have a say over one’s personal data are essential.

Technical bias introduced by computational systems

Much of the current discussion about algorithmic fairness of automated decisions focuses on so-called pre-existing bias, which has its origins in society. In ADM systems, this type of bias often manifests itself in the sample data for prediction models. Technical bias, by contrast, arises in the data due to the technical system’s own operations. The risks of introducing technical bias in data-driven ADM abound, but unlike pre-existing bias, it can be addressed with a technical fix.

Technical bias is a consequence of the “lab conditions” under which experts usually design the algorithmic component of an ADM system. They work with a fixed and clean dataset of training examples and try different algorithmic approaches to find a prediction model that works well on this dataset. However, once the full ADM system is developed and applied in the real world under “production conditions,” the way in which data is produced for the prediction model changes.

ADM systems typically process data from multiple sources (oftentimes other technical systems) that continuously produce new data. The prediction model must regularly be adjusted to the new data. The system must combine the data from all sources and prepare it in a form that the prediction model understands. In this data preparation process, technical bias can be introduced by programming errors or a misrepresentation of groups in the generated data. It can even be introduced through seemingly innocuous operations, for instance if demographic data is filtered by zip code during data preparation, as the place of residence potentially correlates with sensitive demographic factors such as age, income level or ethnicity. As a result, the prediction model might produce less reliable predictions for groups of individuals that are not well represented in the data due to filtering operations.
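The zip code example can be made concrete with a small sketch. The records, group labels and the `filter_by_zip_prefix` helper below are invented for illustration and are not taken from any real ADM system; the point is only that a filter on a seemingly neutral attribute can shift how well each group is represented.

```python
from collections import Counter

# Hypothetical toy records: (zip_code, age_group). Invented for illustration.
records = [
    ("1011", "under_30"), ("1011", "under_30"), ("1011", "over_60"),
    ("1012", "under_30"), ("1012", "under_30"),
    ("9901", "over_60"), ("9901", "over_60"), ("9901", "over_60"),
]


def group_shares(rows):
    """Return the share of each age group among the given rows."""
    counts = Counter(age_group for _, age_group in rows)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


def filter_by_zip_prefix(rows, prefix):
    """A seemingly innocuous preparation step: keep one zip code region."""
    return [row for row in rows if row[0].startswith(prefix)]


shares_before = group_shares(records)
shares_after = group_shares(filter_by_zip_prefix(records, "10"))

print("before filtering:", shares_before)
print("after filtering: ", shares_after)
```

In this toy data, both age groups are equally represented before the filter, while the older group makes up only a fifth of the rows afterwards. Comparing group shares before and after each preparation step is one simple way to notice such shifts.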

Enforcing the “right to be forgotten” in a timely manner

An orthogonal dimension related to data management in ADM systems is their compliance with laws guaranteeing privacy and digital self-determination rights. A prominent example is the “right to be forgotten” (Article 17 of the General Data Protection Regulation, GDPR). It requires companies and institutions that process personal data to delete user data upon request: “The data subject shall have the right to […] the erasure of personal data concerning him or her without undue delay […] where the data subject withdraws consent.” The GDPR does not specify how soon data must be erased after a deletion request, yet it states the “obligation to erase personal data without undue delay” using “appropriate and effective measures.” Currently, data erasure seems to be a rather tedious and lengthy process in practice. The erasure from active systems in the Google cloud, for example, can take up to two months.

This is why we need ADM systems with “unlearning” capabilities enabling them to delete users’ interaction data on request and thereby to adjust their predictions on demand. But this poses many challenges with respect to the algorithmic and computational efficiency of “updating” existing prediction models.
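To illustrate what “unlearning” can mean in the simplest case, here is a minimal sketch of a toy recommender built on item co-occurrence counts. It is an assumption chosen for illustration, not the method of any particular system: the `CooccurrenceModel` class and its `learn`, `forget` and `recommend` methods are hypothetical names. Because the model is a sum of per-user contributions, forgetting a user amounts to subtracting that user's contribution, and the result matches a model retrained from scratch without that user.

```python
from collections import defaultdict
from itertools import combinations


class CooccurrenceModel:
    """Toy recommender: counts how often two items share a user history."""

    def __init__(self):
        self.histories = {}             # user -> set of items
        self.counts = defaultdict(int)  # (item_a, item_b) -> co-occurrence count

    def _update(self, items, delta):
        # Add or subtract one user's contribution to every item pair.
        for pair in combinations(sorted(items), 2):
            self.counts[pair] += delta
            if self.counts[pair] == 0:
                del self.counts[pair]

    def learn(self, user, items):
        if user in self.histories:
            raise ValueError(f"user {user!r} already learned")
        self.histories[user] = set(items)
        self._update(self.histories[user], +1)

    def forget(self, user):
        # Remove the raw data AND its effect on the model.
        items = self.histories.pop(user, None)
        if items is not None:
            self._update(items, -1)

    def recommend(self, item):
        # Items that co-occur with `item`, most frequent first.
        scores = {}
        for (a, b), count in self.counts.items():
            if a == item:
                scores[b] = count
            elif b == item:
                scores[a] = count
        return sorted(scores, key=lambda other: (-scores[other], other))


model = CooccurrenceModel()
model.learn("user_1", ["crib", "stroller"])
model.learn("user_2", ["crib", "lamp"])
print(model.recommend("crib"))  # both co-occurring items are suggested

model.forget("user_1")
print(model.recommend("crib"))  # user_1's purchases no longer influence this
```

Most prediction models are not simple sums like this one, which is where the algorithmic and computational challenges mentioned above come in: removing one user's influence efficiently, without retraining on all remaining data, is much harder for them.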

Academia and industry have just started addressing these challenges (in part motivated by pending regulations). But even if the technological foundations for responsible data management in ADM systems were established, we would still need best-practice solutions. To find them, it is crucial to have access to real-world ADM systems, which we don’t presently have because most of these systems are proprietary and run by private enterprises.

Background

self-determination and data protection

Dimension:

social sustainability

Criteria:

self-determination and data protection

Indicator:

ensuring informational self-determination

People must be enabled to maintain autonomy over their personal data. This can work through simple visualizations, notifications, and consent or revocation mechanisms. Users should be informed as soon as AI systems use or collect personal data. They should have a say in the use of their data and not be restricted in their self-determined actions by mechanisms that influence behavior, such as nudging or dark patterns.

DR. SEBASTIAN SCHELTER

Professor at the University of Amsterdam

… is an Assistant Professor at the University of Amsterdam, conducting research at the intersection of data management and Machine Learning. In his work, he addresses data-related problems that occur in the real-world application of Machine Learning. Examples are the automation of data quality validation, the inspection of Machine Learning pipelines via code instrumentation or the design of Machine Learning applications that can efficiently forget data. Schelter makes most of the research code that he writes available under an open source license and is an elected member of the Apache Software Foundation.