Magazine #1 | Summer 2022
Less Is More: Why a Data Diet Can Benefit Artificial Intelligence
An interview with Dr. Michaela Regneri
Computational linguist Michaela Regneri has investigated AI e-commerce applications for the Hamburg-based online shopping platform OTTO. She has developed recommendation algorithms that are strategically relevant for the shopping platform while limiting the need for resources in AI development. Regneri is an advocate of data minimalism, and she wants to establish it as an efficiency strategy for Artificial Intelligence in the data economy. Her aim is to use minimal data while at the same time preventing a drop in data quality.
“Complex data processing requires more time, staff, and expensive systems.”
Minimalism refers to the amount of data we process with AI. We started with the question: what is the actual value of the data we process? Since we aim for efficiency, that question is interesting both economically and ecologically. We want good, reliable AI that does what it is supposed to do and is as efficient and effective as possible. So the less data a procedure needs to achieve the same result, the better.
In our year-long project at OTTO, we first examined how we could determine which of the various data points, or how much of the entire data collection, was actually useful for the AI. Our aim was to determine the use-based value of the data. We conducted practical experiments and tested the individual algorithms to ascertain whether feeding them with data really helped the system. Which data lead to better results for the algorithm, and which make them worse? We saw that, beyond a certain point, saturation occurred and adding new data became unproductive.
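That saturation point can be made visible with a simple learning-curve experiment: train the same model on growing fractions of the data and watch where the validation score stops improving. The following is a minimal sketch using synthetic data and a scikit-learn classifier as stand-ins; the actual OTTO systems and metrics are not public.

```python
# Learning-curve sketch: where does adding data stop helping?
# Dataset, model, and metric are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for fraction in (0.1, 0.25, 0.5, 0.75, 1.0):
    n = int(len(X_train) * fraction)
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    score = accuracy_score(y_val, model.predict(X_val))
    print(f"{fraction:>4.0%} of the data ({n:6d} rows): accuracy = {score:.4f}")
# Once the score plateaus, additional data is unproductive for this
# model: that plateau marks the limit of the data's use-based value.
```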
We tried different approaches. The simplest method is to leave out data and see how well the algorithm works, first with half the data, then with one-third or two-thirds of it. This is a sensitivity analysis, and in principle it works like a reverse allergy test. With allergies, we look for harmful influences; here, we identify data that have a positive influence. If a data point is omitted and the system improves, then it was probably a bad data point. If it is omitted and nothing happens, then we know it is unnecessary for the system to function. And if the system gets worse, we have discovered a valuable data point.
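In code, this reverse allergy test amounts to a leave-one-out ablation: remove one slice of the training data at a time, retrain, and compare against the full-data baseline. The sketch below classifies each slice as harmful, unnecessary, or valuable; the block count and the 0.001 tolerance are illustrative assumptions, not values from the interview.

```python
# "Reverse allergy test": retrain with one block of training data left
# out and see whether the validation score rises, falls, or holds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, n_features=20, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1)

N_BLOCKS, TOL = 10, 0.001  # illustrative choices

def score_without(block_idx: int) -> float:
    """Validation accuracy after dropping one block of training data."""
    mask = np.ones(len(X_train), dtype=bool)
    mask[np.array_split(np.arange(len(X_train)), N_BLOCKS)[block_idx]] = False
    model = LogisticRegression(max_iter=1000).fit(X_train[mask], y_train[mask])
    return accuracy_score(y_val, model.predict(X_val))

baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
baseline_score = accuracy_score(y_val, baseline.predict(X_val))

for i in range(N_BLOCKS):
    delta = score_without(i) - baseline_score
    if delta > TOL:        # system improves without the block: harmful data
        verdict = "harmful"
    elif delta < -TOL:     # system degrades without it: valuable data
        verdict = "valuable"
    else:                  # nothing happens: unnecessary data
        verdict = "unnecessary"
    print(f"block {i}: delta = {delta:+.4f} -> {verdict}")
```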
There is a widespread tendency to throw in all the data you can get – particularly in industrial environments, where there is a lot of data, sometimes far more than you need. Sure, you might achieve an incremental improvement of half a percent, but to make the algorithm really efficient, you need to remove the harmful data. For example, we have had the problem of bots being let loose on a webpage. They click wildly around the shop, causing specific items to suddenly look very popular. This is bad for the system, of course. Sometimes we unintentionally harm the AI ourselves, for example with a marketing promotion like a deal of the day, which results in lots of people clicking on a cheap article when they are not necessarily interested in the item itself, just the discount. So we are not dealing with their natural buying behavior, but with their reaction to something we did, and that makes it difficult to draw any useful conclusions from the data.
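As an illustration of removing such harmful data points before training, here is a hypothetical pre-filter over a click log. The column names (session_id, clicks_per_minute, is_promotion_window) and the bot threshold are invented for this sketch; production bot detection and promotion handling are considerably more involved.

```python
# Hypothetical pre-filter: drop bot traffic and promotion-driven clicks
# before the click log reaches the recommendation algorithm.
import pandas as pd

BOT_CLICKS_PER_MINUTE = 60  # assumed threshold for "wild" clicking

def filter_click_log(events: pd.DataFrame) -> pd.DataFrame:
    # Sessions that click implausibly fast are treated as bots.
    peak_rate = events.groupby("session_id")["clicks_per_minute"].max()
    bot_sessions = peak_rate[peak_rate > BOT_CLICKS_PER_MINUTE].index
    keep = ~events["session_id"].isin(bot_sessions)
    # Deal-of-the-day clicks reflect the discount, not genuine interest
    # in the item, so they are excluded as well (boolean flag assumed).
    keep &= ~events["is_promotion_window"]
    return events[keep]
```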
“We select those data that actually contain important information, and make sure that the group is not larger than necessary for the desired result.”
Data minimalism reduces costs. Complex data processing requires more time, more staff, and more expensive systems. Data protection and compliance add to the cost. The less data that is managed in the cloud, the lower the cloud costs. The model can be trained faster, so innovation loops become shorter and the innovation potential grows. The algorithms can be tested faster, and ultimately, eliminating harmful data points leads to better performance. In an economic context, this always means more profit.
Reducing data means improving privacy and cybersecurity. Every sensitive data point that gets moved around unnecessarily is an extra security risk. Moreover, data minimalism can help curb discriminatory patterns in data sets.
People often argue that you need a lot of data for data sets to be balanced and non-discriminatory. There are, however, a few older techniques, based on making data sets smaller, that avoid discrimination. These come from medicine and medical statistics. For medicine, the most interesting group is the smallest one: the people who are actually sick. In order to do justice to this group in medical applications, it is possible to make the larger group of healthy people smaller. This can be understood from a data minimalist perspective: We select those data that actually contain important information, and make sure that the group is not larger than necessary for the desired result. In this way, minorities or smaller groups are given more weight, in relative terms.
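The medical-statistics idea translates directly into majority-class undersampling: shrink the large group until it no longer drowns out the small one. A minimal sketch, with labels and array inputs assumed for illustration:

```python
# Majority-class undersampling: keep only as many "healthy" samples
# as there are "sick" ones, so the small group gains relative weight.
import numpy as np

def undersample(X: np.ndarray, y: np.ndarray, majority_label=0, seed=0):
    rng = np.random.default_rng(seed)
    majority = np.flatnonzero(y == majority_label)
    minority = np.flatnonzero(y != majority_label)
    kept = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([kept, minority])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

Deciding which majority samples to keep, rather than sampling at random as in this sketch, is where the use-based valuation from the sensitivity analysis described above would come in.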
Rigorous data minimalism would result in AI systems that let us account precisely for the effects that individual data points or data sets have on the system. This in turn means that we could also predict which data have a discriminatory influence on the result.
For me, the exceptional aspect of this project was the ability to do applied research. Universities have limited data available. For large corporations, managing a lot of data is no problem, because they have the computing and financial resources. At OTTO, we were able to use this data to build AI algorithms ourselves. But we didn't want to blindly stuff everything into the system just because we could. Less data means less computing time and lower CO2 emissions, simply because consumption is reduced. Energy consumption runs through the entire value chain, and power consumption grows linearly with the amount of data the algorithm processes.
However, there are no reliable and proven methods, either in science or in industry, that take these consequences into account and minimize them, and we need to do something about that. If we want AI and sustainability to become a reality, government agencies need to promote the kind of collaboration between science and industry that we had at OTTO.
Background
Energy consumption
Dimension:
environmental sustainability
Criteria:
energy consumption
Indicator:
measures are used to reduce the amount of data
The energy consumption of an AI system differs depending on the particular phase in its life cycle. The development phase of new AI models can be extremely energy intensive, despite growing hardware efficiency. Finding the desired model architecture, in particular, can sometimes require tremendous computing power. Energy consumption in the training and, especially, the application phase is significantly lower. However, unlike development, which is just a one-time process, these phases are sometimes repeated a massive number of times. Data minimalist approaches, which keep the data sets used for training and application small, are one way of reducing energy requirements during the training and application phases.
Self-determination and data protection
Dimension:
social sustainability
Criteria:
self-determination and data protection
Indicator:
the consistent implementation of a Privacy by Design approach
A consistent Privacy by Design approach takes data privacy and security interests into account during the planning and development stage of digital technologies. The General Data Protection Regulation (GDPR) specifically calls for a Privacy by Design approach, although it does leave room for flexibility in implementation. Privacy by Design means, for example, that data is encrypted and anonymized, used sparingly, and not merged unnecessarily with other data. Data minimalism is thus an integral part of Privacy by Design. With data-minimalist AI development, good data management, and high-quality, carefully selected data, discrimination risks in AI applications can be reduced. This also saves resources in terms of the computing power required.
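Two of the Privacy-by-Design practices named above, pseudonymizing identifiers and using data sparingly, can be sketched in a few lines. The column names and the field whitelist are hypothetical, and a salted hash is pseudonymization rather than full GDPR-grade anonymization.

```python
# Sketch: keep only the fields the model needs and replace the raw
# customer ID with a salted hash (pseudonymization, not anonymization).
import hashlib
import pandas as pd

NEEDED_FIELDS = ["item_id", "timestamp"]  # whitelist: use data sparingly

def minimize(orders: pd.DataFrame, salt: bytes) -> pd.DataFrame:
    out = orders[NEEDED_FIELDS].copy()
    out["customer"] = orders["customer_id"].map(
        lambda cid: hashlib.sha256(salt + str(cid).encode()).hexdigest()[:16]
    )
    return out
```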
DR. MICHAELA REGNERI
Product Manager for Business Intelligence Analytics at OTTO GmbH
…is passionate about Artificial Intelligence. Her main interests lie in the fields of cognitive computing with language, images, and all other types of human-machine interaction. After completing her doctorate in computational linguistics, she was initially responsible for search and data mining at the SPIEGEL-Verlag publishing company. In 2016, she began working for OTTO as a Product Manager for business intelligence analytics. Michaela Regneri is particularly interested in topics related to corporate digital responsibility and organizational development, including AI and the future of work, AI and staff development, and the sustainable design of AI systems.