Dr. Michaela Regneri
“Complex data processing requires more time, staff, and expensive systems.”
Minimalism refers to the amount of data we process with AI. We started with the question: What is the actual value of the data we process? As we aim for efficiency, that question is interesting both economically and ecologically. We want good, reliable AI that does what it is supposed to do and is as efficient and effective as possible. So, the less data needed for the same procedure, the better the result.
In our year-long project for OTTO, we first examined how we could determine which of the various data points, or how much of the entire data collection, was useful for the AI. Our aim was to determine the use-based value of the data. We conducted practical experiments and tested the individual algorithms to ascertain whether feeding them with data really helped the system. Which data lead to better results for the algorithm, and which make them worse? We saw that, beyond a certain point, saturation occurred and adding new data became unproductive.
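The saturation effect can be sketched with a toy learning curve. Everything below is an illustrative assumption made up for this sketch, not OTTO's system: a synthetic one-dimensional task with noisy labels and a deliberately simple threshold model, trained on growing fractions of the data.

```python
import random

random.seed(0)

# Illustrative synthetic task (not real shop data): label is 1 if x > 0,
# with 10% label noise, so the best reachable accuracy is about 90%.
def make_data(n, noise=0.1):
    data = []
    for _ in range(n):
        x = random.uniform(-1, 1)
        y = 1 if x > 0 else 0
        if random.random() < noise:
            y = 1 - y  # noisy label
        data.append((x, y))
    return data

# A deliberately simple model: decision threshold halfway between class means.
def fit_threshold(train):
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, data):
    return sum((1 if x > threshold else 0) == y for x, y in data) / len(data)

train, test = make_data(2000), make_data(2000)

# Learning curve: past a certain training-set size, accuracy saturates
# and adding more data no longer pays off.
accs = {}
for frac in (0.05, 0.1, 0.5, 1.0):
    n = int(len(train) * frac)
    accs[frac] = accuracy(fit_threshold(train[:n]), test)
    print(f"{frac:>4.0%} of the data -> accuracy {accs[frac]:.3f}")
```

In this toy setting, the gap between training on half the data and on all of it is negligible, which is exactly the saturation point the project looked for.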
We tried different approaches. The simplest method is to leave out data and see how well the algorithm works, first with half the data, then with one-third or two-thirds of it. This is a sensitivity analysis, and it works in principle like a reverse allergy test. With allergies, we look for harmful influences; here, we identify data that have a positive influence. If a data point is omitted and the system improves, then it was probably a bad data point. If it is omitted and nothing happens, then we know it is unnecessary for the system to function. And if the system gets worse, we have discovered a valuable data point.
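The reverse allergy test can be sketched as a leave-one-out loop: retrain with one data point removed and compare against the baseline. The tiny one-dimensional data set, the deliberately mislabeled point, and the nearest-class-mean model below are all assumptions made up for this sketch.

```python
# Leave-one-out sensitivity analysis: improvement after removal means the
# point was harmful, no change means unnecessary, degradation means valuable.
train = [(-2.0, 0), (-1.5, 0), (-1.0, 0), (1.0, 1), (1.5, 1), (2.0, 1),
         (-1.8, 1)]  # the last point is mislabeled on purpose
val = [(-1.2, 0), (-0.2, 0), (-2.2, 0), (0.9, 1), (1.3, 1), (2.1, 1)]

def fit(data):
    # Nearest-class-mean classifier in one dimension.
    mean0 = sum(x for x, y in data if y == 0) / sum(1 for _, y in data if y == 0)
    mean1 = sum(x for x, y in data if y == 1) / sum(1 for _, y in data if y == 1)
    return mean0, mean1

def accuracy(model, data):
    mean0, mean1 = model
    predict = lambda x: 0 if abs(x - mean0) < abs(x - mean1) else 1
    return sum(predict(x) == y for x, y in data) / len(data)

baseline = accuracy(fit(train), val)
verdicts = {}
for i, point in enumerate(train):
    delta = accuracy(fit(train[:i] + train[i + 1:]), val) - baseline
    verdicts[point] = ("harmful" if delta > 0 else
                       "valuable" if delta < 0 else "unnecessary")
    print(point, "->", verdicts[point])
```

Here the mislabeled point is the only one flagged as harmful: removing it pulls the class mean back into place and validation accuracy rises.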
There is a widespread tendency to throw in all the data you can get – particularly in industrial environments, where there is a lot of data, sometimes way more than you need. Sure, you might achieve an incremental improvement of half a percent, but to make the algorithm really efficient, you need to remove the harmful data. For example, we have had the problem of bots being let loose on a webpage. They wildly click around the shop, causing specific items to suddenly look very popular. This is bad for the system, of course. Sometimes, we unintentionally harm the AI ourselves, for example with a marketing promotion like a deal of the day, which results in lots of people clicking on a cheap article, when actually they are not necessarily interested in the item itself, just the discount. So, we are not dealing with their natural buying behavior, but with their reaction to something we did. This makes it difficult to draw any useful conclusions about the AI’s behavior.
“We select those data that actually contain important information, and make sure that the group is not larger than necessary for the desired result.”
Data minimalism reduces costs. Complex data processing requires more time, more staff, and expensive systems. Data protection and compliance add to the cost. The less data that is managed in the cloud, the lower the cloud costs. The model can be trained faster, so the innovation loops become shorter and the innovation potential grows. The algorithms can be tested faster, and ultimately, eliminating harmful data points leads to better performance. In an economic context, this always means more profit.
Reducing data means improving privacy and cybersecurity. Every sensitive data point that gets moved around unnecessarily is an extra security risk. Moreover, data minimalism can help curb discriminatory patterns in data sets.
People often argue that you need a lot of data for data sets to be balanced and non-discriminatory. There are, however, a few older techniques, based on making data sets smaller, that avoid discrimination. These come from medicine and medical statistics. For medicine, the most interesting group is the smallest one: the people who are actually sick. In order to do justice to this group in medical applications, it is possible to make the larger group of healthy people smaller. This can be understood from a data minimalist perspective: We select those data that actually contain important information, and make sure that the group is not larger than necessary for the desired result. In this way, minorities or smaller groups are given more weight, in relative terms.
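The technique from medical statistics described above, shrinking the majority group so the small group of interest gains relative weight, is random undersampling. A minimal sketch, assuming hypothetical patient records and invented group sizes:

```python
import random

random.seed(42)

# Hypothetical records: label 1 = sick (the small, interesting group),
# label 0 = healthy (the large majority). IDs and counts are made up.
records = [(f"p{i:03d}", 1) for i in range(30)] + \
          [(f"h{i:03d}", 0) for i in range(970)]

def undersample(records, minority_label=1, ratio=1.0):
    """Shrink the majority class to ratio times the minority-class size."""
    minority = [r for r in records if r[1] == minority_label]
    majority = [r for r in records if r[1] != minority_label]
    kept = random.sample(majority, min(len(majority), int(ratio * len(minority))))
    return minority + kept

balanced = undersample(records)
print(len(records), "->", len(balanced))  # 1000 -> 60
```

The resulting set is far smaller, yet the sick group now makes up half of it instead of three percent, which is the data-minimalist route to a balanced, less discriminatory sample.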
Rigorous data minimalism would result in AI systems that allow us to account precisely for the effects that individual data points or data sets have on the AI system. This in turn means that we could also predict which data have a discriminating influence on the result.
For me, the exceptional aspect of this project was the ability to do applied research. Universities have limited data available. For large corporations, there is no problem managing a lot of data, because they have the computing and financial resources. At OTTO, we were able to use this data to build AI algorithms ourselves. But we didn't want to blindly stuff everything into the system just because we could. Less data means less computing time and lower CO2 emissions, simply because consumption is being reduced. Energy consumption runs through the entire value-added chain, and power consumption grows linearly with the amount of data processed in the algorithm.
However, there are no reliable and proven methods, either in science or in industry, that take these consequences into account and minimize them, and we need to do something about that. If we want AI and sustainability to become a reality, government agencies need to promote the kind of collaboration that we had at OTTO between science and industry.