Master's Thesis at KSRI and KIT: Techniques for Coping with the Data Deluge


Machine Learning on Small & Distributed Data Sets

In his master's thesis at the Karlsruhe Service Research Institute (KSRI), prenode team member Michael Jahns investigates techniques for dealing with varying amounts of data. He explores how companies can apply Machine Learning to small and particularly sensitive data sets.

In the thesis, he evaluates several ML techniques with regard to prediction performance and computational effort. He concludes by recommending that practitioners implement Sequential Transfer Learning or Federated Transfer Learning, as both approaches show high potential for learning on distributed data sets while strengthening data privacy.
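To make the recommended idea concrete, here is a minimal, hypothetical sketch of sequential transfer learning on a toy linear model: weights are pretrained on a large "source" data pool and then fine-tuned on a small "target" pool. All data, dimensions, and hyperparameters below are illustrative and not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two related data pools (synthetic, for illustration only): both follow
# y = X @ w + noise, with slightly shifted weights between the pools.
w_true = np.array([2.0, -1.0, 0.5])
X_a = rng.normal(size=(500, 3))          # large "source" pool
y_a = X_a @ w_true + 0.1 * rng.normal(size=500)
X_b = rng.normal(size=(20, 3))           # small "target" pool
y_b = X_b @ (w_true + 0.1) + 0.1 * rng.normal(size=20)

def train(X, y, w, lr=0.05, steps=200):
    """Plain gradient descent on mean squared error, starting from w."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

# Baseline: train on the small target pool only, from scratch.
w_scratch = train(X_b, y_b, np.zeros(3))

# Sequential transfer: pretrain on the source pool, then fine-tune the
# resulting weights on the small target pool.
w_pre = train(X_a, y_a, np.zeros(3))
w_stl = train(X_b, y_b, w_pre)

X_test = rng.normal(size=(200, 3))
y_test = X_test @ (w_true + 0.1)
print(mse(X_test, y_test, w_scratch), mse(X_test, y_test, w_stl))
```

The key mechanism is the second `train` call: fine-tuning starts from the pretrained weights instead of from zero, so knowledge from the source pool carries over without the raw source data ever being merged with the target pool.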

Read more in the abstract below.

Abstract

“For Machine Learning (ML) technologies based on Neural Networks, a higher amount of training data generally leads to better results. When ML is used in a production environment, companies often cannot exploit its full potential, as only small data sets are available to them. However, if many companies train similar models with the same target, they could benefit from collaborating by exchanging and aggregating their data to enlarge the amount of training data. Such consolidation, however, may be prohibited by law or by contracts on data confidentiality.

A structured literature review is conducted to select suitable techniques that enable this cooperation. To this end, we identify Federated Learning (FL), Federated Transfer Learning (FTL), Sequential Transfer Learning (STL) and data generation with Generative Adversarial Networks (GANs) as potential solutions. These techniques are implemented in five different use cases where data is distributed but similar models need to be trained. Each use case consists of multiple separated data pools and the data is preprocessed identically for each technique. The trained models differ between the use cases, but not between the techniques, thus allowing a direct comparison of the prediction performance. The results are evaluated and compared for prediction performance and computational complexity across the five use cases.

Based on the prediction performance, STL proves to be the most promising technique closely followed by FTL. FL and the data generation technique cannot achieve such improvements. Inspecting the computational complexity, the STL technique performs worse than the other techniques. The result of this thesis is a suggestion for users as to what technique they should implement based on their data set properties to increase the prediction performance on a distributed data pool.” (Jahns, 2020)
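For readers unfamiliar with the federated setting the abstract compares against, the following is a hedged, numpy-only sketch of federated averaging, the basic pattern behind Federated Learning: each data pool trains locally and a server averages the returned weights, so raw data never leaves a pool. All names, data, and hyperparameters are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Three hypothetical data pools that must not be merged: each client holds
# its own private samples of the same underlying task y = X @ w_true.
w_true = np.array([1.0, -2.0])
pools = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    y = X @ w_true + 0.1 * rng.normal(size=50)
    pools.append((X, y))

def local_update(w, X, y, lr=0.05, steps=10):
    """A client refines the global weights on its private data only."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

# Federated averaging: clients train locally for a few steps, then the
# server averages the returned weights into a new global model.
w_global = np.zeros(2)
for _ in range(20):  # communication rounds
    local = [local_update(w_global, X, y) for X, y in pools]
    w_global = np.mean(local, axis=0)

print(w_global)
```

Only model weights cross the pool boundary in each round, which is what makes this family of techniques attractive when data confidentiality rules out pooling the data itself.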

Interested in writing your thesis at prenode? Get in touch!