Martin Luther University Halle-Wittenberg


BigFlow

BigFlow is a data-driven workflow approach that mediates between data preparation, machine learning, and interactive, visual data exploration in big data applications. It consists of three components: a lightweight Java library for data-driven workflows, a modeling framework that combines entity-relationship models with graphical models from machine learning, and an abstract data manipulation language for incrementally building complex data transformation operations.

Machine learning models are applicable to many different domains. For example, topic models, a class of probabilistic graphical models, can be applied to documents, bioinformatics data, or images and videos. However, each domain needs its own data preparation. Furthermore, after learning, the results of machine learning need to be translated back into the context of the application data so that domain experts can analyze, interpret, and evaluate them.

There are gaps between data preparation and machine learning, as well as between machine learning and interactive, visual data exploration. The state of the art is to close these gaps with scripts containing glue code that transforms the output of one step into the input of the next. Instead, BigFlow offers a lightweight Java library for data-driven workflows, called the command manager. It structures preprocessing and analysis pipelines into simple commands that communicate through a context object. The library supports online generation of workflows based on the dependencies of the commands, as well as nested workflows. This allows software components to be reused easily and functionality to be adapted by changing configurations instead of writing code.
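The command-manager idea can be sketched as follows: each command declares which context keys it reads and writes, and the manager orders commands so that every input is produced before it is consumed. This is a minimal, hypothetical sketch; the interface and class names are illustrative assumptions, not BigFlow's actual API.

```java
import java.util.*;

// Commands communicate only through a shared context object and
// declare their inputs (requires) and outputs (provides).
interface Command {
    Set<String> requires();                     // context keys read
    Set<String> provides();                     // context keys written
    void execute(Map<String, Object> context);
}

class Tokenize implements Command {
    public Set<String> requires() { return Set.of("documents"); }
    public Set<String> provides() { return Set.of("tokens"); }
    public void execute(Map<String, Object> ctx) {
        String docs = (String) ctx.get("documents");
        ctx.put("tokens", Arrays.asList(docs.split("\\s+")));
    }
}

class CountTokens implements Command {
    public Set<String> requires() { return Set.of("tokens"); }
    public Set<String> provides() { return Set.of("tokenCount"); }
    public void execute(Map<String, Object> ctx) {
        ctx.put("tokenCount", ((List<?>) ctx.get("tokens")).size());
    }
}

public class CommandManager {
    // Order commands so each one's inputs are available before it runs,
    // starting from the keys already present in the context.
    public static List<Command> order(List<Command> commands, Set<String> available) {
        List<Command> ordered = new ArrayList<>();
        Set<String> have = new HashSet<>(available);
        List<Command> pending = new ArrayList<>(commands);
        while (!pending.isEmpty()) {
            Command next = pending.stream()
                .filter(c -> have.containsAll(c.requires()))
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("unsatisfiable dependencies"));
            pending.remove(next);
            have.addAll(next.provides());
            ordered.add(next);
        }
        return ordered;
    }

    public static void main(String[] args) {
        Map<String, Object> ctx = new HashMap<>();
        ctx.put("documents", "big data workflows");
        // Commands are passed in the "wrong" order; dependency resolution fixes it.
        for (Command c : order(List.of(new CountTokens(), new Tokenize()), ctx.keySet())) {
            c.execute(ctx);
        }
        System.out.println(ctx.get("tokenCount")); // prints 3
    }
}
```

Because commands only know their context keys, a workflow can be reconfigured by swapping or adding commands without touching the remaining code.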

An example workflow of the data preparation for topic modeling of documents.


The second component of BigFlow is a modeling framework that combines entity-relationship models with graphical models from machine learning. This framework allows application developers and database designers to use machine learning models in applications without dealing with the mathematical details of probabilistic models.
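One way to picture this combination: an entity from the entity-relationship model carries both its observed attributes and a latent attribute that a probabilistic model fills in, so application code reads the model's result like any other field. This is a purely hypothetical sketch; the class and attribute names are assumptions for illustration, not BigFlow's framework.

```java
// A document entity whose attributes split into observed data
// (as in a conventional ER model) and a latent variable estimated
// by machine learning, e.g. a topic model's topic distribution.
public class DocumentEntity {
    // observed attributes
    final String docId;
    final String text;

    // latent attribute: filled in after learning, then accessed
    // by application code like any other entity attribute
    double[] topicDistribution;

    public DocumentEntity(String docId, String text) {
        this.docId = docId;
        this.text = text;
    }
}
```

The point of the abstraction is that the developer assigns and reads `topicDistribution` without needing to know how the underlying graphical model infers it.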


Big data adds the difficulty of processing large amounts of data with workflows composed of many small parts. BigFlow's strategy for performing well on large data is to use an abstract data manipulation language. It allows complex data transformation operations to be composed incrementally during the steps of a workflow. Once a data transformation operation is fully specified, it is executed in a centralized or distributed manner, depending on the underlying data processing infrastructure. This decouples application and machine learning logic from data processing systems such as relational databases, Hadoop clusters, or graph databases.
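The incremental composition described above can be sketched as a plan that workflow steps extend without executing anything; only when the plan is complete does a backend run it. Here the backend is a simple in-memory list, standing in for a relational database or Hadoop cluster that would translate the same plan into its own jobs. All names and the API shape are illustrative assumptions, not BigFlow's actual language.

```java
import java.util.*;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// A data manipulation plan: operations are appended lazily and
// executed later against whatever backend holds the data.
class Plan<T> {
    private final List<Function<List<T>, List<T>>> ops = new ArrayList<>();

    Plan<T> filter(Predicate<T> p) {
        ops.add(rows -> rows.stream().filter(p).collect(Collectors.toList()));
        return this;
    }

    Plan<T> map(Function<T, T> f) {
        ops.add(rows -> rows.stream().map(f).collect(Collectors.toList()));
        return this;
    }

    // In-memory "backend"; a distributed engine could interpret the
    // same plan instead, without changes to the workflow code.
    List<T> executeOn(List<T> data) {
        List<T> result = data;
        for (Function<List<T>, List<T>> op : ops) result = op.apply(result);
        return result;
    }
}

public class PlanDemo {
    public static void main(String[] args) {
        Plan<String> plan = new Plan<>();
        plan.map(String::toLowerCase);      // one workflow step adds this...
        plan.filter(s -> s.length() > 3);   // ...a later step adds this
        // Execution happens once, after the plan is fully specified.
        System.out.println(plan.executeOn(List.of("Big", "Data", "Workflows")));
        // prints [data, workflows]
    }
}
```

Deferring execution this way is what lets the same workflow logic run unchanged whether the data lives in memory, a database, or a cluster.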


Road map

  • Implementation of the workflow library
  • Development of the modeling framework
  • Specification of the abstract data manipulation language


Contact: Alexander Hinneburg,
Contributors:

  • Benjamin Schandera
  • Frank Rosner
  • Mattes Angelus
  • Sebastian Bär
  • Matthias Pfuhl
