Data Science and Scientific Workflows

Content

The amount of data generated in scientific projects is increasing rapidly. This increase is partly because new data-driven evaluation methods allow a better and more precise analysis of scientific data. In addition, linking data provides new insights. Both require a systematic organization of data. Knowledge of data science and computer science is therefore needed for computer simulations and experimental investigations alike. Preparing, classifying (e.g. in an electronic laboratory notebook) and structuring data is a necessary step for its reuse. The lecture introduces the principles and software tools for the corresponding scientific workflows: Python and its libraries, Jupyter notebooks, shell scripts, and documentation with Git tools. Applications in Python include statistical methods and machine learning techniques such as classification, artificial neural networks (ANN), convolutional neural networks (CNN), and Gaussian processes (GP) for simulation planning. The lecture also gives an overview of database systems in materials research and of the FAIR data principles (findability, accessibility, interoperability and reusability).

 

Objective:

 

Students will be able to

  • organize and document data electronically
  • handle simple and hierarchical data formats
  • work with software management tools (Git, GitLab)
  • record scientific workflows in detail and ensure traceability
  • use Python-based libraries for data handling and analysis
  • apply the fundamentals of machine learning

 

Detailed lecture content:

 

  1. Introduction: the need for data science and computer science basics
  2. Programming and programming paradigms using Python
  3. Software and data management: local and central management (Git, GitLab)
  4. Data processing: automating tasks, from scripts to workflows (examples from simulation and experiment)
  5. Electronic laboratory notebook
  6. Machine learning: classification, neural networks, Gaussian processes

 

Exercise:

The exercises (1 SWS) deepen the lecture material.

 

Mode of examination:

  • Project on a topic from one of the following areas:
    • Materials simulation and workflows
    • Data organization and analysis, based on experiment or simulation
  • Presentation of the project in a 15-minute talk, followed by questions
  • Prerequisite for the examination: successful start of the project work
Language of instruction: German
Bibliography

Literature:

  • Handbuch Data Science, Hanser Verlag
  • Effective Computation in Physics, Scopatz & Huff, O’Reilly, 2015
  • Python Data Science Handbook, J. VanderPlas, O’Reilly, 2016
  • Materials Data Science, S. Sandfeld, Springer, 2024
Organisational issues

The lecture has been moved to the winter semester.