
Jacob Schmieder

Open to PhD Opportunities

I’m actively seeking a PhD position that blends theoretical physics with machine learning. As a Senior Data Scientist and AI Advisor, I build research-grade infrastructure, mentor interdisciplinary teams, and turn complex datasets into actionable insight. I’m ready to bring that expertise to a doctoral group tackling ambitious, data-driven physics questions.

  • Focus areas: Machine learning for physics, AI infrastructure, scientific computing
  • What I bring: Proven track record of shipping reproducible ML experiments, enabling HPC pipelines, and advising on trustworthy AI strategies
  • Availability: Actively speaking with labs and ready for exploratory collaborations now

I’m available for conversations now; just use “Contact me” whenever you’d like to talk.

About me

I am a Senior Data Scientist at the Deutsches Biomasseforschungszentrum (DBFZ) in Leipzig, where I pair theoretical physics training with hands-on AI and HPC engineering. My work spans building probabilistic and neural workflows for research questions, operating Slurm-based clusters and containerised pipelines, and leading DevOps automation with GitLab and GitHub Actions. I also design open-source tooling such as ScrAIbe and ScrAIbe-WebUI to make research infrastructure more accessible, while mentoring students through thesis supervision, workshops, and leadership roles like chairing the Physics Student Council. I’m now seeking a PhD that lets me bring this interdisciplinary mix of physics, machine learning, and infrastructure to ambitious scientific problems.
Explore my skills

Education

2019 – 2022
Martin Luther University Halle-Wittenberg
Master of Science in Medical Physics
2015 – 2019
Martin Luther University Halle-Wittenberg
Bachelor of Science in Medical Physics

Professional Experience

2023 – Present
Senior Data Scientist & AI Advisor
Deutsches Biomasseforschungszentrum (DBFZ), Leipzig, Germany


Steering AI strategy for sustainable energy research

I lead applied data science projects at the Deutsches Biomasseforschungszentrum (DBFZ), helping research groups unlock value from complex bioenergy datasets. My role combines high-level advisory work with hands-on implementation across the machine learning lifecycle.

  • Design data and machine learning strategies that align with project goals, funding requirements, and ethical guardrails.
  • Build and maintain reproducible analytics pipelines that empower researchers to iterate quickly without compromising rigour.
  • Shape the institute’s AI-HPC roadmap, selecting tooling and infrastructure that balance performance, cost, and long-term maintainability.
  • Mentor interdisciplinary teams on best practices in data governance, documentation, and collaborative experimentation.
  • Introduce probabilistic reasoning methods and calibration routines that raise confidence in model predictions for policy-facing research.
  • Lead the operation of Slurm clusters, container stacks, and GitLab/GitHub automation, including CVE-driven patching and security monitoring.
  • Build open-source tooling such as ScrAIbe and ScrAIbe-WebUI so non-coders can access transcription and diarisation pipelines.

This position allows me to bridge theoretical physics training with real-world impact, ensuring that machine learning accelerates rather than obscures scientific insight.
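To give a flavour of the calibration work mentioned above, here is a minimal sketch (placeholder model and data, not DBFZ code) of checking whether a classifier’s predicted probabilities match the frequencies actually observed:

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data standing in for a real bioenergy classification task
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

# Reliability curve: mean predicted probability vs. observed fraction of positives per bin
frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)
for p, f in zip(mean_pred, frac_pos):
    print(f"predicted {p:.2f} -> observed {f:.2f}")
```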

Blending data science with physics

From research labs to advisory roles, I build and guide AI solutions that support scientific discovery and resilient infrastructure.

A selection of my work

ScrAIbe-WebUI: No-Code Access to Accurate Transcripts

After releasing ScrAIbe, many colleagues still needed a friendly interface—they were researchers, not DevOps engineers. ScrAIbe-WebUI solves that gap with a FastAPI backend and a responsive frontend that wraps the underlying containers.

Highlights:

  • Drag-and-drop uploads with progress tracking for long recordings.
  • Profile presets (interviews, meetings, field recordings) that change diarisation + cleaning parameters without touching config files.
  • Fine-grained access control so institutes can delegate transcription duties without exposing cluster credentials.

The UI talks to the same API that scripts use, which means power users keep their automation while the rest of the team gets a reliable, accessible experience.
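A script can drive that same backend directly. The sketch below is illustrative only; the endpoints and field names are placeholders, not ScrAIbe-WebUI’s actual routes:

```python
import requests

# Hypothetical endpoints for illustration; not the project's actual API routes.
BASE_URL = "http://localhost:8000"

# Submit a recording the same way the web frontend would
with open("interview.wav", "rb") as audio:
    job = requests.post(f"{BASE_URL}/api/transcribe", files={"file": audio}).json()

# Fetch the finished transcript through the same API
transcript = requests.get(f"{BASE_URL}/api/jobs/{job['id']}").json()
print(transcript["text"])
```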

Check it out on GitHub

ScrAIbe: Research-Grade Transcription

ScrAIbe started as an internal tool for converting laboratory interviews and sensor briefings into searchable text. Off-the-shelf speech-to-text services struggled with German-English code switching and overlapping speakers, so I built a pipeline that combines state-of-the-art ASR with probabilistic diarisation and confidence scoring.

Key pieces include:

  • A modular inference stack (Whisper + Pyannote) orchestrated through containerised workers so labs can run the service on-premises.
  • A calibration layer that flags low-confidence passages and surfaces timestamps for quick review.
  • Automated QC reports that let researchers jump straight to the segments that need manual corrections.

The project is open source because reproducible infrastructure should not be a black box. You can read the documentation, run the containers locally, or extend the diarisation modules for your own corpora.
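For a sense of how the pieces fit together, here is a minimal sketch of the ASR-plus-diarisation step. The model names, overlap heuristic, and confidence threshold are illustrative assumptions, not the actual ScrAIbe implementation:

```python
import whisper
from pyannote.audio import Pipeline

# Load an ASR model and a diarisation pipeline (the pyannote model requires a HF token)
asr = whisper.load_model("medium")
diariser = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="YOUR_HF_TOKEN"
)

audio_path = "interview.wav"
result = asr.transcribe(audio_path)
diarization = diariser(audio_path)

# Assign each ASR segment the speaker whose turn overlaps it most,
# and flag low-confidence passages for manual review.
for seg in result["segments"]:
    best_speaker, best_overlap = None, 0.0
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        overlap = min(seg["end"], turn.end) - max(seg["start"], turn.start)
        if overlap > best_overlap:
            best_speaker, best_overlap = speaker, overlap
    flag = " [REVIEW]" if seg["avg_logprob"] < -1.0 else ""
    print(f"[{seg['start']:7.2f}-{seg['end']:7.2f}] {best_speaker}: {seg['text'].strip()}{flag}")
```

ScrAIbe wraps this kind of logic in containerised workers and adds the calibration layer and QC reports described above.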
Check it out on GitHub

Master’s Thesis: Machine-Learned Compression of Gaussian Basis Sets

My master’s thesis explored whether machine learning can compress Gaussian basis sets for ab initio density-functional theory (DFT) calculations without sacrificing accuracy.

  • Introduced a projection-based loss that aligns a compact, optimisable basis to a larger reference set, enabling joint optimisation of exponents and contractions with gradient-based methods via automatic differentiation.
  • Evaluated the learned bases across diverse small molecules, noting consistent energy-error reductions for minimal and split-valence sets (especially STO-nG), while gains over modern polarised/augmented references remained modest.
  • Analysed trends with molecule size and basis-family choice, compared optimisers and learning schedules, and catalogued failure modes such as overfitting, divergence, and missing polarisation/diffuse character.
  • Discussed how the approach points toward data-driven, atom-centric basis design and integration into mixed-basis workflows for larger systems.

The study highlights both the promise and the current limitations of ML-driven basis reduction, motivating richer datasets and hybrid strategies.
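As a toy illustration of the projection-based, gradient-driven optimisation described above (a sketch under stated assumptions, not the thesis code), the snippet below fits a single s-type Gaussian exponent to a fixed three-primitive reference contraction by maximising their overlap via automatic differentiation:

```python
import torch

def s_overlap(a, b):
    # Analytic overlap of two normalised s-type Gaussian primitives
    return (2.0 * torch.sqrt(a * b) / (a + b)) ** 1.5

# Fixed reference contraction (exponents/coefficients in the spirit of STO-3G for hydrogen 1s)
ref_exp = torch.tensor([3.42525, 0.62391, 0.16886])
ref_coef = torch.tensor([0.15433, 0.53533, 0.44463])
ref_norm = (ref_coef[:, None] * ref_coef[None, :]
            * s_overlap(ref_exp[:, None], ref_exp[None, :])).sum().sqrt()

# Compact basis: a single primitive whose exponent is optimised (log-parametrised)
log_alpha = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([log_alpha], lr=0.05)

for step in range(300):
    alpha = log_alpha.exp()
    # Projection of the compact primitive onto the normalised reference contraction
    projection = (ref_coef * s_overlap(alpha, ref_exp)).sum() / ref_norm
    loss = 1.0 - projection
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"optimised exponent: {log_alpha.exp().item():.4f}, residual: {loss.item():.4f}")
```

In the thesis this idea is generalised from a single exponent to whole basis sets, optimising exponents and contraction coefficients jointly against a larger reference basis.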

Download thesis (PDF)

Bachelor’s Thesis: EEG-Based Assessment of Consciousness

During my bachelor’s thesis I examined whether EEG markers could serve as a real-time index of consciousness for patients undergoing propofol-sedated endoscopy.

  • Two-channel scalp EEG was bandpass-filtered into α (8–13 Hz), low-β (13–20 Hz), high-β (20–31 Hz), and γ (31–49 Hz) bands. Instantaneous amplitude, phase, and frequency were extracted via Hilbert transforms and resampled to 1 Hz.
  • These features were compared with clinician annotations (including RASS) using ROC/AUC analysis to identify the most discriminative markers.
  • Band-limited amplitudes proved most informative, especially within the β range, with α supporting transitions between relaxed wakefulness and light sedation, whereas phase-synchronisation metrics generalised poorly across subjects.
  • Variability in signal quality, occasional amplifier saturation, sampling-rate jitter, and metadata granularity prevented a single universal metric, motivating subject-specific modelling and richer datasets.

The results support multi-feature, personalised approaches for closed-loop sedation monitoring and highlight the need for larger, standardised corpora.
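For a flavour of the feature extraction described above, here is a minimal sketch (not the thesis code; the sampling rate and the signal are placeholders) that band-pass filters an EEG channel, takes the Hilbert envelope as the instantaneous amplitude, and resamples it to 1 Hz:

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def band_amplitude(signal, fs, low, high):
    # Band-pass filter, then take the Hilbert envelope as the instantaneous amplitude
    b, a = butter(4, [low, high], btype="bandpass", fs=fs)
    return np.abs(hilbert(filtfilt(b, a, signal)))

fs = 250                        # assumed sampling rate in Hz
eeg = np.random.randn(60 * fs)  # stand-in for one minute of a real EEG channel

bands = {"alpha": (8, 13), "low_beta": (13, 20), "high_beta": (20, 31), "gamma": (31, 49)}
features = {}
for name, (low, high) in bands.items():
    amp = band_amplitude(eeg, fs, low, high)
    # Resample the envelope to 1 Hz by averaging within one-second windows
    features[name] = amp[: amp.size // fs * fs].reshape(-1, fs).mean(axis=1)

print({name: feat.shape for name, feat in features.items()})
```

In the thesis, features of this kind were then compared against clinician annotations via ROC/AUC analysis to identify the most discriminative markers.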

Download thesis (PDF)