Jacob Schmieder

Open to PhD Opportunities

I'm a physicist and Senior Data Scientist working with machine learning and AI tools in scientific applications. I'm currently seeking a PhD position where I can focus on physics-informed AI, developing reliable ML methods and scalable AI infrastructure for data-intensive physics.

  • Focus & research areas: Machine learning in scientific applications, physics-informed ML, probabilistic and uncertainty-aware AI, AI infrastructure, and scientific computing and simulation-driven studies
  • What I bring: Experience transforming state-of-the-art AI models into robust open-source products and deploying these systems in HPC and cluster environments
  • Availability: Currently seeking groups working in these areas and open to exploratory collaborations

If this aligns with your group's research, tap Contact me and let's talk.

Contact me

About me

I am a Senior Data Scientist at the Deutsches Biomasseforschungszentrum (DBFZ) in Leipzig, where I combine a background in theoretical physics with hands-on AI and HPC engineering. I turn existing machine learning models into reliable open-source tools such as ScrAIbe and ScrAIbe-WebUI, run large-scale workloads on an HPC cluster with Slurm, and deploy containerised services into production with CI/CD workflows powered by GitLab and GitHub Actions. I occasionally experiment with probabilistic modelling and information-field-theory-inspired ideas. Outside of work, I tinker with a small homelab built around Proxmox, Raspberry Pi, and home automation. I am now looking for a PhD position where I can use this mix of physics, machine learning, and infrastructure to tackle demanding, data-intensive physics problems.
Explore my skills

Education

2019 – 2022
Martin-Luther-University Halle-Wittenberg
Master of Science in Medical Physics
2015 – 2019
Martin-Luther-University Halle-Wittenberg
Bachelor of Science in Medical Physics

Professional Experience

2023 – Present
Senior Data Scientist & AI Advisor
Deutsches Biomasseforschungszentrum (DBFZ), Leipzig, Germany

My Role at DBFZ & KIDA

I work on applied data science projects at the Deutsches Biomasseforschungszentrum (DBFZ), including contributions to the interdisciplinary KIDA project. I support research groups at DBFZ and partner institutions in turning complex datasets into robust scientific and policy-relevant results. My role blends strategic AI advisory work with hands-on implementation across the machine-learning lifecycle.

Key responsibilities

  • Design data and ML strategies aligned with project goals, funding constraints, and ethical guardrails.
  • Provide AI guidance for researchers, from problem framing to evaluation and publication.
  • Co-shape the institute's AI-HPC roadmap and select tooling and infrastructure that balance performance, cost, and maintainability.
  • Plan and help build shared HPC/AI infrastructure across multiple institutions in the KIDA consortium.
  • Apply probabilistic methods in small, practical ways to better represent uncertainty in policy-facing ML results.
  • Operate Slurm clusters, container stacks, and GitLab/GitHub automation, including CVE-driven patching and security monitoring.
  • Develop open-source tools such as ScrAIbe and ScrAIbe-WebUI so non-coders can use transcription and diarisation pipelines.
  • Advise IT staff on translating research and infrastructure plans into stable production services.
  • Supervise Bachelor's and Master's theses and contribute to project and grant applications.

This work lets me connect my theoretical physics background to real-world impact: using ML to clarify, not obscure, scientific insight.

Blending data science with physics

From research labs to advisory roles, I build and guide AI solutions that support scientific discovery and resilient infrastructure.

A selection of my work

Numerical Information Field Theory for Acoustic Monitoring (Poster)

Passive acoustic monitoring provides continuous, high-resolution recordings, but real ocean soundscapes are messy: noise is non-stationary and biologically important click events can be sparse. These conditions often challenge purely discriminative detectors. This poster explores Information Field Theory (IFT) as a Bayesian alternative for acoustic denoising. IFT treats signals as latent fields and reconstructs their posterior mean together with calibrated uncertainty.

Methodologically, the spectrogram is modeled as data generated from a latent acoustic field plus noise. A time-periodic, frequency-random prior encodes expected click structure, and variational optimisation of the Gibbs free energy yields the reconstruction. Minimising this free energy corresponds to variational Bayesian inference within IFT.
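As a brief formal sketch in standard IFT notation (the specific prior and response used on the poster may differ), the data model and the variational objective read:

    d = R(s) + n, \qquad \mathcal{H}(d,s) = -\log P(d,s)

    G[Q] = \big\langle \mathcal{H}(d,s) \big\rangle_{Q} - \mathcal{S}[Q]
         = \mathrm{KL}\!\left[ Q(s) \,\|\, P(s \mid d) \right] - \log P(d)

Since the evidence term \(-\log P(d)\) does not depend on \(Q\), minimising the Gibbs free energy \(G\) over the approximating family \(Q(s)\) is exactly variational Bayes; the reconstruction is the posterior mean \(\langle s \rangle_Q\), with uncertainty read off from the covariance of \(Q\).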

Implemented in NIFTy.re, the numerical IFT library for Gaussian-process priors and scalable variational inference, the approach suppresses background clutter in sperm-whale recordings and recovers both regular and slow click types without hand-crafted filters.

Outlook: next steps are (1) scaling to automatic click detection in large, sparse archives, and (2) reconstructing fine click morphology (envelope, phase, sub-pulse spacing) to study internal click structure.

Presented at HAICON25, Karlsruhe (03.06.2025) — Schmieder J., Albrecht S., Mousavi H., Fais A.

Read the poster (PDF)

ScrAIbe-WebUI: No-Code Transcription for Research Teams

After releasing ScrAIbe, many colleagues still needed a friendly interface—they were researchers, not DevOps engineers. ScrAIbe-WebUI closes that gap with a Gradio-based, no-code frontend for the same backend pipeline.

Highlights:

  • Drag-and-drop uploads + recording from microphone or webcam, designed for long files and real lab workflows.
  • Two operating modes:
    • Synchronous (real-time) transcription for live use.
    • Asynchronous processing that can integrate with a mail client: users upload in the UI and get transcripts back automatically by email.
  • Broad media support through FFmpeg compatibility (audio and video in most common formats).
  • Advanced model controls in the UI: choose any Whisper model or the faster-whisper CPU backend, and toggle Pyannote diarisation when needed.
  • Easy lab deployment via Docker or Docker Compose, plus a config.yaml for clean customization of defaults and UI behavior.

Power users keep their Python/CLI automations in ScrAIbe, while everyone else gets reliable, reproducible transcription through a browser.

Check it out on GitHub

ScrAIbe: Research-Grade Transcription & Diarisation

ScrAIbe began as an internal research tool for turning lab interviews, field recordings, and technical briefings into searchable, citable text. Standard speech-to-text services struggled with noisy environments, German/English code-switching, and overlapping speakers, so I built a pipeline tailored to research-grade audio.

ScrAIbe is a modular, multilingual transcription and speaker pipeline:

  • Whisper-based ASR for high-accuracy transcription and optional translation of segments.
  • Speaker diarisation + recognition via Pyannote, with VoxCeleb embeddings for robust speaker separation.
  • Automatic language identification using VoxLingua to handle mixed-language recordings cleanly.
  • Multiple entry points: a Python API for full control (see the short sketch after this list), a CLI for batch jobs, and an optional lightweight Gradio app for quick local runs.
  • Server-friendly deployment through Docker when you want consistent lab/on-prem setups.
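
A minimal usage sketch of the Python entry point, assuming the Scraibe class and its autotranscribe method; check the repository for the exact interface and parameters of the current release:

    # Minimal sketch: names and defaults are assumptions, see the GitHub repo for
    # the current API.
    from scraibe import Scraibe

    model = Scraibe()                                      # loads Whisper + Pyannote defaults
    transcript = model.autotranscribe("interview.wav")     # hypothetical example file
    print(transcript)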

ScrAIbe is open source because research infrastructure shouldn’t be a black box. If you want a fully no-code experience for teams, the companion project ScrAIbe-WebUI wraps this backend into an easy Docker-deployable web service.

Check it out on GitHub

Master’s Thesis: Machine-Learned Compression of Gaussian Basis Sets

My master’s thesis explored whether machine learning can compress Gaussian basis sets for ab initio density-functional theory (DFT) calculations without sacrificing accuracy.

  • Introduced a projection-based loss that aligns a compact, optimisable basis to a larger reference set, enabling joint optimisation of exponents and contractions with gradient-based methods via automatic differentiation (see the sketch after this list).
  • Evaluated the learned bases across diverse small molecules, noting consistent energy-error reductions for minimal and split-valence sets (especially STO-nG), while gains over modern polarised/augmented references remained modest.
  • Analysed trends with molecule size and basis-family choice, compared optimisers and learning schedules, and catalogued failure modes such as overfitting, divergence, and missing polarisation/diffuse character.
  • Discussed how the approach points toward data-driven, atom-centric basis design and integration into mixed-basis workflows for larger systems.
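
The following is an illustrative sketch, not the thesis code: it shows the general pattern of optimising Gaussian exponents against a projection-style loss with automatic differentiation, here on a toy 1D target and with an assumed least-squares projection residual as the loss.

    # Toy sketch (not the thesis implementation): optimise Gaussian exponents so that
    # a reference function is well represented in the span of a small basis.
    import jax
    import jax.numpy as jnp

    x = jnp.linspace(-6.0, 6.0, 400)            # 1D grid standing in for real-space integrals
    reference = jnp.exp(-jnp.abs(x))            # toy stand-in for a reference orbital

    def basis(log_alpha):
        """Normalised Gaussians with exponents exp(log_alpha); shape (n_basis, n_grid)."""
        alpha = jnp.exp(log_alpha)[:, None]
        g = jnp.exp(-alpha * x[None, :] ** 2)
        return g / jnp.linalg.norm(g, axis=1, keepdims=True)

    def projection_loss(log_alpha):
        """Squared residual of the reference after projection onto span{basis}."""
        B = basis(log_alpha)
        coeffs = jnp.linalg.solve(B @ B.T, B @ reference)   # normal equations
        return jnp.sum((reference - B.T @ coeffs) ** 2)

    log_alpha = jnp.log(jnp.array([0.3, 1.0, 3.0]))         # initial exponents
    value_and_grad = jax.value_and_grad(projection_loss)
    for _ in range(300):                                     # plain gradient descent
        loss, grad = value_and_grad(log_alpha)
        log_alpha = log_alpha - 0.05 * grad
    print(float(loss), jnp.exp(log_alpha))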

The study highlights both the promise and the current limitations of ML-driven basis reduction, motivating richer datasets and hybrid strategies.

Download thesis (PDF)

Bachelor’s Thesis: EEG-Based Assessment of Consciousness

For my bachelor’s thesis, I examined whether EEG markers could serve as a real-time index of consciousness in patients undergoing propofol-sedated endoscopy.

  • Two-channel scalp EEG was bandpass-filtered into α (8–13 Hz), low-β (13–20 Hz), high-β (20–31 Hz), and γ (31–49 Hz) bands. Instantaneous amplitude, phase, and frequency were extracted via Hilbert transforms and resampled to 1 Hz (see the sketch after this list).
  • These features were compared with clinician annotations (including RASS) using ROC/AUC analysis to identify the most discriminative markers.
  • Band-limited amplitudes—especially within the β range, with α supporting transitions between relaxed wakefulness and light sedation—proved most informative, whereas phase-synchronisation metrics generalised poorly across subjects.
  • Variability in signal quality, occasional amplifier saturation, sampling-rate jitter, and metadata granularity prevented a single universal metric, motivating subject-specific modelling and richer datasets.
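
A minimal sketch of this kind of band-limited Hilbert feature extraction, assuming a sampling rate and filter order for illustration (not the original analysis code):

    # Sketch of band-limited Hilbert features; sampling rate and filter order are assumptions.
    import numpy as np
    from scipy.signal import butter, filtfilt, hilbert

    fs = 250                                      # assumed EEG sampling rate in Hz
    bands = {"alpha": (8, 13), "low_beta": (13, 20), "high_beta": (20, 31), "gamma": (31, 49)}

    def band_features(eeg, fs, lo, hi):
        """Band-limit one EEG channel and return instantaneous amplitude, phase, frequency."""
        b, a = butter(4, [lo, hi], btype="bandpass", fs=fs)
        analytic = hilbert(filtfilt(b, a, eeg))
        amplitude = np.abs(analytic)
        phase = np.unwrap(np.angle(analytic))
        inst_freq = np.diff(phase) * fs / (2.0 * np.pi)     # Hz, one sample shorter
        return amplitude, phase, inst_freq

    eeg = np.random.randn(fs * 60)                # placeholder for one minute of raw EEG
    features = {name: band_features(eeg, fs, lo, hi) for name, (lo, hi) in bands.items()}
    # Resample amplitudes to 1 Hz by averaging within each second, as in the comparison step:
    amp_1hz = {name: f[0].reshape(-1, fs).mean(axis=1) for name, f in features.items()}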

The results support multi-feature, personalised approaches for closed-loop sedation monitoring and highlight the need for larger, standardised corpora.

Download thesis (PDF)