# Processing Workflow
> **Work in Progress**
>
> For now this is largely copied from the old README.
This section documents the technical implementation of the Lightning Tracks processing chain: how Snakemake organizes the steps, how datasets are described, where outputs go, and how to control and run the workflow on different systems.
For the physics narrative of the selection itself (why a particular cut or model was chosen, how the final performance looks), see Selection instead.
## Workflow Overview
The LT processing pipeline is implemented as a Snakemake workflow driven by the top-level `Snakefile`. It:
- Applies ML filter models and reconstructions to IceCube Level 2 data.
- Manages dependencies between steps via file-based rules.
- Supports multiple dataset types (`exp`, `nugen`, `corsika`, `muongun`).
- Runs on different backends (local execution, Slurm clusters, HTCondor via profiles).
- Optionally stages input data from the Madison file system, temporarily downloading files as needed and deleting them once processed.
At a high level, the workflow steps are:
- `01_filter_models`: apply CNN/MLP filter models to Level 2 events.
- `02_RNN_directional_reco`: run RNN directional reconstruction.
- `03_RNN_energy_reco`: run RNN energy reconstruction (optional for physics).
- `04_MuEx`: run MuEx energy estimation on reconstructed tracks.
- `05_STV`: run ESTES Starting Track Veto for starting-track candidates.
- `06_TNF_reco`: run Normalizing Flow (TNF) directional reconstruction.
- `07_final_cut_models`: evaluate final starting/throughgoing MLPs.
- `08_hdf_export`: export key variables to per-run HDF5.
The postprocessing from HDF5 to merged NumPy and Csky NPY is handled by notebooks and described below.
## Repository Layout (Processing-Relevant Parts)
The following directories and files are central to the workflow implementation:
- `Snakefile`: main workflow file defining rules and meta-rules.
- `config/datasets/`: YAML dataset definitions, grouped by dataset type.
    - `exp/`: experimental data configs (per year).
    - `nugen/`: NuGen simulation configs (per set).
    - `corsika/`: CORSIKA simulation configs.
- `local_configs/`: environment-specific configs (e.g. `madison.yaml`, `msu.yaml`) that define paths and staging behavior.
- `profiles/`: cluster profiles for Snakemake (e.g. `profiles/msu/` for Slurm on the MSU HPCC).
- `processing_scripts/`: Python scripts called by Snakemake rules to execute individual steps (filters, reconstructions, exports, etc.).
- `models/`: on-disk locations of pre-trained ML models (not version-controlled, assumed to be present at runtime).
- `utils/dataset_manager.py`: core logic for interpreting dataset configs and constructing lists of input and output files.
The rest of the repository (e.g. notebooks/, docs/) is used for postprocessing and documentation.
## Dataset Configuration and Management

### Datasets and Samples
The workflow supports several dataset categories, each described by YAML configs under config/datasets/:
- Experimental data (`exp`) – yearly Level 2 data (e.g. 2017, 2018, …), with runs, GCDs, and livetimes explicitly listed.
- Neutrino-Generator (`nugen`) – signal-dominated neutrino simulations, organized by set ID.
- CORSIKA (`corsika`) – atmospheric muon background simulations.
- MuonGun (`muongun`, optional) – alternative atmospheric muon generator; currently not central but supported by the same machinery.
Each config specifies:
- `dataset_id` and `dataset_type` (e.g. `2017`, `exp`).
- Input locations and filename patterns with wildcards like `{run_id}` and `{file_id}`.
- A list of runs or run ranges, their GCD files, and expected file counts.
- For data, the livetime per run (used later for GRL construction).
The `DatasetManager` (described below) interprets these configs and constructs the input/output file lists used by Snakemake.
### Dataset YAMLs

Each dataset is described by a YAML file under `config/datasets/<dataset_type>/`. A typical experimental dataset config looks like:
```yaml
dataset_id: '2017'
dataset_type: exp
dataset_path: exp/IceCube
file_pattern: '{storage_year}/filtered/level2/{storage_date}/{run_id}/Level2_IC86.2017_data_{run_id}_Subrun00000000_{file_id}.i3.zst'
exclude: '_IT.i3.zst'
runs:
  - run_id: Run00129530
    gcd_file: exp/IceCube/2017/filtered/level2/0520/Run00129530/Level2_IC86.2017_data_Run00129530_0520_63_312_GCD.i3.zst
    livetime: 28806.19
    storage_year: '2017'
    storage_date: '0520'
    num_files: 217
```
Key points:
- `dataset_id` is typically a year (for `exp`) or a set number (for `nugen`, `corsika`).
- `dataset_type` matches one of the supported categories: `exp`, `nugen`, `corsika`, `muongun`.
- `dataset_path` is a base directory relative to the local input base path defined in the cluster config.
- `file_pattern` contains wildcards such as `{run_id}`, `{file_id}`, and possibly additional per-run wildcards like `{storage_year}` or `{storage_date}`.
- `runs` enumerates individual runs or run ranges and declares:
    - `run_id` (e.g. `Run00134674` for experimental data, `0000000-0000999` for NuGen).
    - `gcd_file` (per-run or dataset-wide in the YAML header).
    - `livetime` (for experimental data only, in seconds).
    - `num_files`: the expected number of files for that run/run range; the workflow will error if the discovered number differs.
The same structure applies to MC configs with appropriate paths and filename patterns.
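The expansion of `file_pattern` into concrete paths can be sketched as follows. All names here are illustrative, and the 8-digit zero-padding of `{file_id}` is an assumption; see `utils/dataset_manager.py` for the real logic.

```python
# Expand a dataset YAML entry into the list of expected input files.
run = {
    "run_id": "Run00129530",
    "storage_year": "2017",
    "storage_date": "0520",
}
num_files = 217
pattern = (
    "{storage_year}/filtered/level2/{storage_date}/{run_id}/"
    "Level2_IC86.2017_data_{run_id}_Subrun00000000_{file_id}.i3.zst"
)

# Assumed 8-digit zero-padded file IDs.
paths = [pattern.format(file_id=f"{i:08d}", **run) for i in range(num_files)]

# Mirror the workflow's consistency check: discovered files must match num_files.
assert len(paths) == num_files
print(paths[0])
```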
### DatasetManager
The `DatasetManager` class in `utils/dataset_manager.py` is the central piece that:

- Parses dataset YAMLs.
- Expands file patterns and wildcards for each run.
- Validates that the number of input files matches `num_files`.
- Decides whether a dataset needs merging (e.g. `exp`, `corsika`, `muongun`) or not (e.g. `nugen`).
- Constructs the expected output file paths for each workflow step.
Snakemake rules call into the `DatasetManager` to get lists of inputs and outputs, ensuring a consistent structure across steps.
## Local Configuration and Cluster Profiles

### Local Configs

The local configuration files under `config/local_configs/` specify site-specific settings, such as:
- Input base directory (where raw Level 2 i3 files live).
- Output base directory (where processed files are written).
- Staging options (whether to copy data to local scratch, which scratch paths to use).
- Paths to containers or environment modules if needed.
Example files include:

- `config/local_configs/madison.yaml`
- `config/local_configs/msu.yaml`
These files are referenced by the Snakemake profile or an environment variable (e.g. `LT_LOCAL_CONFIG`) so that the workflow knows where to read and write data.
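A minimal local config might look like the following. All key names here are illustrative assumptions, not the actual schema; consult the existing `madison.yaml` / `msu.yaml` for the real keys.

```yaml
# Hypothetical site config -- key names are assumptions, not the real schema.
input_base_dir: /data/exp_and_sim          # where raw Level 2 i3 files live
output_base_dir: /scratch/user/lt_outputs  # where processed files are written
stage_inputs: true                         # copy inputs to local scratch first
scratch_dir: /tmp/lt_staging
container: /containers/icetray.sif
```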
### Snakemake Profiles

Cluster-specific behavior is captured in profiles under `config/profiles/`. For example:
- `config/profiles/msu/` contains a Snakemake profile tailored to the MSU Slurm cluster, including:
    - `config.yaml` for default CLI options (e.g. `--executor slurm`, a default configfile, etc.).
    - `cluster.yaml` or similar for job resource templates (CPU, memory, time).
Using a profile allows you to run the workflow with a concise command while keeping site-specific details out of the main Snakefile.
## Running the Workflow

### Basic Execution

To run the workflow without a profile, you can invoke Snakemake directly (adjust paths and options as needed):
```bash
snakemake \
    --configfile config/datasets/exp/2020.yaml \
    --executor slurm \
    --jobs 100
```
With a profile (recommended), the command becomes:
```bash
snakemake \
    --profile config/profiles/msu \
    --configfile config/datasets/exp/2020.yaml \
    --jobs 100
```
You can use the same pattern for other dataset configs under config/datasets/.
### Convenience Script

The top-level script `run_all.sh` provides a convenient way to loop over multiple dataset configs (e.g. all `exp` years) locally or on a development node. A typical invocation is:
```bash
./run_all.sh config/datasets/exp/*.yaml
```
The script uses `snakemake --configfile <cfg> -c 128 --rerun-incomplete --keep-going --config start_at_step=7` to quickly regenerate high-level outputs, but you can adjust it as needed.
## Workflow Controls
The workflow exposes several config variables and meta-rules that make large-scale processing manageable.
### Start-at-Step

To speed up DAG creation or skip earlier steps that are already complete, you can start at a later step via `start_at_step`:

```bash
snakemake \
    --profile config/profiles/msu \
    --configfile config/datasets/exp/2020.yaml \
    --config start_at_step=6
```
Step numbers correspond to:
- `01_filter_models`
- `02_RNN_directional_reco`
- `03_RNN_energy_reco`
- `04_MuEx`
- `05_STV`
- `06_TNF_reco`
- `07_final_cut_models`
- `08_hdf_export`
This is particularly useful when iterating on the later steps (e.g. TNF or final cuts) without having to re-run filters and earlier reconstructions.
### Run Batching

For large datasets, you can process runs in batches using `run_batch_size` and `run_batch_id` in the Snakemake config context:

```bash
snakemake \
    --profile config/profiles/msu \
    --configfile config/datasets/corsika/20904.yaml \
    --config run_batch_size=300 run_batch_id=0
```

- `run_batch_size` sets how many runs are included in each batch.
- `run_batch_id` (0-based integer) selects which batch to process.
This reduces DAG size and memory usage for very large datasets.
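The batching semantics described above can be sketched as a simple slice over the run list; this is illustrative only, not the workflow's exact code.

```python
# Sketch of how run_batch_size / run_batch_id partition a run list.
runs = [f"Run{i:08d}" for i in range(700)]  # hypothetical run list

run_batch_size = 300
run_batch_id = 0  # 0-based

num_batches = -(-len(runs) // run_batch_size)  # ceiling division -> 3 batches
start = run_batch_id * run_batch_size
batch = runs[start:start + run_batch_size]

print(num_batches, len(batch))  # last batch would hold the remaining 100 runs
```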
### Meta-Rules

For each step, the workflow automatically defines meta-rules that make it easy to request subsets of outputs:
- `benchmark_<step>`: produce only the first missing output for a given step; useful for quick end-to-end timing.
- `<step>`: request all outputs of that step.
- `run_missing_<step>`: request only outputs that are missing on disk.
There is also a special meta-rule `run_missing_final` that requests only missing final HDF5 outputs (`08_hdf_export`).

Use `run_missing_*` with some care: files might exist but be incomplete if a previous run was interrupted. When in doubt, prefer `--rerun-incomplete`.
## Output Layout and Merging

### Output Directory Structure

All outputs are written under the `output_base_dir` specified in your local config. The generic pattern is:

```text
{output_base_dir}/{step}/{dataset_type}/{dataset_id}/...
```
For a step like `06_TNF_reco`, this becomes:

- For merged datasets (`needs_merging = true`, e.g. `exp`, `corsika`, `muongun`):

    ```text
    {output_base_dir}/06_TNF_reco/{dataset_type}/{dataset_id}/{RunID}.i3.zst
    ```

- For non-merged datasets (`needs_merging = false`, e.g. `nugen`):

    ```text
    {output_base_dir}/06_TNF_reco/{dataset_type}/{dataset_id}/{RunID}/Level2_..._{file_id}.i3.zst
    ```
The HDF5 export step (`08_hdf_export`) uses the same pattern but with `.h5` files.
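The two layouts can be summarized in a small path helper; the function and argument names are illustrative (the real logic lives in `DatasetManager`).

```python
# Sketch of the per-step output path pattern for merged vs. non-merged datasets.
def output_path(output_base_dir, step, dataset_type, dataset_id,
                run_id, file_name=None, needs_merging=True):
    base = f"{output_base_dir}/{step}/{dataset_type}/{dataset_id}"
    if needs_merging:
        # merged datasets (exp, corsika, muongun): one i3 file per run
        return f"{base}/{run_id}.i3.zst"
    # non-merged datasets (nugen): one output per input file, nested per run
    return f"{base}/{run_id}/{file_name}"

print(output_path("/out", "06_TNF_reco", "exp", "2017", "Run00129530"))
```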
### Merging Behavior
- Experimental, CORSIKA, MuonGun (`needs_merging = true`):
    - After the initial starting-track model cut, i3 files for each run are merged into a single i3 file per run.
    - Subsequent steps operate on these merged files, reducing file count and simplifying bookkeeping.
- NuGen (`needs_merging = false`):
    - Files remain unmerged through most of the workflow because a large fraction of events survive the initial cuts.
    - Merging to dataset level happens later at the HDF/NumPy stage.

The final HDF5 export writes:

- One file per run for merged datasets (`exp`, `corsika`, `muongun`).
- One file per input file for non-merged datasets (`nugen`).
Currently there is no dataset-level HDF5 merge in the workflow; merging happens in the analysis notebooks.
## High-Level Postprocessing (Notebooks)

Two Jupyter notebooks in `notebooks/` handle the transition from per-run HDF5 to analysis-ready arrays and Csky samples.
### 1. HDF → Merged NumPy (`load_hdf.ipynb`)

**Inputs**

- HDF5 files from `output_base_dir/08_hdf_export/{dataset_type}/{dataset_id}/.../*.h5`.
- Dataset configs from `config/datasets/`.

**Outputs**

- Experimental data:
    - `09_merged_npy/exp/{YYYY}.npy`
    - `09_merged_npy/exp/livetimes.npy`
- NuGen: `09_merged_npy/nugen/{dataset_id}.npy` per set.
- CORSIKA: `09_merged_npy/corsika/{dataset_id}.npy` per set.
**What it does**

- Loads selected keys from HDF and flattens them to structured NumPy columns (LCSC scores, RNN/TNF outputs, STV variables, MuEx energy, RA/Dec, etc.).
- For NuGen: re-normalizes `OneWeight` by `NEvents` and optionally includes classification labels.
- For CORSIKA: computes `CorsikaWeight` via `simweights` (GlobalFitGST by default) and normalizes per-file counts.
- Ensures STV variables exist (fills with -1 where not applicable).
- Merges per-run/per-file HDFs into per-year (exp) or per-dataset (MC) structured arrays.
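The final merge step amounts to concatenating per-run structured arrays into one array; a minimal sketch with illustrative field names:

```python
import numpy as np

# Per-run structured arrays (field names are illustrative).
dtype = [("run", "i8"), ("logE", "f8"), ("dec", "f8")]
per_run = [
    np.array([(129530, 3.1, -0.2), (129530, 4.0, 0.5)], dtype=dtype),
    np.array([(129531, 2.7, 1.1)], dtype=dtype),
]

# Concatenate into one per-year (or per-dataset) array.
merged = np.concatenate(per_run)
print(merged.shape)  # (3,)
```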
**Notes**

- The notebook can be run on any node with sufficient memory and a suitable Python environment.
- It typically reads a local config (e.g. via `LT_LOCAL_CONFIG`) to infer `output_base_dir` and dataset paths.
- Threading can be controlled via an internal `NUM_THREADS` parameter.
### 2. Merged NumPy → Csky NPY (`csky_export.ipynb`)

**Inputs**

- Merged NPYs from `09_merged_npy/exp` and `09_merged_npy/nugen`.
- DNNC sample reference NPYs (for overlap removal), if available.

**Outputs**

- Versioned Csky NPY samples under `10_csky_npy/{VERSION}-starting` and `10_csky_npy/{VERSION}-throughgoing`.
- DNNC-removed variants under `10_csky_npy/{VERSION}-starting-dnnc-removed` and `10_csky_npy/{VERSION}-throughgoing-dnnc-removed`.
- A helper CSV `10_csky_npy/{VERSION}_runs.csv` listing runs and seasons for GRL creation.
**What it does**

- Applies the final selection: quality cuts plus thresholds on starting/throughgoing MLP scores.
- Computes a simple energy-dependent pull correction for TNF sigma:
    - Fits splines to the median pull vs. MuEx energy for starting and throughgoing samples.
    - Defines `tnf_sigma_corrected`, used as the Csky `angErr` field.
- Maps columns to Csky's schema: `run`, `event`, `subevent`, `time` (MJD), `logE`, `ra`, `dec`, `angErr`, plus MC truth.
- Removes overlaps with DNNC samples using event-level joins where reference files exist.
- Writes final NPY files in versioned subdirectories.
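The column mapping can be sketched as a structured-array rename. The source field names (`muex_logE`, `tnf_ra`, …) are illustrative assumptions; only the target schema follows the text above.

```python
import numpy as np

# One example event with hypothetical source field names.
src = np.array(
    [(135000, 1, 0, 58849.5, 3.2, 1.57, -0.3, 0.01)],
    dtype=[("run", "i8"), ("event", "i8"), ("subevent", "i8"),
           ("mjd", "f8"), ("muex_logE", "f8"), ("tnf_ra", "f8"),
           ("tnf_dec", "f8"), ("tnf_sigma_corrected", "f8")],
)

# Csky field name -> source field name.
mapping = {"run": "run", "event": "event", "subevent": "subevent",
           "time": "mjd", "logE": "muex_logE", "ra": "tnf_ra",
           "dec": "tnf_dec", "angErr": "tnf_sigma_corrected"}

# Build the output array with Csky's schema and copy columns over.
out = np.empty(src.shape, dtype=[(name, src.dtype[old].str)
                                 for name, old in mapping.items()])
for name, old in mapping.items():
    out[name] = src[old]

print(out["angErr"][0])
```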
**GRL Generation**

- A separate tool (`nusources_dataset_converters/grl`) can be used to build official GRLs from the runs CSV.
- The notebook typically prints example commands for `create_dataset_grl.py` with the correct paths.
## Logging and Diagnostics

Snakemake automatically stores logs and metadata in its `.snakemake` directory. Individual rules also write log files to step-specific locations (as configured in the `Snakefile`). There is no need for a top-level `logs/` directory.
For debugging:
- Use `-n` / `--dry-run` to inspect the DAG and planned jobs.
- Increase verbosity with `-p` (print shell commands) and `-r` (print reasons for rule execution).
- Use `--summary` and `--detailed-summary` to inspect file completion status.
## Extending the Workflow

To add a new processing step or modify an existing one:
1. **Add or modify a rule in `Snakefile`**
    - Define clear input and output patterns that fit into the existing `{step}/{dataset_type}/{dataset_id}/...` layout.
    - Reuse `DatasetManager` helper functions wherever possible to avoid duplicating logic.
2. **Implement or update a script in `processing_scripts/`**
    - Keep dependencies minimal; prefer using existing containers and environment definitions.
3. **Update profiles or local configs if resources change**
    - For example, adjust CPU/GPU requests or memory in `config/profiles/msu/`.
4. **Test with a small dataset**
    - Use a single run or a small batch with `run_batch_size` to validate behavior before scaling up.
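Following the steps above, a new rule might look roughly like this. The rule name, script path, and the `dataset_manager` helper are hypothetical; adapt them to the actual `Snakefile` conventions.

```
rule my_new_step:
    input:
        # previous step's outputs for the same run (hypothetical helper)
        lambda wc: dataset_manager.outputs("06_TNF_reco", wc)
    output:
        "{output_base_dir}/06b_my_new_step/{dataset_type}/{dataset_id}/{run_id}.i3.zst"
    shell:
        "python processing_scripts/my_new_step.py {input} {output}"
```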
This separation between selection logic and processing infrastructure should make it straightforward to evolve the workflow as models and analysis needs develop.