
Processing Workflow

Work in Progress

For now this is largely copied from the old README.

This section documents the technical implementation of the Lightning Tracks processing chain: how Snakemake organizes the steps, how datasets are described, where outputs go, and how to control and run the workflow on different systems.

For the physics narrative of the selection itself (why a particular cut or model was chosen, how the final performance looks), see Selection instead.

Workflow Overview

The LT processing pipeline is implemented as a Snakemake workflow driven by the top-level Snakefile. It:

  • Applies ML filter models and reconstructions to IceCube Level 2 data.
  • Manages dependencies between steps via file-based rules.
  • Supports multiple dataset types (exp, nugen, corsika, muongun).
  • Runs on different backends (local execution, Slurm clusters, HTCondor via profiles).
    • When enabled in the config, this includes staging input data from the Madison file system: files are downloaded temporarily as needed and deleted once processed.

At a high level, the workflow steps are:

  1. 01_filter_models: apply CNN/MLP filter models to Level 2 events.
  2. 02_RNN_directional_reco: run RNN directional reconstruction.
  3. 03_RNN_energy_reco: run RNN energy reconstruction (optional for physics).
  4. 04_MuEx: run MuEx energy estimation on reconstructed tracks.
  5. 05_STV: run ESTES Starting Track Veto for starting-track candidates.
  6. 06_TNF_reco: run Normalizing Flow (TNF) directional reconstruction.
  7. 07_final_cut_models: evaluate final starting/throughgoing MLPs.
  8. 08_hdf_export: export key variables to per-run HDF5.

The postprocessing from HDF5 to merged NumPy and Csky NPY is handled by notebooks and described below.

Repository Layout (Processing-Relevant Parts)

The following directories and files are central to the workflow implementation:

  • Snakefile: main workflow file defining rules and meta-rules.
  • config/
    • datasets/: YAML dataset definitions, grouped by dataset type.
      • exp/: experimental data configs (per year).
      • nugen/: NuGen simulation configs (per set).
      • corsika/: CORSIKA simulation configs.
    • local_configs/: environment-specific configs (e.g. madison.yaml, msu.yaml) that define paths and staging behavior.
    • profiles/: cluster profiles for Snakemake (e.g. profiles/msu/ for Slurm on MSU HPCC).
  • processing_scripts/: Python scripts called by Snakemake rules to execute individual steps (filters, reconstructions, exports, etc.).
  • models/: on-disk locations of pre-trained ML models (not version-controlled, assumed to be present at runtime).
  • utils/dataset_manager.py: core logic for interpreting dataset configs and constructing lists of input and output files.

The rest of the repository (e.g. notebooks/, docs/) is used for postprocessing and documentation.

Dataset Configuration and Management

Datasets and Samples

The workflow supports several dataset categories, each described by YAML configs under config/datasets/:

  • Experimental data (exp) – yearly Level 2 data (e.g. 2017, 2018, …), with runs, GCDs, and livetimes explicitly listed.
  • Neutrino-Generator (nugen) – signal-dominated neutrino simulations, organized by set ID.
  • CORSIKA (corsika) – atmospheric muon background simulations.
  • MuonGun (muongun, optional) – alternative atmospheric muon generator; currently not central but supported by the same machinery.

Each config specifies:

  • dataset_id and dataset_type (e.g. 2017, exp).
  • Input locations and filename patterns with wildcards like {run_id} and {file_id}.
  • A list of runs or run ranges, their GCD files, and expected file counts.
  • For data, the livetime per run (used later for GRL construction).

The DatasetManager (see Processing) interprets these configs and constructs the input/output file lists used by Snakemake.

Dataset YAMLs

Each dataset is described by a YAML file under config/datasets/<dataset_type>/. A typical experimental dataset config looks like:

dataset_id: '2017'
dataset_type: exp
dataset_path: exp/IceCube
file_pattern: '{storage_year}/filtered/level2/{storage_date}/{run_id}/Level2_IC86.2017_data_{run_id}_Subrun00000000_{file_id}.i3.zst'
exclude: '_IT.i3.zst'
runs:
    - run_id: Run00129530
      gcd_file: exp/IceCube/2017/filtered/level2/0520/Run00129530/Level2_IC86.2017_data_Run00129530_0520_63_312_GCD.i3.zst
      livetime: 28806.19
      storage_year: '2017'
      storage_date: '0520'
      num_files: 217

Key points:

  • dataset_id is typically a year (for exp) or a set number (for nugen, corsika).
  • dataset_type matches one of the supported categories: exp, nugen, corsika, muongun.
  • dataset_path is a base directory relative to the local input base path defined in the cluster config.
  • file_pattern contains wildcards such as {run_id}, {file_id}, and possibly additional per-run wildcards like {storage_year} or {storage_date}.
  • runs enumerates individual runs or run ranges and declares:
    • run_id (e.g. Run00134674 for experimental data, 0000000-0000999 for NuGen).
    • gcd_file (per-run or dataset-wide in the YAML header).
    • livetime (for experimental data only, in seconds).
    • num_files: expected number of files for that run/run-range; the workflow will error if the discovered number differs.

The same structure applies to MC configs with appropriate paths and filename patterns.

DatasetManager

The DatasetManager class in utils/dataset_manager.py is the central piece that:

  • Parses dataset YAMLs.
  • Expands file patterns and wildcards for each run.
  • Validates that the number of input files matches num_files.
  • Decides whether a dataset needs merging (e.g. exp, corsika, muongun) or not (e.g. nugen).
  • Constructs the expected output file paths for each workflow step.

Snakemake rules call into the DatasetManager to get lists of inputs and outputs, ensuring a consistent structure across steps.
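To make the responsibilities above concrete, here is a minimal sketch of the pattern expansion and file-count validation. The function names and the zero-padded file index are illustrative assumptions, not the actual DatasetManager API; the config/run dicts mirror the YAML shown earlier.

```python
from pathlib import Path

def expand_run_files(config: dict, run: dict) -> list:
    """Expand the dataset file_pattern for one run entry (sketch)."""
    pattern = config["file_pattern"]
    # Per-run wildcards (storage_year, storage_date, ...) come from the run
    # entry; the 8-digit zero-padded file_id is an assumption for illustration.
    extra = {k: v for k, v in run.items()
             if k not in ("run_id", "gcd_file", "livetime", "num_files")}
    return [pattern.format(run_id=run["run_id"], file_id=f"{i:08d}", **extra)
            for i in range(run["num_files"])]

def validate_run(config: dict, run: dict, base: Path) -> None:
    """Fail loudly if the number of files found differs from num_files."""
    found = [p for p in expand_run_files(config, run)
             if (base / config["dataset_path"] / p).exists()]
    if len(found) != run["num_files"]:
        raise RuntimeError(
            f"{run['run_id']}: expected {run['num_files']} files, "
            f"found {len(found)}")
```

The real class additionally handles run ranges, per-dataset GCDs, and the exclude filter.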

Local Configuration and Cluster Profiles

Local Configs

The local configuration files under config/local_configs/ specify site-specific settings, such as:

  • Input base directory (where raw Level 2 i3 files live).
  • Output base directory (where processed files are written).
  • Staging options (whether to copy data to local scratch, which scratch paths to use).
  • Paths to containers or environment modules if needed.

Example files include:

  • config/local_configs/madison.yaml
  • config/local_configs/msu.yaml

These files are referenced by the Snakemake profile or an environment variable (e.g. LT_LOCAL_CONFIG) so that the workflow knows where to read and write data.
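A minimal sketch of this lookup, assuming LT_LOCAL_CONFIG holds a path to the YAML file; the default path and helper names are hypothetical, and the actual resolution lives in the Snakefile:

```python
import os
from pathlib import Path

# Assumed fallback when LT_LOCAL_CONFIG is unset (illustrative only).
DEFAULT_LOCAL_CONFIG = "config/local_configs/madison.yaml"

def resolve_local_config() -> Path:
    """Pick the site config: LT_LOCAL_CONFIG wins, otherwise the default."""
    return Path(os.environ.get("LT_LOCAL_CONFIG", DEFAULT_LOCAL_CONFIG))

def load_local_config(path: Path) -> dict:
    """Parse the YAML site config (requires PyYAML at runtime)."""
    import yaml  # deferred so resolve_local_config() works without PyYAML
    with open(path) as f:
        return yaml.safe_load(f)
```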

Snakemake Profiles

Cluster-specific behavior is captured in profiles under config/profiles/. For example:

  • config/profiles/msu/ contains a Snakemake profile tailored to the MSU Slurm cluster, including:
    • config.yaml for default CLI options (e.g. --executor slurm, default configfile, etc.).
    • cluster.yaml or similar for job resource templates (CPU, memory, time).

Using a profile allows you to run the workflow with a concise command while keeping site-specific details out of the main Snakefile.

Running the Workflow

Basic Execution

To run the workflow without a profile, you can invoke Snakemake directly (adjust paths and options as needed):

snakemake \
    --configfile config/datasets/exp/2020.yaml \
    --executor slurm \
    --jobs 100

With a profile (recommended), the command becomes:

snakemake \
    --profile config/profiles/msu \
    --configfile config/datasets/exp/2020.yaml \
    --jobs 100

You can use the same pattern for other dataset configs under config/datasets/.

Convenience Script

The top-level script run_all.sh provides a convenient way to loop over multiple dataset configs (e.g. all exp years) locally or on a development node. A typical invocation is:

./run_all.sh config/datasets/exp/*.yaml

The script uses snakemake --configfile <cfg> -c 128 --rerun-incomplete --keep-going --config start_at_step=7 to quickly regenerate high-level outputs, but you can adjust it as needed.

Workflow Controls

The workflow exposes several config variables and meta-rules that make large-scale processing manageable.

Start-at-Step

To speed up DAG creation or skip earlier steps that are already complete, you can start at a later step via start_at_step:

snakemake \
    --profile config/profiles/msu \
    --configfile config/datasets/exp/2020.yaml \
    --config start_at_step=6

Step numbers correspond to:

  1. 01_filter_models
  2. 02_RNN_directional_reco
  3. 03_RNN_energy_reco
  4. 04_MuEx
  5. 05_STV
  6. 06_TNF_reco
  7. 07_final_cut_models
  8. 08_hdf_export

This is particularly useful when iterating on the later steps (e.g. TNF or final cuts) without having to re-run filters and earlier reconstructions.
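How start_at_step gates the DAG can be pictured with a sketch like the following; the gating mechanism shown here is an assumption about the Snakefile's internals, not its actual code:

```python
# All steps in order; start_at_step is 1-based, matching the list above.
STEPS = [
    "01_filter_models", "02_RNN_directional_reco", "03_RNN_energy_reco",
    "04_MuEx", "05_STV", "06_TNF_reco", "07_final_cut_models",
    "08_hdf_export",
]

def active_steps(config: dict) -> list:
    """Steps whose rules would be included in the DAG (illustrative)."""
    start = int(config.get("start_at_step", 1))
    return STEPS[start - 1:]
```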

Run Batching

For large datasets, you can process runs in batches using run_batch_size and run_batch_id in the Snakemake config context:

snakemake \
    --profile config/profiles/msu \
    --configfile config/datasets/corsika/20904.yaml \
    --config run_batch_size=300 run_batch_id=0

  • run_batch_size sets how many runs are included in each batch.
  • run_batch_id (0-based integer) selects which batch to process.

This reduces DAG size and memory usage for very large datasets.
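The batching implied by these two variables amounts to a simple slice of the run list; this sketch is an illustration of the semantics, not the workflow's actual code:

```python
def select_batch(runs, batch_size, batch_id):
    """Return the runs belonging to batch `batch_id` (0-based).

    A batch_size of 0/None disables batching and keeps all runs.
    """
    if not batch_size:
        return list(runs)
    start = batch_id * batch_size
    return list(runs)[start:start + batch_size]
```

The last batch may be shorter than run_batch_size; requesting a batch_id past the end simply yields no runs.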

Meta-Rules

For each step, the workflow automatically defines meta-rules that make it easy to request subsets of outputs:

  • benchmark_<step>: produce only the first missing output for a given step; useful for quick end-to-end timing.
  • <step>: request all outputs of that step.
  • run_missing_<step>: request only outputs that are missing on disk.

There is also a special meta-rule run_missing_final that requests only missing final HDF5 outputs (08_hdf_export).

Use run_missing_* with some care: files might exist but be incomplete if a previous run was interrupted. When in doubt, prefer --rerun-incomplete.
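Conceptually, a run_missing_<step> meta-rule reduces to filtering the expected outputs by existence on disk; a minimal sketch (the helper name is hypothetical):

```python
import os

def missing_outputs(expected):
    """Targets for a run_missing_<step> meta-rule: files absent on disk.

    Note: existence is not completeness -- an interrupted job can leave a
    truncated file behind, which is why --rerun-incomplete is safer when
    in doubt.
    """
    return [path for path in expected if not os.path.exists(path)]
```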

Output Layout and Merging

Output Directory Structure

All outputs are written under the output_base_dir specified in your local config. The generic pattern is:

{output_base_dir}/{step}/{dataset_type}/{dataset_id}/...

For a step like 06_TNF_reco, this becomes:

  • For merged datasets (needs_merging = true, e.g. exp, corsika, muongun):

    {output_base_dir}/06_TNF_reco/{dataset_type}/{dataset_id}/{RunID}.i3.zst

  • For non-merged datasets (needs_merging = false, e.g. nugen):

    {output_base_dir}/06_TNF_reco/{dataset_type}/{dataset_id}/{RunID}/Level2_..._{file_id}.i3.zst

The HDF5 export step (08_hdf_export) uses the same pattern but with .h5 files.
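The two layouts can be summarized in one path-building helper; this is an illustrative sketch, and for the unmerged case the input's original file name is kept rather than guessing the elided Level2 stem:

```python
def output_path(output_base_dir, step, dataset_type, dataset_id,
                run_id, needs_merging, input_name=None):
    """Build an output path following the layout above (illustrative)."""
    base = f"{output_base_dir}/{step}/{dataset_type}/{dataset_id}"
    if needs_merging:
        # exp, corsika, muongun: one merged file per run
        return f"{base}/{run_id}.i3.zst"
    # nugen: one output per input file, nested under the run directory
    return f"{base}/{run_id}/{input_name}"
```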

Merging Behavior

  • Experimental, CORSIKA, MuonGun (needs_merging = true):

    • After the initial starting-track model cut, i3 files for each run are merged into a single i3 file per run.
    • Subsequent steps operate on these merged files, reducing file count and simplifying bookkeeping.
  • NuGen (needs_merging = false):

    • Files remain unmerged through most of the workflow because a large fraction of events survive the initial cuts.
    • Merging to dataset-level happens later at the HDF/NumPy stage.

The final HDF5 export writes:

  • One file per run for merged datasets (exp, corsika, muongun).
  • One file per input file for non-merged datasets (nugen).

Currently there is no dataset-level HDF5 merge in the workflow; merging happens in the analysis notebooks.

High-Level Postprocessing (Notebooks)

Two Jupyter notebooks in notebooks/ handle the transition from per-run HDF5 to analysis-ready arrays and Csky samples.

1. HDF → Merged NumPy (load_hdf.ipynb)

Inputs

  • HDF5 files from output_base_dir/08_hdf_export/{dataset_type}/{dataset_id}/.../*.h5.
  • Dataset configs from config/datasets/.

Outputs

  • Experimental data:
    • 09_merged_npy/exp/{YYYY}.npy
    • 09_merged_npy/exp/livetimes.npy
  • NuGen:
    • 09_merged_npy/nugen/{dataset_id}.npy per set.
  • CORSIKA:
    • 09_merged_npy/corsika/{dataset_id}.npy per set.

What it does

  • Loads selected keys from HDF and flattens them to structured NumPy columns (LCSC scores, RNN/TNF outputs, STV variables, MuEx energy, RA/Dec, etc.).
  • For NuGen:
    • Re-normalizes OneWeight by NEvents and optionally includes classification labels.
  • For CORSIKA:
    • Computes CorsikaWeight via simweights (GlobalFitGST by default) and normalizes per-file counts.
  • Ensures STV variables exist (fills with -1 where not applicable).
  • Merges per-run/per-file HDFs into per-year (exp) or per-dataset (MC) structured arrays.
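The OneWeight renormalization follows the usual NuGen recipe; this sketch assumes the standard OneWeight convention, and details such as the nu/nubar generation ratio are dataset-dependent and omitted here:

```python
import numpy as np

def nugen_weights(one_weight, n_events, n_files, flux):
    """Per-event weights from NuGen OneWeight (sketch of the usual recipe).

    one_weight : OneWeight column [GeV cm^2 sr]
    n_events   : NEvents per generated file
    n_files    : number of generated files in the set
    flux       : differential flux evaluated at each event's true energy
    """
    return np.asarray(one_weight) * np.asarray(flux) / (n_events * n_files)
```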

Notes

  • The notebook can be run on any node with sufficient memory and a suitable Python environment.
  • It typically reads a local config (e.g. LT_LOCAL_CONFIG) to infer output_base_dir and dataset paths.
  • Threading can be controlled via an internal NUM_THREADS parameter.

2. Merged NumPy → Csky NPY (csky_export.ipynb)

Inputs

  • Merged NPYs from 09_merged_npy/exp, 09_merged_npy/nugen.
  • DNNC sample reference NPYs (for overlap removal), if available.

Outputs

  • Versioned Csky NPY samples under 10_csky_npy/{VERSION}-starting and 10_csky_npy/{VERSION}-throughgoing.
  • DNNC-removed variants under 10_csky_npy/{VERSION}-starting-dnnc-removed and 10_csky_npy/{VERSION}-throughgoing-dnnc-removed.
  • A helper CSV 10_csky_npy/{VERSION}_runs.csv listing runs and seasons for GRL creation.

What it does

  • Applies the final selection: quality cuts plus thresholds on starting/throughgoing MLP scores.
  • Computes a simple energy-dependent pull correction for TNF sigma:
    • Fits splines to the median pull vs. MuEx energy for starting and throughgoing samples.
    • Defines tnf_sigma_corrected, used as the Csky angErr field.
  • Maps columns to Csky’s schema: run, event, subevent, time (MJD), logE, ra, dec, angErr, plus MC truth.
  • Removes overlaps with DNNC samples using event-level joins where reference files exist.
  • Writes final NPY files in versioned subdirectories.
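The event-level join used for DNNC overlap removal can be sketched as follows; field names assume the Csky schema listed above, and the set-based implementation is illustrative rather than the notebook's actual code:

```python
import numpy as np

def remove_overlap(sample, reference):
    """Drop events from `sample` whose (run, event, subevent) triple also
    appears in `reference` (sketch of the overlap removal)."""
    def keys(arr):
        # one (run, event, subevent) row per event
        return np.stack([arr["run"], arr["event"], arr["subevent"]], axis=1)
    ref = {tuple(row) for row in keys(reference)}
    mask = np.array([tuple(row) not in ref for row in keys(sample)])
    return sample[mask]
```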

GRL Generation

  • A separate tool (nusources_dataset_converters/grl) can be used to build official GRLs from the runs CSV.
  • The notebook typically prints example commands for create_dataset_grl.py with the correct paths.

Logging and Diagnostics

Snakemake automatically stores logs and metadata in its .snakemake directory. Individual rules also write log files to step-specific locations (as configured in the Snakefile). There is no need for a top-level logs/ directory.

For debugging:

  • Use -n / --dry-run to inspect the DAG and planned jobs.
  • Increase verbosity with -p (print shell commands) and -r (print reasons for rule execution).
  • Use --summary and --detailed-summary to inspect file completion status.

Extending the Workflow

To add a new processing step or modify an existing one:

  1. Add or modify a rule in Snakefile

    • Define clear input and output patterns that fit into the existing {step}/{dataset_type}/{dataset_id}/... layout.
    • Reuse DatasetManager helper functions wherever possible to avoid duplicating logic.
  2. Implement or update a script in processing_scripts/

    • Keep dependencies minimal; prefer using existing containers and environment definitions.
  3. Update profiles or local configs if resources change

    • For example, adjust CPU/GPU requests or memory in config/profiles/msu/.
  4. Test with a small dataset

    • Use a single run or a small batch with run_batch_size to validate behavior before scaling up.

This separation between selection logic and processing infrastructure should make it straightforward to evolve the workflow as models and analysis needs develop.