
Processing Workflow

Work in Progress

For now this is largely copied from the old README.

This section documents the technical implementation of the Lightning Tracks processing chain: how Snakemake organizes the steps, how datasets are described, where outputs go, and how to control and run the workflow on different systems.

For the physics narrative of the selection itself (why a particular cut or model was chosen, how the final performance looks), see Selection instead.

Workflow Overview

The LT processing pipeline is implemented as a Snakemake workflow driven by the top-level Snakefile. It:

  • Applies ML filter models and reconstructions to IceCube Level 2 data.
  • Manages dependencies between steps via file-based rules.
  • Supports multiple dataset types (exp, nugen, corsika, muongun).
  • Runs on different backends (local execution, Slurm clusters, HTCondor via profiles).
    • When enabled in the config, this includes staging input data from the Madison file system: files are downloaded temporarily as needed and deleted once processed.

At a high level, the workflow steps are:

  1. 01_filter_models: apply CNN/MLP filter models to Level 2 events.
  2. 02_RNN_directional_reco: run RNN directional reconstruction.
  3. 03_RNN_energy_reco: run RNN energy reconstruction (optional for physics).
  4. 04_MuEx: run MuEx energy estimation on reconstructed tracks.
  5. 05_STV: run ESTES Starting Track Veto for starting-track candidates.
  6. 06_TNF_reco: run Normalizing Flow (TNF) directional reconstruction.
  7. 07_final_cut_models: evaluate final starting/throughgoing MLPs.
  8. 08_hdf_export: export key variables to per-run HDF5.

The postprocessing from HDF5 to merged NumPy and Csky NPY is handled by notebooks and described below.

Repository Layout (Processing-Relevant Parts)

The following directories and files are central to the workflow implementation:

  • Snakefile: main workflow file defining rules and meta-rules.
  • config/
    • datasets/: YAML dataset definitions, grouped by dataset type.
      • exp/: experimental data configs (per year).
      • nugen/: NuGen simulation configs (per set).
      • corsika/: CORSIKA simulation configs.
    • local_configs/: environment-specific configs (e.g. madison.yaml, msu.yaml) that define paths and staging behavior.
    • profiles/: cluster profiles for Snakemake (e.g. profiles/msu/ for Slurm on MSU HPCC).
  • processing_scripts/: Python scripts called by Snakemake rules to execute individual steps (filters, reconstructions, exports, etc.).
  • models/: on-disk locations of pre-trained ML models (not version-controlled, assumed to be present at runtime).
  • utils/dataset_manager.py: core logic for interpreting dataset configs and constructing lists of input and output files.

The rest of the repository (e.g. notebooks/, docs/) is used for postprocessing and documentation.

Dataset Configuration and Management

Datasets and Samples

The workflow supports several dataset categories, each described by YAML configs under config/datasets/:

  • Experimental data (exp) – yearly Level 2 data (e.g. 2017, 2018, …), with runs, GCDs, and livetimes explicitly listed.
  • Neutrino-Generator (nugen) – signal-dominated neutrino simulations, organized by set ID.
  • CORSIKA (corsika) – atmospheric muon background simulations.
  • MuonGun (muongun, optional) – alternative atmospheric muon generator; currently not central but supported by the same machinery.

Each config specifies:

  • dataset_id and dataset_type (e.g. 2017, exp).
  • Input locations and filename patterns with wildcards like {run_id} and {file_id}.
  • A list of runs or run ranges, their GCD files, and expected file counts.
  • For data, the livetime per run (used later for GRL construction).

The DatasetManager (see Processing) interprets these configs and constructs the input/output file lists used by Snakemake.

Dataset YAMLs

Each dataset is described by a YAML file under config/datasets/<dataset_type>/. A typical experimental dataset config looks like:

dataset_id: '2017'
dataset_type: exp
dataset_path: exp/IceCube
file_pattern: '{storage_year}/filtered/level2/{storage_date}/{run_id}/Level2_IC86.2017_data_{run_id}_Subrun00000000_{file_id}.i3.zst'
exclude: '_IT.i3.zst'
runs:
    - run_id: Run00129530
      gcd_file: exp/IceCube/2017/filtered/level2/0520/Run00129530/Level2_IC86.2017_data_Run00129530_0520_63_312_GCD.i3.zst
      livetime: 28806.19
      storage_year: '2017'
      storage_date: '0520'
      num_files: 217

Key points:

  • dataset_id is typically a year (for exp) or a set number (for nugen, corsika).
  • dataset_type matches one of the supported categories: exp, nugen, corsika, muongun.
  • dataset_path is a base directory relative to the local input base path defined in the cluster config.
  • file_pattern contains wildcards such as {run_id}, {file_id}, and possibly additional per-run wildcards like {storage_year} or {storage_date}.
  • runs enumerates individual runs or run ranges and declares:
    • run_id (e.g. Run00134674 for experimental data, 0000000-0000999 for NuGen).
    • gcd_file (per-run or dataset-wide in the YAML header).
    • livetime (for experimental data only, in seconds).
    • num_files: expected number of files for that run/run-range; the workflow will error if the discovered number differs.

The same structure applies to MC configs with appropriate paths and filename patterns.

DatasetManager

The DatasetManager class in utils/dataset_manager.py is the central piece that:

  • Parses dataset YAMLs.
  • Expands file patterns and wildcards for each run.
  • Validates that the number of input files matches num_files.
  • Decides whether a dataset needs merging (e.g. exp, corsika, muongun) or not (e.g. nugen).
  • Constructs the expected output file paths for each workflow step.

Snakemake rules call into the DatasetManager to get lists of inputs and outputs, ensuring a consistent structure across steps.
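To make the responsibilities above concrete, here is a minimal sketch of the pattern expansion and file-count validation. The function names and the zero-padded file index are illustrative assumptions, not the actual DatasetManager API; the config/run dicts mirror the YAML shown earlier.

```python
from pathlib import Path

def expand_run_files(config: dict, run: dict) -> list:
    """Expand the dataset file_pattern for one run entry (sketch)."""
    pattern = config["file_pattern"]
    # Per-run wildcards (storage_year, storage_date, ...) come from the run
    # entry; the 8-digit zero-padded file_id is an assumption for illustration.
    extra = {k: v for k, v in run.items()
             if k not in ("run_id", "gcd_file", "livetime", "num_files")}
    return [pattern.format(run_id=run["run_id"], file_id=f"{i:08d}", **extra)
            for i in range(run["num_files"])]

def validate_run(config: dict, run: dict, base: Path) -> None:
    """Fail loudly if the number of files found differs from num_files."""
    found = [p for p in expand_run_files(config, run)
             if (base / config["dataset_path"] / p).exists()]
    if len(found) != run["num_files"]:
        raise RuntimeError(
            f"{run['run_id']}: expected {run['num_files']} files, "
            f"found {len(found)}")
```

The real class additionally handles run ranges, per-dataset GCDs, and the exclude filter.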

Local Configuration and Cluster Profiles

Local Configs

The local configuration files under config/local_configs/ specify site-specific settings, such as:

  • Input base directory (where raw Level 2 i3 files live).
  • Output base directory (where processed files are written).
  • Staging options (whether to copy data to local scratch, which scratch paths to use).
  • Paths to containers or environment modules if needed.

Example files include:

  • config/local_configs/madison.yaml
  • config/local_configs/msu.yaml

These files are referenced by the Snakemake profile or an environment variable (e.g. LT_LOCAL_CONFIG) so that the workflow knows where to read and write data.
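A minimal sketch of this lookup, assuming LT_LOCAL_CONFIG holds a path to the YAML file; the default path and helper names are hypothetical, and the actual resolution lives in the Snakefile:

```python
import os
from pathlib import Path

# Assumed fallback when LT_LOCAL_CONFIG is unset (illustrative only).
DEFAULT_LOCAL_CONFIG = "config/local_configs/madison.yaml"

def resolve_local_config() -> Path:
    """Pick the site config: LT_LOCAL_CONFIG wins, otherwise the default."""
    return Path(os.environ.get("LT_LOCAL_CONFIG", DEFAULT_LOCAL_CONFIG))

def load_local_config(path: Path) -> dict:
    """Parse the YAML site config (requires PyYAML at runtime)."""
    import yaml  # deferred so resolve_local_config() works without PyYAML
    with open(path) as f:
        return yaml.safe_load(f)
```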

Snakemake Profiles

Cluster-specific behavior is captured in profiles under config/profiles/. For example:

  • config/profiles/msu/ contains a Snakemake profile tailored to the MSU Slurm cluster, including:
    • config.yaml for default CLI options (e.g. --executor slurm, default configfile, etc.).
    • cluster.yaml or similar for job resource templates (CPU, memory, time).

Using a profile allows you to run the workflow with a concise command while keeping site-specific details out of the main Snakefile.

Running the Workflow

Basic Execution

To run the workflow without a profile, you can invoke Snakemake directly (adjust paths and options as needed):

snakemake \
    --configfile config/datasets/exp/2020.yaml \
    --executor slurm \
    --jobs 100

With a profile (recommended), the command becomes:

snakemake \
    --profile config/profiles/msu \
    --configfile config/datasets/exp/2020.yaml \
    --jobs 100

You can use the same pattern for other dataset configs under config/datasets/.

Convenience Script

The top-level script run_all.sh provides a convenient way to loop over multiple dataset configs (e.g. all exp years) locally or on a development node. A typical invocation is:

./run_all.sh config/datasets/exp/*.yaml

The script uses snakemake --configfile <cfg> -c 128 --rerun-incomplete --keep-going --config start_at_step=7 to quickly regenerate high-level outputs, but you can adjust it as needed.

Workflow Controls

The workflow exposes several config variables and meta-rules that make large-scale processing manageable.

Start-at-Step

To speed up DAG creation or skip earlier steps that are already complete, you can start at a later step via start_at_step:

snakemake \
    --profile config/profiles/msu \
    --configfile config/datasets/exp/2020.yaml \
    --config start_at_step=6

Step numbers correspond to:

  1. 01_filter_models
  2. 02_RNN_directional_reco
  3. 03_RNN_energy_reco
  4. 04_MuEx
  5. 05_STV
  6. 06_TNF_reco
  7. 07_final_cut_models
  8. 08_hdf_export

This is particularly useful when iterating on the later steps (e.g. TNF or final cuts) without having to re-run filters and earlier reconstructions.
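How start_at_step gates the DAG can be pictured with a sketch like the following; the gating mechanism shown here is an assumption about the Snakefile's internals, not its actual code:

```python
# All steps in order; start_at_step is 1-based, matching the list above.
STEPS = [
    "01_filter_models", "02_RNN_directional_reco", "03_RNN_energy_reco",
    "04_MuEx", "05_STV", "06_TNF_reco", "07_final_cut_models",
    "08_hdf_export",
]

def active_steps(config: dict) -> list:
    """Steps whose rules would be included in the DAG (illustrative)."""
    start = int(config.get("start_at_step", 1))
    return STEPS[start - 1:]
```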

Run Batching

For large datasets, you can process runs in batches using run_batch_size and run_batch_id in the Snakemake config context:

snakemake \
    --profile config/profiles/msu \
    --configfile config/datasets/corsika/20904.yaml \
    --config run_batch_size=300 run_batch_id=0

  • run_batch_size sets how many runs are included in each batch.
  • run_batch_id (0-based integer) selects which batch to process.

This reduces DAG size and memory usage for very large datasets.
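The batching implied by these two variables amounts to a simple slice of the run list; this sketch is an illustration of the semantics, not the workflow's actual code:

```python
def select_batch(runs, batch_size, batch_id):
    """Return the runs belonging to batch `batch_id` (0-based).

    A batch_size of 0/None disables batching and keeps all runs.
    """
    if not batch_size:
        return list(runs)
    start = batch_id * batch_size
    return list(runs)[start:start + batch_size]
```

The last batch may be shorter than run_batch_size; requesting a batch_id past the end simply yields no runs.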

Meta-Rules

For each step, the workflow automatically defines meta-rules that make it easy to request subsets of outputs:

  • benchmark_<step>: produce only the first missing output for a given step; useful for quick end-to-end timing.
  • <step>: request all outputs of that step.
  • run_missing_<step>: request only outputs that are missing on disk.

There is also a special meta-rule run_missing_final that requests only missing final HDF5 outputs (08_hdf_export).

Use run_missing_* with some care: files might exist but be incomplete if a previous run was interrupted. When in doubt, prefer --rerun-incomplete.
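Conceptually, a run_missing_<step> meta-rule reduces to filtering the expected outputs by existence on disk; a minimal sketch (the helper name is hypothetical):

```python
import os

def missing_outputs(expected):
    """Targets for a run_missing_<step> meta-rule: files absent on disk.

    Note: existence is not completeness -- an interrupted job can leave a
    truncated file behind, which is why --rerun-incomplete is safer when
    in doubt.
    """
    return [path for path in expected if not os.path.exists(path)]
```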

Output Layout and Merging

Output Directory Structure

All outputs are written under the output_base_dir specified in your local config. The generic pattern is:

{output_base_dir}/{step}/{dataset_type}/{dataset_id}/...

For a step like 06_TNF_reco, this becomes:

  • For merged datasets (needs_merging = true, e.g. exp, corsika, muongun):

    {output_base_dir}/06_TNF_reco/{dataset_type}/{dataset_id}/{RunID}.i3.zst

  • For non-merged datasets (needs_merging = false, e.g. nugen):

    {output_base_dir}/06_TNF_reco/{dataset_type}/{dataset_id}/{RunID}/Level2_..._{file_id}.i3.zst

The HDF5 export step (08_hdf_export) uses the same pattern but with .h5 files.
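The two layouts can be summarized in one path-building helper; this is an illustrative sketch, and for the unmerged case the input's original file name is kept rather than guessing the elided Level2 stem:

```python
def output_path(output_base_dir, step, dataset_type, dataset_id,
                run_id, needs_merging, input_name=None):
    """Build an output path following the layout above (illustrative)."""
    base = f"{output_base_dir}/{step}/{dataset_type}/{dataset_id}"
    if needs_merging:
        # exp, corsika, muongun: one merged file per run
        return f"{base}/{run_id}.i3.zst"
    # nugen: one output per input file, nested under the run directory
    return f"{base}/{run_id}/{input_name}"
```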

Merging Behavior

  • Experimental, CORSIKA, MuonGun (needs_merging = true):

    • After the initial starting-track model cut, i3 files for each run are merged into a single i3 file per run.
    • Subsequent steps operate on these merged files, reducing file count and simplifying bookkeeping.
  • NuGen (needs_merging = false):

    • Files remain unmerged through most of the workflow because a large fraction of events survive the initial cuts.
    • Merging to dataset-level happens later at the HDF/NumPy stage.

The final HDF5 export writes:

  • One file per run for merged datasets (exp, corsika, muongun).
  • One file per input file for non-merged datasets (nugen).

Currently there is no dataset-level HDF5 merge in the workflow; merging happens in the analysis notebooks.

High-Level Postprocessing (Notebooks)

Two Jupyter notebooks in notebooks/ handle the transition from per-run HDF5 to analysis-ready arrays and Csky samples.

1. HDF → Merged NumPy (load_hdf.ipynb)

Inputs

  • HDF5 files from output_base_dir/08_hdf_export/{dataset_type}/{dataset_id}/.../*.h5.
  • Dataset configs from config/datasets/.

Outputs

  • Experimental data:
    • 09_merged_npy/exp/{YYYY}.npy
    • 09_merged_npy/exp/livetimes.npy
  • NuGen:
    • 09_merged_npy/nugen/{dataset_id}.npy per set.
  • CORSIKA:
    • 09_merged_npy/corsika/{dataset_id}.npy per set.

What it does

  • Loads selected keys from HDF and flattens them to structured NumPy columns (LCSC scores, RNN/TNF outputs, STV variables, MuEx energy, RA/Dec, etc.).
  • For NuGen:
    • Re-normalizes OneWeight by NEvents and optionally includes classification labels.
  • For CORSIKA:
    • Computes CorsikaWeight via simweights (GlobalFitGST by default) and normalizes per-file counts.
  • Ensures STV variables exist (fills with -1 where not applicable).
  • Merges per-run/per-file HDFs into per-year (exp) or per-dataset (MC) structured arrays.
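The OneWeight renormalization follows the usual NuGen recipe; this sketch assumes the standard OneWeight convention, and details such as the nu/nubar generation ratio are dataset-dependent and omitted here:

```python
import numpy as np

def nugen_weights(one_weight, n_events, n_files, flux):
    """Per-event weights from NuGen OneWeight (sketch of the usual recipe).

    one_weight : OneWeight column [GeV cm^2 sr]
    n_events   : NEvents per generated file
    n_files    : number of generated files in the set
    flux       : differential flux evaluated at each event's true energy
    """
    return np.asarray(one_weight) * np.asarray(flux) / (n_events * n_files)
```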

Notes

  • The notebook can be run on any node with sufficient memory and a suitable Python environment.
  • It typically reads a local config (e.g. LT_LOCAL_CONFIG) to infer output_base_dir and dataset paths.
  • Threading can be controlled via an internal NUM_THREADS parameter.

2. Merged NumPy → Csky NPY (csky_export.ipynb)

Inputs

  • Merged NPYs from 09_merged_npy/exp, 09_merged_npy/nugen.
  • DNNC sample reference NPYs (for overlap removal), if available.

Outputs

  • Versioned Csky NPY samples under 10_csky_npy/{VERSION}-starting and 10_csky_npy/{VERSION}-throughgoing.
  • DNNC-removed variants under 10_csky_npy/{VERSION}-starting-dnnc-removed and 10_csky_npy/{VERSION}-throughgoing-dnnc-removed.
  • A helper CSV 10_csky_npy/{VERSION}_runs.csv listing runs and seasons for GRL creation.

What it does

  • Applies the final selection: quality cuts plus thresholds on starting/throughgoing MLP scores.
  • Computes a simple energy-dependent pull correction for TNF sigma:
    • Fits splines to the median pull vs. MuEx energy for starting and throughgoing samples.
    • Defines tnf_sigma_corrected, used as the Csky angErr field.
  • Maps columns to Csky’s schema: run, event, subevent, time (MJD), logE, ra, dec, angErr, plus MC truth.
  • Removes overlaps with DNNC samples using event-level joins where reference files exist.
  • Writes final NPY files in versioned subdirectories.
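The event-level join used for DNNC overlap removal can be sketched as follows; field names assume the Csky schema listed above, and the set-based implementation is illustrative rather than the notebook's actual code:

```python
import numpy as np

def remove_overlap(sample, reference):
    """Drop events from `sample` whose (run, event, subevent) triple also
    appears in `reference` (sketch of the overlap removal)."""
    def keys(arr):
        # one (run, event, subevent) row per event
        return np.stack([arr["run"], arr["event"], arr["subevent"]], axis=1)
    ref = {tuple(row) for row in keys(reference)}
    mask = np.array([tuple(row) not in ref for row in keys(sample)])
    return sample[mask]
```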

GRL Generation

  • A separate tool (nusources_dataset_converters/grl) can be used to build official GRLs from the runs CSV.
  • The notebook typically prints example commands for create_dataset_grl.py with the correct paths.

Logging and Diagnostics

Snakemake automatically stores logs and metadata in its .snakemake directory. Individual rules also write log files to step-specific locations (as configured in the Snakefile). There is no need for a top-level logs/ directory.

For debugging:

  • Use -n / --dry-run to inspect the DAG and planned jobs.
  • Increase verbosity with -p (print shell commands) and -r (print reasons for rule execution).
  • Use --summary and --detailed-summary to inspect file completion status.

Extending the Workflow

To add a new processing step or modify an existing one:

  1. Add or modify a rule in Snakefile

    • Define clear input and output patterns that fit into the existing {step}/{dataset_type}/{dataset_id}/... layout.
    • Reuse DatasetManager helper functions wherever possible to avoid duplicating logic.
  2. Implement or update a script in processing_scripts/

    • Keep dependencies minimal; prefer using existing containers and environment definitions.
  3. Update profiles or local configs if resources change

    • For example, adjust CPU/GPU requests or memory in config/profiles/msu/.
  4. Test with a small dataset

    • Use a single run or a small batch with run_batch_size to validate behavior before scaling up.

This separation between selection logic and processing infrastructure should make it straightforward to evolve the workflow as models and analysis needs develop.