Processed Datasets

This page summarizes all datasets processed for the Lightning Tracks selection and how each is used throughout this documentation. All processed datasets are defined via YAML files in config/datasets/, structured as follows:

  • Experimental data (exp/): config/datasets/exp/
  • NuGen simulation (nugen/): config/datasets/nugen/baseline/ and config/datasets/nugen/systematics/
  • CORSIKA (corsika/): config/datasets/corsika/ (not yet used downstream in LT results)
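
As a minimal sketch of how this layout could be traversed, the snippet below collects the YAML files under each sub-directory listed above. The function name `find_dataset_configs`, the `CATEGORIES` tuple, and the assumption that files end in `.yaml` or `.yml` are illustrative and not part of the Lightning Tracks code base.

```python
from pathlib import Path

# Sub-directories taken from the list above; all names in this sketch are hypothetical.
CATEGORIES = ("exp", "nugen/baseline", "nugen/systematics", "corsika")


def find_dataset_configs(root="config/datasets"):
    """Map each category sub-directory to a sorted list of its YAML files."""
    root = Path(root)
    configs = {}
    for category in CATEGORIES:
        folder = root / category
        if folder.is_dir():
            # Assumption: dataset definitions use a .yaml or .yml extension.
            configs[category] = sorted(
                p for p in folder.iterdir() if p.suffix in {".yaml", ".yml"}
            )
        else:
            # A missing folder is reported as an empty list rather than an error.
            configs[category] = []
    return configs
```

Calling `find_dataset_configs()` from the repository root would return a dictionary keyed by category; parsing the files themselves is left out because the YAML schema is not described on this page.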

Experimental Data

SLT refers to Starting Lightning Tracks, and TLT to Throughgoing Lightning Tracks. Table 1 lists each processed season (2011–2022) along with run counts, livetimes, and event statistics at both filter and final-selection level.

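| Season | Runs | Livetime (days) | Sample | Filter events | Filter rate (mHz) | Final events | Final rate (mHz) |
|---|---|---|---|---|---|---|---|
| 2011 | 1101 | 332.71 | SLT | 591665 | 20.582 | 49792 | 1.732 |
| 2011 | 1101 | 332.71 | TLT | 1955185 | 68.015 | 174381 | 6.066 |
| 2012 | 1089 | 323.58 | SLT | 579779 | 20.738 | 48454 | 1.734 |
| 2012 | 1089 | 323.58 | TLT | 1967744 | 70.384 | 172762 | 6.183 |
| 2013 | 1189 | 341.42 | SLT | 607606 | 20.598 | 51421 | 1.743 |
| 2013 | 1189 | 341.42 | TLT | 2058149 | 69.772 | 181755 | 6.162 |
| 2014 | 1198 | 360.48 | SLT | 641979 | 20.612 | 54167 | 1.739 |
| 2014 | 1198 | 360.48 | TLT | 2156126 | 69.227 | 191028 | 6.133 |
| 2015 | 1176 | 361.87 | SLT | 641851 | 20.529 | 54274 | 1.736 |
| 2015 | 1176 | 361.87 | TLT | 2146815 | 68.665 | 192009 | 6.141 |
| 2016 | 1116 | 353.82 | SLT | 633230 | 20.714 | 53876 | 1.762 |
| 2016 | 1116 | 353.82 | TLT | 2119401 | 69.329 | 187941 | 6.148 |
| 2017 | 1305 | 406.45 | SLT | 721487 | 20.545 | 60431 | 1.721 |
| 2017 | 1305 | 406.45 | TLT | 2386987 | 67.972 | 214839 | 6.118 |
| 2018 | 1172 | 365.39 | SLT | 649314 | 20.568 | 54719 | 1.733 |
| 2018 | 1172 | 365.39 | TLT | 2165173 | 68.584 | 193568 | 6.131 |
| 2019 | 979 | 306.51 | SLT | 558449 | 21.087 | 46892 | 1.771 |
| 2019 | 979 | 306.51 | TLT | 1885796 | 71.209 | 164445 | 6.210 |
| 2020 | 1111 | 359.99 | SLT | 641663 | 20.630 | 53745 | 1.728 |
| 2020 | 1111 | 359.99 | TLT | 2113250 | 67.943 | 190030 | 6.110 |
| 2021 | 1340 | 427.74 | SLT | 756587 | 20.472 | 64498 | 1.745 |
| 2021 | 1340 | 427.74 | TLT | 2455923 | 66.454 | 223988 | 6.061 |
| 2022 | 1503 | 478.79 | SLT | 851663 | 20.588 | 72127 | 1.744 |
| 2022 | 1503 | 478.79 | TLT | 2788839 | 67.416 | 252463 | 6.103 |
| All | 14279 | 4418.76 | SLT | 7875273 | 20.628 | 664396 | 1.740 |
| All | 14279 | 4418.76 | TLT | 26199388 | 68.624 | 2339209 | 6.127 |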
Table 1: Experimental SLT and TLT datasets used in the Lightning Tracks selection, with seasonal livetime and event statistics.
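
The rate columns in Table 1 appear consistent with the event count divided by the seasonal livetime. The short sketch below illustrates that arithmetic for the 2011 SLT row; the helper name `rate_mhz` is hypothetical, and because the tabulated livetimes are rounded to two decimals, the last digit of a recomputed rate can differ slightly from the table.

```python
SECONDS_PER_DAY = 86_400


def rate_mhz(n_events, livetime_days):
    """Average event rate in mHz for n_events observed over livetime_days."""
    return 1e3 * n_events / (livetime_days * SECONDS_PER_DAY)


# 2011 SLT values copied from Table 1.
slt_filter = rate_mhz(591_665, 332.71)
slt_final = rate_mhz(49_792, 332.71)
print(f"2011 SLT filter: {slt_filter:.3f} mHz, final: {slt_final:.3f} mHz")
```

For this row the recomputed values agree with the tabulated 20.582 mHz and 1.732 mHz to within the rounding of the livetime.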

Simulation

NuGen

Baseline Production

Our primary baseline simulation is the Snowstorm NuGen production from Simulation Request 13C4, shown in Table 2. These samples were used to train the TNF reconstruction model and were additionally split 50/50 into training and validation sets for the Lightning Tracks final-cut models.

Because TNF was trained on these events, they are considered burned for physics studies, meaning they should not be used to generate Csky NumPy files for sensitivity or performance evaluations.

The held-out validation half of each dataset continues to serve as the baseline for Data–MC Agreement validation plots.

| Flavor | \(10^2\)–\(10^4\) GeV | \(10^4\)–\(10^6\) GeV | \(10^6\)–\(10^8\) GeV |
|---|---|---|---|
| \(\nu_e\) | 22614 (E\(^{-1.5}\)) | 22613 (E\(^{-1.5}\)) | 22612 (E\(^{-1}\)) |
| \(\nu_\mu\) | 22646 (E\(^{-1.5}\)) | 22645 (E\(^{-1.5}\)) | 22644 (E\(^{-1}\)) |
| \(\nu_\tau\) | 22633 (E\(^{-1.5}\)) | 22634 (E\(^{-1.5}\)) | 22635 (E\(^{-1}\)) |
Table 2: Baseline NuGen Snowstorm production (SPICE FTP-v3) used for TNF and final-cut training/validation; dataset IDs with spectral slopes shown in parentheses.

Systematics—Nominal Ensemble

Our systematics studies rely primarily on the Snowstorm ensemble sets from Simulation Request 13D7 (see Table 3). These are the nominal NuGen samples used to produce the Csky signal MC NumPy files, and therefore form the foundation of all Lightning Tracks performance results—including sensitivities, effective areas, PSF calibration, and Data–MC agreement.

| Flavor | 20–100 GeV | \(10^2\)–\(10^4\) GeV | \(10^4\)–\(10^6\) GeV | \(10^6\)–\(10^8\) GeV |
|---|---|---|---|---|
| \(\nu_\mu\) | 22861 (E\(^{-1.5}\)) | 22852 (E\(^{-1.5}\)) | 22853 (E\(^{-1.5}\)) | 22854 (E\(^{-1}\)) |
| \(\nu_e\) | | 22855 (E\(^{-1.5}\)) | 22856 (E\(^{-1.5}\)) | 22857 (E\(^{-1}\)) |
| \(\nu_\tau\) | | 22858 (E\(^{-1.5}\)) | 22859 (E\(^{-1.5}\)) | 22860 (E\(^{-1}\)) |
Table 3: NuGen Snowstorm ensemble (SPICE FTP-v3) used as nominal signal MC for Csky performance files; dataset IDs with spectral slopes shown in parentheses.
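
The distinction between the burned baseline sets (Table 2) and the nominal ensemble sets (Table 3) lends itself to a simple guard before building Csky signal files. The following is only a sketch: the dataset IDs are copied from the two tables, while the names `BURNED_BASELINE`, `NOMINAL_ENSEMBLE`, `dataset_role`, and `assert_not_burned` are hypothetical and do not refer to existing Lightning Tracks code.

```python
# Dataset IDs copied from Tables 2 and 3; all identifiers here are hypothetical.
BURNED_BASELINE = {22612, 22613, 22614, 22633, 22634, 22635, 22644, 22645, 22646}
NOMINAL_ENSEMBLE = {
    22852, 22853, 22854, 22855, 22856, 22857, 22858, 22859, 22860, 22861,
}


def dataset_role(dataset_id):
    """Classify a NuGen dataset ID by the role described on this page."""
    if dataset_id in BURNED_BASELINE:
        return "burned-baseline"
    if dataset_id in NOMINAL_ENSEMBLE:
        return "nominal-ensemble"
    return "other"


def assert_not_burned(dataset_ids):
    """Raise ValueError if any requested dataset was used for TNF training."""
    burned = sorted(d for d in set(dataset_ids) if d in BURNED_BASELINE)
    if burned:
        raise ValueError(f"burned datasets requested for physics files: {burned}")
    return sorted(set(dataset_ids))
```

A call such as `assert_not_burned([22852, 22853])` passes through unchanged, whereas including 22646 would raise. Datasets from other tables on this page are reported as "other" rather than rejected, since their use in Csky files is not restricted here.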

Systematics—Hole Ice Variation

To evaluate hole-ice uncertainties, we processed an additional Snowstorm variant (SimReq 13CC), summarized in Table 4. These sets use unified hole-ice parameters (p0 = −0.27, p1 = −0.042) and serve as a controlled deviation from the nominal 13D7 ensemble.

| Flavor | \(10^2\)–\(10^4\) GeV | \(10^4\)–\(10^6\) GeV | \(10^6\)–\(10^8\) GeV |
|---|---|---|---|
| \(\nu_\mu\) | 22684 (E\(^{-1.5}\)) | 22685 (E\(^{-1.5}\)) | 22686 (E\(^{-1}\)) |
| \(\nu_e\) | 22687 (E\(^{-1.5}\)) | 22688 (E\(^{-1.5}\)) | 22689 (E\(^{-1}\)) |
| \(\nu_\tau\) | 22690 (E\(^{-1.5}\)) | 22691 (E\(^{-1.5}\)) | 22692 (E\(^{-1}\)) |
Table 4: Snowstorm off-baseline hole-ice systematics (IC86-2023, SPICE FTP-v3) using unified hole-ice p0/p1; dataset IDs with spectral slopes shown in parentheses.

Systematics—Historical / Legacy NuGen

We also make use of the legacy ESTES NuGen production from SimReq 13AB, shown in Table 5. This set is not a systematics ensemble by construction—it was originally generated as a baseline under SPICE 3.2.1 using an older processing chain. We treat it as a systematic variation to evaluate the impact of outdated simulation and reconstruction infrastructure. In particular, we used it to assess the effect of the Calibration Errata Bug and, more broadly, to understand how Lightning Tracks responds to “ancient” processing conditions.

| Flavor | \(10^2\)–\(10^4\) GeV | \(10^4\)–\(10^6\) GeV | \(10^6\)–\(10^8\) GeV |
|---|---|---|---|
| \(\nu_\mu\) | 21813 (E\(^{-2}\)) | 21814 (E\(^{-1.5}\)) | 21938 (E\(^{-1}\)) |
| \(\nu_e\) | | 21871 (E\(^{-1.5}\)) | |
| \(\nu_\tau\) | 21867 (E\(^{-2}\)) | 21868 (E\(^{-1.5}\)) | 21939 (E\(^{-1}\)) |
Table 5: Legacy ESTES NuGen baseline (IC86-2020, SPICE 3.2.1) repurposed as a systematics test for old simulation software; dataset IDs with spectral slopes shown in parentheses.

CORSIKA

For CORSIKA, we rely almost exclusively on the 20904 production, which is at present the only sample with sufficient statistics for meaningful background modeling, especially in the low-energy regime. It was generated with the older SPICE 3.2 ice model, but no other high-statistics muon-bundle simulation is currently available to us.

The first half of 20904 (split by run number) was used to train all classifier models that depend on atmospheric muon background—namely the final-cut classifiers and the downgoing–throughgoing filter network. The second half is held out for validation and forms the dataset used in our Data–MC Agreement comparisons.
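
A split by run number of the kind described above could look like the sketch below. The function name `split_by_run` and the exact boundary (lower half of the sorted unique run numbers for training, the rest held out) are assumptions for illustration; the actual boundary used for 20904 is not stated on this page.

```python
def split_by_run(run_numbers):
    """Split run numbers into (training_runs, validation_runs).

    Runs are de-duplicated and sorted; the lower half goes to training and the
    upper half is held out. With an odd count, the extra run is held out.
    """
    runs = sorted(set(run_numbers))
    mid = len(runs) // 2
    return runs[:mid], runs[mid:]


# Toy run numbers, not taken from the 20904 production.
train_runs, valid_runs = split_by_run([5, 1, 3, 2, 4])
print(train_runs, valid_runs)
```

Splitting on run number rather than on individual events keeps every event from a given run on the same side, so the training and validation halves stay disjoint.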