Processed Datasets

This page summarizes all datasets processed for the Lightning Tracks selection and how each is used throughout this documentation. All processed datasets are defined via YAML files in config/datasets/, structured as follows:

Experimental data (exp/): config/datasets/exp/
NuGen simulation (nugen/): config/datasets/nugen/baseline/ and config/datasets/nugen/systematics/
CORSIKA (corsika/): config/datasets/corsika/ (not yet used downstream in LT results)
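As a quick way to see which dataset definitions exist under this layout, the following minimal sketch lists the YAML files per category. The function name `list_dataset_configs` and the `.yaml` file extension are assumptions made for illustration; only the directory names come from the list above.

```python
from pathlib import Path

# Subdirectories of config/datasets/ named in the list above.
CATEGORIES = ("exp", "nugen/baseline", "nugen/systematics", "corsika")


def list_dataset_configs(root="config/datasets"):
    """Map each category subdirectory to the sorted names of its YAML files.

    Assumes dataset definitions use the ".yaml" extension (hypothetical).
    A missing subdirectory simply yields an empty list.
    """
    root = Path(root)
    return {
        category: sorted(p.name for p in (root / category).glob("*.yaml"))
        for category in CATEGORIES
    }


if __name__ == "__main__":
    for category, files in list_dataset_configs().items():
        print(f"{category}: {len(files)} dataset file(s)")
```

Run from the repository root, this prints one line per category with the number of dataset files found.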

Experimental Data

SLT refers to Starting Lightning Tracks, and TLT to Throughgoing Lightning Tracks. Table 1 lists each processed season (2011–2022) along with run counts, livetimes, and event statistics at both filter and final-selection level.

Table 1

Season	Runs / Livetime (days)	Sample	Filter events	Filter rate (mHz)	Final events	Final rate (mHz)
2011	1101	SLT	591665	20.582	49792	1.732
	332.71	TLT	1955185	68.015	174381	6.066
2012	1089	SLT	579779	20.738	48454	1.734
	323.58	TLT	1967744	70.384	172762	6.183
2013	1189	SLT	607606	20.598	51421	1.743
	341.42	TLT	2058149	69.772	181755	6.162
2014	1198	SLT	641979	20.612	54167	1.739
	360.48	TLT	2156126	69.227	191028	6.133
2015	1176	SLT	641851	20.529	54274	1.736
	361.87	TLT	2146815	68.665	192009	6.141
2016	1116	SLT	633230	20.714	53876	1.762
	353.82	TLT	2119401	69.329	187941	6.148
2017	1305	SLT	721487	20.545	60431	1.721
	406.45	TLT	2386987	67.972	214839	6.118
2018	1172	SLT	649314	20.568	54719	1.733
	365.39	TLT	2165173	68.584	193568	6.131
2019	979	SLT	558449	21.087	46892	1.771
	306.51	TLT	1885796	71.209	164445	6.210
2020	1111	SLT	641663	20.630	53745	1.728
	359.99	TLT	2113250	67.943	190030	6.110
2021	1340	SLT	756587	20.472	64498	1.745
	427.74	TLT	2455923	66.454	223988	6.061
2022	1503	SLT	851663	20.588	72127	1.744
	478.79	TLT	2788839	67.416	252463	6.103
All	14279	SLT	7875273	20.628	664396	1.740
	4418.76	TLT	26199388	68.624	2339209	6.127

Table 1: Experimental SLT and TLT datasets used in the Lightning Tracks selection, with seasonal livetime and event statistics.
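The rate columns in Table 1 are consistent with a simple average rate, i.e. the event count divided by the livetime converted to seconds. The short sketch below reproduces the 2011 SLT entries under that assumption; the helper name `rate_mhz` is illustrative and not part of the processing code.

```python
SECONDS_PER_DAY = 86_400


def rate_mhz(n_events: int, livetime_days: float) -> float:
    """Average event rate in mHz for an event count and a livetime in days."""
    livetime_s = livetime_days * SECONDS_PER_DAY
    return 1e3 * n_events / livetime_s


# 2011 season, SLT sample (event counts and livetime copied from Table 1).
filter_rate = rate_mhz(591_665, 332.71)
final_rate = rate_mhz(49_792, 332.71)
print(f"filter: {filter_rate:.3f} mHz, final: {final_rate:.3f} mHz")
# prints "filter: 20.582 mHz, final: 1.732 mHz"
```

The same calculation applied to any other row gives a quick cross-check of the tabulated rates against the listed livetime.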

Simulation

NuGen

Baseline Production

Our primary baseline simulation is the Snowstorm NuGen production from Simulation Request 13C4, shown in Table 2. These samples were used to train the TNF reconstruction model and were additionally split 50/50 into training and validation sets for the Lightning Tracks final-cut models.

Because TNF was trained on these events, they are considered burned for physics studies, meaning they should not be used to generate Csky NumPy files for sensitivity or performance evaluations.

The held-out validation half of the 50/50 split continues to serve as the baseline for the Data–MC Agreement validation plots.

Table 2

Flavor	\(10^2\)–\(10^4\) GeV	\(10^4\)–\(10^6\) GeV	\(10^6\)–\(10^8\) GeV
\(\nu_e\)	22614 (E\(^{-1.5}\))	22613 (E\(^{-1.5}\))	22612 (E\(^{-1}\))
\(\nu_\mu\)	22646 (E\(^{-1.5}\))	22645 (E\(^{-1.5}\))	22644 (E\(^{-1}\))
\(\nu_\tau\)	22633 (E\(^{-1.5}\))	22634 (E\(^{-1.5}\))	22635 (E\(^{-1}\))

Table 2: Baseline NuGen Snowstorm production (SPICE FTP-v3) used for TNF and final-cut training/validation; dataset IDs with spectral slopes shown in parentheses.
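One way to keep the burned samples out of physics products is a small guard that rejects their dataset IDs before any Csky file is built. The sketch below is only an illustration of that rule: the names `BURNED_DATASET_IDS` and `check_not_burned` are hypothetical, and the IDs are copied from Table 2.

```python
# Dataset IDs copied from Table 2 (SimReq 13C4 baseline production).
BURNED_DATASET_IDS = frozenset({
    22612, 22613, 22614,  # nu_e
    22644, 22645, 22646,  # nu_mu
    22633, 22634, 22635,  # nu_tau
})


def check_not_burned(dataset_ids):
    """Return the IDs as a list, or raise ValueError if any of them is burned."""
    dataset_ids = list(dataset_ids)
    burned = sorted(set(dataset_ids) & BURNED_DATASET_IDS)
    if burned:
        raise ValueError(f"burned datasets requested for physics use: {burned}")
    return dataset_ids
```

For example, `check_not_burned([22852, 22853, 22854])` passes because those IDs belong to the nominal 13D7 ensemble in Table 3, whereas `check_not_burned([22645])` raises a `ValueError`.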

Systematics: Nominal Ensemble

Our systematics studies rely primarily on the Snowstorm ensemble sets from Simulation Request 13D7 (see Table 3). These are the nominal NuGen samples used to produce the Csky signal MC NumPy files, and therefore form the foundation of all Lightning Tracks performance results—including sensitivities, effective areas, PSF calibration, and Data–MC agreement.

Table 3

Flavor	20–100 GeV	\(10^2\)–\(10^4\) GeV	\(10^4\)–\(10^6\) GeV	\(10^6\)–\(10^8\) GeV
\(\nu_\mu\)	22861 (E\(^{-1.5}\))	22852 (E\(^{-1.5}\))	22853 (E\(^{-1.5}\))	22854 (E\(^{-1}\))
\(\nu_e\)		22855 (E\(^{-1.5}\))	22856 (E\(^{-1.5}\))	22857 (E\(^{-1}\))
\(\nu_\tau\)		22858 (E\(^{-1.5}\))	22859 (E\(^{-1.5}\))	22860 (E\(^{-1}\))

Table 3: NuGen Snowstorm ensemble (SPICE FTP-v3) used as nominal signal MC for Csky performance files; dataset IDs with spectral slopes shown in parentheses.

Systematics: Hole Ice Variation

To evaluate hole-ice uncertainties, we processed an additional Snowstorm variant (SimReq 13CC), summarized in Table 4. These sets use unified hole-ice parameters (p0 = −0.27, p1 = −0.042) and serve as a controlled deviation from the nominal 13D7 ensemble.

Table 4

Flavor	\(10^2\)–\(10^4\) GeV	\(10^4\)–\(10^6\) GeV	\(10^6\)–\(10^8\) GeV
\(\nu_\mu\)	22684 (E\(^{-1.5}\))	22685 (E\(^{-1.5}\))	22686 (E\(^{-1}\))
\(\nu_e\)	22687 (E\(^{-1.5}\))	22688 (E\(^{-1.5}\))	22689 (E\(^{-1}\))
\(\nu_\tau\)	22690 (E\(^{-1.5}\))	22691 (E\(^{-1.5}\))	22692 (E\(^{-1}\))

Table 4: Snowstorm off-baseline hole-ice systematics (IC86-2023, SPICE FTP-v3) using unified hole-ice p0/p1; dataset IDs with spectral slopes shown in parentheses.

Systematics: Historical / Legacy NuGen

We also make use of the legacy ESTES NuGen production from SimReq 13AB, shown in Table 5. This set is not a systematics ensemble by construction—it was originally generated as a baseline under SPICE 3.2.1 using an older processing chain. We treat it as a systematic variation to evaluate the impact of outdated simulation and reconstruction infrastructure. In particular, we used it to assess the effect of the Calibration Errata Bug and, more broadly, to understand how Lightning Tracks responds to “ancient” processing conditions.

Table 5

Flavor	\(10^2\)–\(10^4\) GeV	\(10^4\)–\(10^6\) GeV	\(10^6\)–\(10^8\) GeV
\(\nu_\mu\)	21813 (E\(^{-2}\))	21814 (E\(^{-1.5}\))	21938 (E\(^{-1}\))
\(\nu_e\)		21871 (E\(^{-1.5}\))
\(\nu_\tau\)	21867 (E\(^{-2}\))	21868 (E\(^{-1.5}\))	21939 (E\(^{-1}\))

Table 5: Legacy ESTES NuGen baseline (IC86-2020, SPICE 3.2.1) repurposed as a systematics test for old simulation software; dataset IDs with spectral slopes shown in parentheses.

CORSIKA

For CORSIKA, we rely almost exclusively on the 20904 production, which is at present the only sample with sufficient statistics for meaningful background modeling, especially in the low-energy regime. Although it is an older set generated with the SPICE 3.2 ice model, no other high-statistics muon-bundle simulation is currently available to us.

The first half of 20904 (split by run number) was used to train all classifier models that depend on atmospheric muon background—namely the final-cut classifiers and the downgoing–throughgoing filter network. The second half is held out for validation and forms the dataset used in our Data–MC Agreement comparisons.
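A split by run number of this kind can be sketched as follows. The text does not state exactly how the halfway point was chosen, so the sketch assumes the sorted list of unique run numbers is cut at its midpoint; the function name `split_by_run` and the toy run numbers are illustrative only.

```python
def split_by_run(run_numbers):
    """Split unique run numbers into a first (training) and second (validation) half.

    Assumption: the cut is placed at the midpoint of the sorted unique runs.
    With an odd number of runs, the extra run goes to the validation half.
    """
    runs = sorted(set(run_numbers))
    midpoint = len(runs) // 2
    return set(runs[:midpoint]), set(runs[midpoint:])


# Toy run numbers standing in for the real 20904 run list.
train_runs, val_runs = split_by_run(range(10))
assert train_runs.isdisjoint(val_runs)
print(sorted(train_runs), sorted(val_runs))
```

Splitting on run number rather than on individual events keeps every event from a given run on the same side of the split, so the training and validation halves share no runs.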