Performance studies of software tools for an end-user HEP data analysis workflow: Uproot vs RDataFrame

Dung Hoang (a), Adriano Di Florio (b,d), Alexis Pompili (b,c), Umit Sozbilir (b,c), Vincenzo Mastrapasqua (b,c)
(a) Rhodes College, (b) INFN-Bari, (c) Università degli studi di Bari Aldo Moro, (d) Politecnico di Bari
Unpublished

Introduction

The Python programming language is currently characterized by an enormous and rapidly growing “ecosystem” of software tools that support the analysis and visualization of “Big Data,” i.e. datasets that are too large or complex to be dealt with by traditional data-processing methods. Examples include NumPy (numerical computing), pandas (data manipulation), Matplotlib (visualization), and TensorFlow (machine learning). Given the huge and increasingly complex volume of data produced by large-scale High Energy Physics (HEP) experiments, a complete Python-based HEP data analysis workflow can benefit greatly from this ecosystem.

In this project, we explored different Python-based approaches to handle the standard ROOT format common to all HEP data. With reference to a representative HEP workflow taken as use case, we measured the performance of Uproot, a library for reading and writing ROOT files in pure Python and NumPy, and of RDataFrame, ROOT's modern high-level interface for efficient data analysis. To match the performance of a traditional C++ ROOT workflow, we also adopted parallel processing in our study.

Data sample and hardware setup

The ROOT files used in our study, which have a total size of about 128 GB, were taken from CMS Open Data.

Measurements were run on three different machines at the ReCaS-Bari computing center:

  1. wn-gpu-8-3-22 (256 CPUs, ~2000 GB RAM)
  2. wn-1-8-9 (64 CPUs, ~250 GB RAM)
  3. tesla04 (32 CPUs, ~250 GB RAM)

In these tests:

  1. these machines were dedicated exclusively to the performance tests, in order to obtain unbiased results;
  2. the data files were accessed remotely (unless otherwise stated); only on tesla04 could the data also be stored locally, which made it possible to compare the performance with locally versus remotely stored data;
  3. the ROOT version used is 6.24; to study RDF parallelization we also performed a comparison with 6.26 (on tesla04).

Analysis workflow

The analysis task consists of measuring the total runtime of the following sequence of operations (a minimal code sketch is given after the list):

  1. accessing the TTree inside the input ROOT files;
  2. applying specific filters (selection criteria) to the variables stored in the TTree, in order to extract a known physical signal associated with the decay chain of a beauty meson (B^0_s → J/ψ φ, J/ψ → μμ, φ → KK);
  3. converting the TBranch containing the selected invariant mass into a NumPy array for further analysis tasks (signal fitting, etc.).
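
As an illustration, the following minimal sketch reproduces these three steps with Uproot; the file path, tree name, branch names and cut values are hypothetical placeholders rather than the actual ones used in our measurements.

    import numpy as np
    import uproot

    # 1. access the TTree inside the input ROOT file (hypothetical path and tree name)
    tree = uproot.open("BsToJpsiPhi.root")["rootuple/ntuple"]

    # 2. read the needed branches and apply the selection (hypothetical branches and cuts)
    arrays = tree.arrays(["B_mass", "Jpsi_mass", "phi_mass", "mu1_pt", "mu2_pt"],
                         library="np")
    mask = (
        (np.abs(arrays["Jpsi_mass"] - 3.097) < 0.15)   # J/psi -> mumu mass window (GeV)
        & (np.abs(arrays["phi_mass"] - 1.020) < 0.01)  # phi -> KK mass window (GeV)
        & (arrays["mu1_pt"] > 4.0)
        & (arrays["mu2_pt"] > 4.0)
    )

    # 3. the selected invariant mass as a NumPy array, ready for signal fitting
    selected_mass = arrays["B_mass"][mask]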

Parallel processing

Instead of letting one CPU handle the whole dataset, we used multiple CPUs running in parallel to process data chunks concurrently.

Without parallelism, the time needed for the analysis task to be completed would be:

\[T = t_u\cdot S\]

where \(T\) is the total runtime, \(t_u\) is the time to process one unit of data, and \(S\) is the overall data size.

By implementing parallel processing, this equation becomes:

\[ T = C(S, P) + t_u\cdot \lceil{S \over P}\rceil \]

where \(C(S, P)\) is the time to access/split/merge the data and \(P\) is the number of concurrent processes. The fraction \(S/P\) is rounded up to make sure that no data is lost.
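
As a purely illustrative evaluation of this model (the numbers below are hypothetical, not measured values):

    import math

    t_u = 2.0   # hypothetical time to process one unit of data (e.g. one file), in seconds
    S   = 128   # hypothetical data size, in the same units (e.g. number of files)
    P   = 32    # number of concurrent processes
    C   = 20.0  # hypothetical constant overhead C(S, P) to access/split/merge the data

    serial_time   = t_u * S                     # T = t_u * S             -> 256 s
    parallel_time = C + t_u * math.ceil(S / P)  # T = C + t_u * ceil(S/P) ->  28 s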

Parallel processing with Uproot

Uproot does not provide a built-in option for implicit parallel processing. To enable parallelism, therefore, we manually split the data into smaller chunks and then create subprocesses with Python’s multiprocessing module. Since Uproot allows users to specify the number of events from each TTree to be processed, the smallest unit of data assigned to each subprocess can be one event. As a result, data can be distributed evenly among subprocesses, and the runtime function for Uproot would be:

\[ T = C(S, P) + t_e\cdot \lceil{S_e \over P}\rceil \]

where \(t_e\) is the time to process one event and \(S_e\) is the number of events in the dataset.
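
In practice, this event-level splitting can be sketched as follows with Python’s multiprocessing module; the file, tree and branch names are again hypothetical, and each subprocess reads only its own slice of events.

    import math
    from multiprocessing import Pool

    import numpy as np
    import uproot

    FILE   = "BsToJpsiPhi.root"   # hypothetical input file
    TREE   = "rootuple/ntuple"    # hypothetical tree path
    BRANCH = "B_mass"             # hypothetical branch to extract

    def process_chunk(entry_start, entry_stop):
        # each subprocess opens the file on its own and reads [entry_start, entry_stop)
        tree = uproot.open(FILE)[TREE]
        return tree.arrays([BRANCH], entry_start=entry_start, entry_stop=entry_stop,
                           library="np")[BRANCH]

    if __name__ == "__main__":
        n_processes = 32
        n_events = uproot.open(FILE)[TREE].num_entries
        chunk = math.ceil(n_events / n_processes)   # ceil(S_e / P): no event is lost
        ranges = [(start, min(start + chunk, n_events))
                  for start in range(0, n_events, chunk)]

        with Pool(n_processes) as pool:   # splitting and merging contribute to C(S, P)
            parts = pool.starmap(process_chunk, ranges)
        selected_mass = np.concatenate(parts)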

Preliminary studies: local vs remote data access

We checked that accessing the data locally rather than remotely (on the same machine, tesla04) has a marginal impact on the runtime: the effect is minor for RDF (with a slight preference, as expected, for local storage) and negligible for Uproot.

Thus the main studies were carried out by accessing the data remotely, without biasing the performance comparison.

Preliminary studies: ROOT 6.24 vs 6.26

RDF has a built-in option for parallel processing: EnableImplicitMT(n_threads). With ROOT 6.24 we experienced an unstable behavior, both when letting this method autonomously optimize the number of threads and when setting n_threads explicitly. We verified that this issue disappears after upgrading to ROOT 6.26.
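
For reference, enabling RDF's implicit multi-threading takes only one call; in this minimal sketch the file, tree and selection are hypothetical.

    import ROOT

    n_threads = 32
    ROOT.EnableImplicitMT(n_threads)   # RDF splits its event loop over n_threads threads

    rdf = ROOT.RDataFrame("rootuple/ntuple", "BsToJpsiPhi.root")
    b_mass = rdf.Filter("B_mass > 5.0").AsNumpy(["B_mass"])["B_mass"]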

In the main studies we used ROOT 6.24; instead of relying on RDF's EnableImplicitMT(), we therefore used Python’s multiprocessing, just as we did with Uproot.
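
A minimal sketch of this file-level approach is given below, assuming hypothetical file names and selection; note that here the smallest unit of work handed to a subprocess is one whole file.

    from multiprocessing import Pool

    import numpy as np

    FILES = ["data_{:03d}.root".format(i) for i in range(128)]   # hypothetical file names
    TREE  = "rootuple/ntuple"                                    # hypothetical tree path

    def process_file(path):
        # ROOT is imported inside the worker so that each subprocess owns its ROOT session;
        # each worker runs a single-threaded RDF event loop over one file
        import ROOT
        rdf = ROOT.RDataFrame(TREE, path)
        return rdf.Filter("B_mass > 5.0").AsNumpy(["B_mass"])["B_mass"]

    if __name__ == "__main__":
        with Pool(64) as pool:   # 64 concurrent subprocesses
            parts = pool.map(process_file, FILES)
        selected_mass = np.concatenate(parts)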

Results: Runtime vs Processes

By keeping the size of the data constant (128 files, for a total of 128 GB, corresponding to about \(4.5 \cdot 10^8\) events), we obtain the following runtime as a function of the number of processes, for each server used:

The plots show that the runtime decreases as the number of processes increases and saturates once it approaches the roughly constant time \(C(S, P)\) needed for initialization and final merging.
While the Uproot curves are fairly smooth, a ladder pattern can be appreciated in the RDF ones. This reflects the fact that RDF’s smallest unit of data is one file and not one event.

Results: Speed-up vs Processes

From the data in the previous plots we can estimate the speed-up, defined as the ratio between the runtime with 1 process and the runtime with n processes.
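
In the notation used above, this can be written as

\[ \text{speed-up}(P) = {T(1) \over T(P)} \]

where \(T(P)\) denotes the total runtime measured with \(P\) concurrent processes.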

It can be observed that there is a significant performance boost due to parallel processing, both with Uproot and with RDF. On the most powerful server (wn-gpu-8-3-22), for instance, Uproot with multiprocessing is almost 9 times faster than with a single process. In the case of RDataFrame, we achieved an even more striking speed-up: at 128 subprocesses, it was more than 50 times faster.

As expected, the saturation effect occurs when running with more subprocesses than the number of CPUs available on the machine. This is clearly visible on wn-1-8-9 and tesla04, for which the number of CPUs is smaller. With Uproot, however, the performance boost always peaks at around 32 processes, regardless of the number of available CPUs. This seems to happen because, for Uproot, \(C(S, P)\) is large enough to contribute dominantly to the runtime. Conversely, in the case of RDF, \(C(S, P)\) contributes only marginally to the runtime, and the performance gain can grow until the maximum number of available CPUs is reached.

Results: Runtime vs Size

Another way of studying the performance consists of varying the size of the data while keeping the number of processes constant (32 or 64 were chosen):

The total runtime is linearly proportional to the data size, as expected. In the RDF curves, the observed ladder pattern again reflects the fact that RDF is parallelized over files instead of events (reminder: each file has a size of ~1 GB).

Conclusions

By enabling parallelism, the performance of both Uproot and RDataFrame is greatly enhanced. While Uproot runs faster with a small number of processes, RDataFrame reaches better performance as the number of processes increases, which suggests using it when a machine with a large number of CPUs is available. Note that the RDF performance has been optimized by splitting the original dataset evenly into many files with the same number of events. In real-life situations, such a condition can be reached either by using Uproot or at the initial merging step applied to the rootuples obtained from the distributed experiment framework.