Description
This is a summary thread for a few slowdowns that I noticed when handling large-ish datasets (e.g., 15000 templates x 500 channels x 1800 samples). I'm not calling them "bottlenecks" because it is rather the sum of them together that takes extra time; none of these slowdowns changes EQcorrscan fundamentally.
I will create a PR for each point where I have a suggested solution so that we can systematically merge, improve, or reject the suggestions. I will add the links to the PRs here, but I'll need to organize a bit for that.
Here are the slowdowns (tested with Python 3.11, all in serial code):
- `tribe._group_templates`: 50x speedup
  - Problem: the O(n^2) double loop takes ~20 s for some thousand templates. Sped up with a single loop to <0.5 s.
  - PR: Speedup 01: Quicker grouping of templates #524
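A minimal sketch of the single-loop idea (hypothetical helper, not the actual code in #524): hash each template's channel set once and group via a dict, instead of comparing every template against every other one.

```python
from collections import defaultdict

def group_by_channel_set(templates):
    """Group templates that share a channel set in one O(n) pass."""
    groups = defaultdict(list)
    for template in templates:
        # frozenset of trace IDs is hashable, so it can key the dict directly
        key = frozenset(tr.id for tr in template.st)
        groups[key].append(template)
    return list(groups.values())
```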
- `preprocessing._prep_data_for_correlation`: 3x speedup for the function
  - Problem: with heterogeneous templates (i.e., many templates with different station setups), filling the templates with NaN-channels takes a long time (many copy operations in serial). Ca. 150 s -> 50 s for 1500 templates with up to 500 channels.
  - PR: Speedup 02: 3x speed up for prep_data_for_correlation with custom copy and trace-selection #525
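To illustrate the kind of copy that dominates here, a sketch (hypothetical helper; the PR's custom copy may differ): build each NaN-padded dummy trace directly from a numpy array rather than deepcopying an existing trace per missing channel.

```python
import numpy as np
from obspy import Trace

def make_nan_trace(trace_id, npts, sampling_rate, starttime):
    """Build a NaN-padded dummy trace for a channel missing from a template."""
    net, sta, loc, cha = trace_id.split(".")
    header = {"network": net, "station": sta, "location": loc, "channel": cha,
              "sampling_rate": sampling_rate, "starttime": starttime}
    return Trace(data=np.full(npts, np.nan, dtype=np.float32), header=header)
```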
- `matched_filter.match_filter` with `copy_data=True`: 3x speedup for the copy
  - Problem: copying the streams (templates and continuous data) is slow because deepcopy of streams/traces is slow. Custom copy functions speed it up by a factor of ~3: ca. 18 s previously for 300 24-h traces, now ~6 s.
  - PR (same as above): Speedup 02: 3x speed up for prep_data_for_correlation with custom copy and trace-selection #525
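A minimal sketch of such a custom copy, assuming the per-trace Stats only need a shallow copy for this use case:

```python
import copy
from obspy import Stream, Trace

def quick_copy_stream(st):
    """Cheaper stand-in for st.copy(): copy the data arrays explicitly and
    the Stats shallowly, skipping the generic deepcopy machinery."""
    traces = []
    for tr in st:
        # caveat: nested stats entries stay shared; fine as long as the
        # metadata is treated as read-only afterwards
        traces.append(Trace(data=tr.data.copy(), header=copy.copy(tr.stats)))
    return Stream(traces=traces)
```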
- `detection`, `lag_calc`, `pre_processing`: 4x / 100x speedup for trace selection
  - Problem: trace selection by trace ID from a stream is still a slowdown (even after the ~10x speedup for selecting traces from a Stream by non-wildcarded trace ID in obspy/obspy#2886). Can be sped up 4x with a simplified function and ~100x with a dict lookup for streams with a fixed order of traces.
  - PR: Speedup 04: quick trace selection #526
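The dict-lookup variant amounts to this sketch (illustrative only): build the index once while the stream is not being mutated, then select by exact trace ID in O(1).

```python
from collections import defaultdict

def build_trace_dict(st):
    """Index a stream once: trace ID -> traces (IDs can repeat in gappy data)."""
    trace_dict = defaultdict(list)
    for tr in st:
        trace_dict[tr.id].append(tr)
    return trace_dict

# Usage: each lookup is O(1) instead of a scan over the whole stream, but the
# index is only valid while the stream keeps its fixed order of traces.
# traces = build_trace_dict(stream)["NZ.WEL.10.HHZ"]
```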
- `core.match_filter.family._uniq`: 1.9x speedup
  - Problem: retrieving the unique list of detections is quicker for many detections with `list(set(...))` (1.9x speedup for 43000 detections, fastest: 3.1 s), but 1.2x slower for small sets (e.g., 430 detections; 50 ms --> 27 ms).
  - PR: Speedup 05: Retrieve unique detections in family and in `matched_filter` #527
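For context, the trade-off in a sketch (the loop variant stands in for the slow path; the set variant relies on `Detection` implementing consistent `__eq__` and `__hash__`):

```python
def uniq_loop(detections):
    """Order-preserving, but the membership test makes this O(n^2)."""
    unique = []
    for det in detections:
        if det not in unique:
            unique.append(det)
    return unique

def uniq_set(detections):
    """O(n) for large inputs, but does not preserve the input order."""
    return list(set(detections))
```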
- `core.match_filter.detect`: 1000x speedup for many calls to `family._uniq`
  - Problem: using `family._uniq` in a loop over all families is still rather slow. Checking tuples of `(detection.id, detection.detect_time, detection.detect_val)` with `numpy.unique` and avoiding the loop is 1000x faster: from 752 s to <1 s for 82000 detections.
  - PR (same as above): Speedup 05: Retrieve unique detections in family and in `matched_filter` #527
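A sketch of the vectorized deduplication, assuming `(id, detect_time, detect_val)` identifies a detection (field widths and dtypes below are illustrative):

```python
import numpy as np

def uniq_detections(detections):
    """Deduplicate detections across all families in one numpy.unique call."""
    keys = np.array(
        [(d.id, float(d.detect_time), d.detect_val) for d in detections],
        dtype=[("id", "U120"), ("time", "f8"), ("val", "f8")])
    # unique rows of the structured array; return_index points back at the
    # first occurrence of each unique key
    _, idx = np.unique(keys, return_index=True)
    return [detections[i] for i in sorted(idx)]
```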
- `matched_filter._group_detect`: 30x speedup in handling detections
  - Problem: selecting detections by template name from a big list can be slow via a loop; a dict lookup offers ~50x speedup.
  - Problem: adding `prepick` to many picks (e.g., 400k) is somewhat slow because of `UTCDateTime` overhead; adding to `pick.time.ns` directly gives a ~4x speedup.
  - PR: Speedup 06: 30x speedup for _group_detect - handling detections #528
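Sketches for both points (hypothetical helpers; #528 may implement them differently). The second assumes obspy's nanosecond constructor `UTCDateTime(ns=...)`.

```python
from collections import defaultdict
from obspy import UTCDateTime

def detections_by_template(detections):
    """One O(n) pass instead of scanning the full list per template name."""
    lookup = defaultdict(list)
    for det in detections:
        lookup[det.template_name].append(det)
    return lookup

def add_prepick(picks, prepick):
    """Shift many picks by prepick seconds on integer nanoseconds,
    avoiding per-pick UTCDateTime arithmetic overhead."""
    shift_ns = int(prepick * 1e9)
    for pick in picks:
        pick.time = UTCDateTime(ns=pick.time.ns + shift_ns)
```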
Is your feature request related to a problem? Please describe.
All the points in the list above occur in parts of the code where parallelization cannot help speed up execution. When running EQcorrscan on a big cluster, it's wasteful to spend as much time reorganizing data in serial as it takes to run the well-parallelized template-matching correlations etc.
Three more slowdowns where parallelization can help:
- `core.match_filter`: 2.5x speedup for the MAD threshold calculation
  - Problem: `np.median(np.abs(cccsum))` for each cccsum takes a lot of time when there are many cccsums. The only quicker solution I found was to parallelize the operation, which, surprisingly, already pays off for problems with more than ~15 cccsums. The speedup is only ~2.5x; even though that matters a lot for many cccsums (e.g., 2000: 20 s vs. 50 s), it feels like there is potential for even more speedup.
  - PR: Speedup 07: 2.5x speedup with parallel median / MAD calculation #531
- `detection._calculate_event`: 35% speedup in parallel
  - Problem: calling this for many detections is slow when a lot of events need to be created. Parallelization can speed this up a bit (35% for 460 detections in the test case).
- `utils.catalog_to_dd.write_correlations`: 20% speedup using shared memory
  - Problem: starting each worker for one reference event is slow because the event and stream for all neighbors of the reference event need to be pickled and sent to the worker. Some shared memory should help here (20% speedup by moving the trace.data numpy arrays into shared memory).
  - PR: Speedup 09: Use shared memory for hypo-dd write_correlations for 20 % speedup #529
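A sketch of the shared-memory hand-off (illustrative names; Python >= 3.8): the parent copies each `trace.data` array into a `multiprocessing.shared_memory` block once, and workers rebuild the arrays from the block name instead of unpickling the samples.

```python
import numpy as np
from multiprocessing import shared_memory

def share_array(data):
    """Parent side: copy a numpy array into shared memory once. Keep the
    returned block alive (and unlink it when done); pass the metadata
    tuple to the workers."""
    shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
    view = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
    view[:] = data[:]
    return shm, (shm.name, data.shape, data.dtype.str)

def attach_array(name, shape, dtype):
    """Worker side: view the shared block as a numpy array without copying."""
    shm = shared_memory.SharedMemory(name=name)
    return shm, np.ndarray(shape, dtype=dtype, buffer=shm.buf)
```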