WIP: Speed up a few slowdowns when handling large datasets #522

Open
@flixha

This is a summary thread for a few slowdowns that I noticed when handling large-ish datasets (e.g., 15000 templates x 500 channels x 1800 samples). I'm not calling them "bottlenecks" because no single one dominates; it is the sum of them that takes extra time. None of these slowdowns changes EQcorrscan fundamentally.

I will create a PR for each point where I have a suggested solution, so that we can systematically merge, improve, or reject the suggestions. I will add the links to the PRs here, but I need to organize things a bit first.

Here are some slowdowns (tested with Python 3.11, all in serial code):

Is your feature request related to a problem? Please describe.
All the points in the list above occur in parts of the code where parallelization cannot help speed up execution. When running EQcorrscan on a big cluster, it is wasteful to spend as much time reorganizing data in serial as it takes to run the well-parallelized template-matching correlations etc.

Three more slowdowns where parallelization can help:

    1. core.match_filter: 2.5x speedup for MAD threshold calculation
    • Computing np.median(np.abs(cccsum)) for each cccsum takes a lot of time when cccsums contains many of them. The only quicker solution I found was to parallelize the operation, which surprisingly already pays off for problems with more than ~15 cccsum. The speedup is only ~2.5x, so although that matters a lot for many cccsum (e.g., 2000: 20 s vs. 50 s), it feels like there is potential for more.
    • PR: Speedup 07: 2.5x speedup with parallel median / MAD calculation #531
    2. detection._calculate_event: 35 % speedup in parallel
    • Calling this for many detections is slow when a lot of events need to be created. Parallelization speeds this up a bit (35 % for 460 detections in the test case).
    3. utils.catalog_to_dd.write_correlations: 20 % speedup using some shared memory
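
To illustrate the first point, here is a minimal sketch of the per-row np.median(np.abs(cccsum)) calculation, once in serial and once spread over a thread pool. This is not the code from the PR: the function names `mad_serial` and `mad_parallel`, the `max_workers` value, and the array sizes are all made up for illustration, and using threads rather than processes is an assumption. Whether threads actually help depends on how much of the NumPy work runs outside the GIL on a given setup, so it is worth timing on real data.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def mad_serial(cccsums):
    """Median of the absolute values of each cccsum, one row at a time."""
    return np.array([np.median(np.abs(cccsum)) for cccsum in cccsums])


def mad_parallel(cccsums, max_workers=4):
    """Same calculation, with the rows distributed over a thread pool.

    Hypothetical helper for illustration; not part of EQcorrscan.
    """
    def _mad(cccsum):
        return np.median(np.abs(cccsum))

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves the input order of the rows
        return np.array(list(executor.map(_mad, cccsums)))


# Synthetic stand-in for cccsums: 50 rows of 10,000 samples (made-up sizes)
rng = np.random.default_rng(42)
cccsums = rng.standard_normal((50, 10_000))

serial = mad_serial(cccsums)
parallel = mad_parallel(cccsums)
```

Both functions return one value per cccsum in the same order, so the parallel version can be swapped in wherever the serial list comprehension is used.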
