Arm: Speed up FLOAT2INT16 conversion with Neon #379

Open · wants to merge 1 commit into main

Conversation

agosdahu (Author) commented Dec 5, 2024

Using Neon for the float-to-int conversion, and introducing a platform-specific function for converting an array of float values to int16. Also adding an appropriate unit test.
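
For illustration, a minimal standalone sketch of the technique (not the actual patch; the function name, variable names, and scalar tail below are hypothetical):

#include <arm_neon.h>
#include <stdint.h>
#include <math.h>

/* Hypothetical sketch: scale by 32768, round, then saturate-narrow
   four lanes at a time; the scalar tail handles the remainder. */
static void float2int16_neon_sketch(const float *in, int16_t *out, int cnt)
{
   int i;
   const float32x4_t scale = vdupq_n_f32(32768.f);
   for (i = 0; i + 4 <= cnt; i += 4) {
      /* vcvtaq_s32_f32 rounds to nearest, ties away from zero */
      int32x4_t v = vcvtaq_s32_f32(vmulq_f32(vld1q_f32(&in[i]), scale));
      /* vqmovn_s32 narrows to int16 with saturation, giving the clamp */
      vst1_s16(&out[i], vqmovn_s32(v));
   }
   for (; i < cnt; i++) {
      float x = in[i]*32768.f;
      x = x < -32768.f ? -32768.f : (x > 32767.f ? 32767.f : x);
      out[i] = (int16_t)lrintf(x);
   }
}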

jmvalin (Member) commented Dec 10, 2024

Haven't yet had time to look at all the details, but generally looks like a good thing to add. Do you have any figures for the actual speedup this change provides?

agosdahu (Author) commented Dec 12, 2024

Running the same sequence of opus_demo calls on all twelve .bit files, in both mono and stereo setups at a 48 kHz sampling rate as in the provided run_vectors.sh script, on a single Cortex-A55 core @ 1.8 GHz, gave us a 14% performance uplift when measured with the Linux perf tool.

Since testvector09.bit and testvector10.bit are the largest test inputs, we also ran 100 consecutive stereo decodings of each file at a 48 kHz sampling rate.
Measurements for this run gave us an 11% performance uplift in the same test environment as above.

agosdahu (Author) commented:

I should also mention that we built Opus with the "-O2 -march=armv8-a" flags on Clang 18.1.8 to get those numbers, and that the measurements were conducted (as described above) on the small core of a Google Tensor G2.

I also cross-checked with a build using the same flags on GCC 14.2; there the uplift from the patch was 7% when measured with the /usr/bin/time tool for both test cases (consecutive decode runs in both mono and stereo output mode for all 12 test vectors, and 100 stereo decodings each for test vectors 9 and 10, all at a 48 kHz sampling rate).

In general, Clang builds performed ~13% better on average in absolute runtime in our cases.
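
For context, the kind of measurement described above might look like the following (paths and file names hypothetical; perf's -r flag repeats the run and averages the counters):

   # decode-only run of opus_demo on one test vector, repeated 5 times
   perf stat -r 5 ./opus_demo -d 48000 2 testvector09.bit decoded.pcm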

   for (i = 0; i < cnt; i++) {   /* loop header reconstructed for context */
      out[i] = FLOAT2INT16(in[i]);
   }
}
jmvalin (Member):

Would be good to add an OPUS_CHECK_ASM block to verify that the results match the C code. You can grep for OPUS_CHECK_ASM to see how it's done in other parts of the code
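
For readers unfamiliar with the mechanism, the pattern used elsewhere in the tree looks roughly like the sketch below (hypothetical variable names, not code from this PR): the optimized result is compared element-wise against the plain C path.

#ifdef OPUS_CHECK_ASM
   {
      /* Recompute with the reference C conversion and assert the
         optimized path produced identical results. */
      int j;
      for (j = 0; j < cnt; j++)
         celt_assert(out[j] == FLOAT2INT16(in[j]));
   }
#endif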

agosdahu (Author):

I would be happy to add this check; however, I'm fairly certain that in the corner case where we convert values exactly halfway between two integers, it will round differently, as the existing variants already do among themselves.

As I've seen:

  • The Intel SIMD variants round towards zero (via truncation)
  • The MSVC x86 assembly variant depends on the FPU rounding mode
  • The fallback manual rounding method rounds towards +∞
  • Most other variants of float2int use round-to-nearest, ties to even

Using the vcvtaq_s32_f32 intrinsic on AArch64 systems rounds to nearest, ties away from zero.
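
A small standalone demo (not from the patch) of that tie-breaking difference:

#include <stdio.h>
#include <math.h>

int main(void)
{
   /* rintf() uses the current rounding mode; the default is
      round-to-nearest, ties to even. */
   printf("rintf(2.5f)   = %.0f\n", rintf(2.5f));   /* prints 2 */
   /* lroundf() rounds ties away from zero, matching AArch64
      FCVTAS / vcvtaq_s32_f32. */
   printf("lroundf(2.5f) = %ld\n", lroundf(2.5f));  /* prints 3 */
   return 0;
}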

One solution could be to further extend the unit tests already added, to check the correctness of the conversions.
Alternatively, we could aim for the more ubiquitous behaviour at the cost of a small performance penalty.

As far as I understand, however, for digital signal processing the performance uplift of this solution outweighs an occasional off-by-one on the output, and so it could be acceptable.

Please advise me how to proceed / what would be an acceptable solution for you.

jmvalin (Member):

Yeah, I see the problem. Indeed we don't really care how ties get rounded. Maybe a simple way to do the check would just be to verify that the integer value differs from the input float by less than 0.501 or so?
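
One possible reading of that suggestion, as a hypothetical sketch (the 32768 scale, the clamp, and the variable names are assumptions; the 0.501 bound accepts either tie-breaking direction):

#ifdef OPUS_CHECK_ASM
   {
      int j;
      for (j = 0; j < cnt; j++) {
         /* compare against the scaled, clamped input */
         float ref = in[j]*32768.f;
         ref = ref < -32768.f ? -32768.f : (ref > 32767.f ? 32767.f : ref);
         celt_assert(fabsf((float)out[j] - ref) < 0.501f);
      }
   }
#endif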

agosdahu (Author):

I pushed a modification containing a check that verifies the intrinsic implementation is off by at most 1.

jmvalin (Member) commented Feb 5, 2025

So the patch looks good. See my comment about OPUS_CHECK_ASM. Otherwise, it would be good if you could run the automated tests to see that nothing breaks.
To do that, you'll want to go to the tests directory and do:
./opus_build_test.sh <tarball> <count> <opus_testvectors> <opus_newvectors>
where:
<tarball> is the output of a "make dist" with your code
<count> is the number of random builds to run (e.g. 1000)
<opus_testvectors> points to the old testvector directory (https://www.opus-codec.org/docs/opus_testvectors.tar.gz)
<opus_newvectors> points to the new testvector directory (https://opus-codec.org/docs/opus_testvectors-rfc8251.tar.gz)

To test on ARM, you'll just need to change this line in the random_config.sh script:
arch=`echo -e "\n-march=core2\n-march=sandybridge\n-march=broadwell" | shuf -n1`
to use -march options suitable for your compiler.

If you run into issues with this test, I can run it myself in a few days, but the fastest ARM setup I have is a RPi5 so I won't be able to run that many tests.

agosdahu (Author) commented:

I ran the tests on my Mac M2 (within an Ubuntu VM), but many tests were failing.
(I hit the 10-job failure limit, and the longest run I could get, out of multiple attempts, fell just a bit short of 100 runs.)
I cross-checked with the main branch HEAD and found that most tests were failing there as well, without our modifications.

I found no obvious pattern regarding which configuration could have caused the problem.
While developing the patch, only the run_vectors.sh script was invoked, on the test vector set referenced in the README, and there were no failing cases.

I tested with the following sequence:

  • Checked out the repo at the right branch and modified the tests/random_config.sh script by replacing the line containing the -march flags with the following line:
    arch=`echo -e "\n-march=armv8-a\n-march=native" | shuf -n1`
  • Navigated to the root folder in a terminal to execute the following commands in order:
    % ./autogen.sh
    % ./configure
    % make
    % make dist
  • Copied the distribution tarball (opus-1.5.2-39-<somehash?>-dirty.tar.gz in the case of the main branch, opus-unknown.tar.gz otherwise) to the tests directory.
  • Downloaded and extracted test vectors into tests/testvecs/<old|new> respectively.
  • Navigated to the tests folder in the terminal to execute the recommended command:
    % ./opus_build_test.sh opus-1.5.2-39-g734aed05-dirty.tar.gz 1000 testvecs/old/ testvecs/new/
  • Examined the testvectors_output.txt, logs_<mono|stereo>[2].txt, and random_config.txt files in the resulting test run outputs, and saw that the cause of failure was an internal weighted error that was too high

Unfortunately I can't say much about the correctness of the patch with unsuccessful test runs.

Could you please check if you experience the same phenomenon or correct me if my method of running the tests is wrong?

jmvalin (Member) commented Feb 12, 2025

Strange. Assuming the new and old testvectors weren't reversed, can you try running just the testvectors manually? You can do so with:
./tests/run_vectors.sh <path> <testvectors> 48000
where:
<path> is the directory where opus_demo and opus_compare are located
<testvectors> is the directory that has the new testvectors (unless you built with --disable-rfc8251 in which case it would be the old testvectors)
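
A concrete invocation might look like this (paths hypothetical, assuming opus_demo and opus_compare were built in the repository root and the vectors were extracted locally):

   ./tests/run_vectors.sh . ./opus_newvectors 48000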

agosdahu (Author) commented:

I ran the run_vectors.sh script for both the old and new vector sets, and both returned "All tests have passed successfully".
Side note: when inspecting the logs_<mono|stereo>[2] files, there were often cases where interim results read "Test vector FAILS" and the internal weighted error was a relatively large number (compared to the others).

However, when I reconfigured my build with --disable-rfc8251, every run_vectors.sh run failed (with the old and new vector sets alike).

jmvalin (Member) commented Feb 13, 2025

Oh, I think I gave you the wrong URL for the old testvectors. Try this: https://www.opus-codec.org/docs/opus_testvectors.tar.gz
Sorry about that

agosdahu (Author) commented:

The new URL looks promising; run_vectors.sh now finishes successfully. Thanks!
I will run the tests in the upcoming days.

jmvalin (Member) commented Feb 14, 2025

So I just landed some changes that appear to break your patch, but it shouldn't be too hard to update. In your patch, you had:

-      for (i=0;i<ret*st->channels;i++)
-         pcm[i] = FLOAT2INT16(out[i]);
+      celt_float2int16(out, pcm, ret*st->channels, st->arch);

and it now turns out that the code reads:

          for (i=0;i<ret*st->channels;i++)
             pcm[i] = RES2INT16(out[i]);

The same function is now used for both fixed-point and float, which means that I believe your patch should update the code to read something like:

#ifdef FIXED_POINT
          for (i=0;i<ret*st->channels;i++)
             pcm[i] = RES2INT16(out[i]);
#else
          celt_float2int16(out, pcm, ret*st->channels, st->arch);
#endif

Among the changes that just landed is a new 24-bit integer API, including an opus_decode24() call that decodes to int32 instead of int16. So if you have any bandwidth, it may be interesting to have a separate celt_float2int32() patch to optimize that.
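
As a starting point, such a celt_float2int32() path might look like the sketch below; this is purely illustrative, and in particular the 2^23 scale factor and the 24-bit clamp bounds are assumptions, not taken from the new API:

#include <arm_neon.h>
#include <stdint.h>
#include <math.h>

static void float2int24_neon_sketch(const float *in, int32_t *out, int cnt)
{
   int i;
   const float32x4_t scale = vdupq_n_f32(8388608.f);   /* assumed 2^23 */
   const int32x4_t lo = vdupq_n_s32(-8388608);
   const int32x4_t hi = vdupq_n_s32(8388607);
   for (i = 0; i + 4 <= cnt; i += 4) {
      int32x4_t v = vcvtaq_s32_f32(vmulq_f32(vld1q_f32(&in[i]), scale));
      v = vmaxq_s32(lo, vminq_s32(hi, v));              /* clamp to 24 bits */
      vst1q_s32(&out[i], v);
   }
   for (; i < cnt; i++) {
      float x = in[i]*8388608.f;
      x = x < -8388608.f ? -8388608.f : (x > 8388607.f ? 8388607.f : x);
      out[i] = (int32_t)lrintf(x);
   }
}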

Commit: Using Neon for float to int conversion, and introducing platform-specific function for converting an array of float values to int16. Also adding appropriate unit test.
agosdahu (Author) commented Mar 5, 2025

Rebased on top of your modifications and applied the suggested change.

jmvalin (Member) commented Mar 5, 2025

Were you able to run the opus_build_test.sh script with your updated patch?

agosdahu (Author) commented Mar 6, 2025

> Were you able to run the opus_build_test.sh script with your updated patch?

I was able to, but unfortunately both the patched and main branches produced failing tests.

I used a 32-core Ampere eMAG workstation to run the tests via the opus_build_test.sh script, with the line changed as shown below, in order to get outputs from as many builds as possible:

-seq -w "$nb_tests" | parallel --halt now,fail=10 -j +2 -q ../random_config.sh "build_tests/run_{}" "$configure_dir" "$oldvectors" "$newvectors"
+seq -w "$nb_tests" | parallel -j "$(nproc)" -q ../random_config.sh "build_tests/run_{}" "$configure_dir" "$oldvectors" "$newvectors"

I executed 1000 runs each for both the top of the main branch (at commit c79a9bd1, which I call the "vanilla" run) and the patched branch.
I deemed a run successful when the generated random_config.txt ended in "all tests PASS", and failed when it ended with "check FAIL".
I also randomly re-ran a few of the runs to ascertain whether the outcome remained unchanged (it did).

441/1000 vanilla runs passed without any problems
454/1000 patched runs passed without any problems

I did not run a thorough analysis of all the outputs in search of the causes of failure, but there didn't seem to be a single root cause, or at least it wasn't obvious to me.
I did check, however, for any OPUS_CHECK_ASM or celt_assert related errors and found none.

Please advise if there is anything more I can do to move this PR forward.

I have a collection of the build outputs, which I initially used for my analysis.
Do you want me to share it with you?

I have collected the following four files for each of the runs on both branches and prefixed them with their run ID. (e.g.: 0001_<filename.txt>):

  • configure_output.txt
  • make_output.txt
  • makecheck_output.txt
  • random_config.txt

I've also collected the cflags, the unique configure parameters, and the test vector sets used for each failing run, and listed them, along with the run IDs, in three separate files.

I still have the full build outputs, so if you need any other logs, I'd be happy to collect those for you as well.

jmvalin (Member) commented Mar 10, 2025

Can you upload the data for the failures you get? I'd need the random_config.txt file and the relevant _output.txt for where the error occurs.
