Arm: Speed up FLOAT2INT16 conversion with Neon #379

Open · wants to merge 1 commit into main

Conversation

agosdahu (Author) commented Dec 5, 2024

Using Neon for the float-to-int conversion, and introducing a platform-specific function for converting an array of float values to int16. Also adding an appropriate unit test.
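
For illustration, a minimal standalone sketch of the technique (not the actual patch; the function name, variable names, and scalar tail below are hypothetical):

#include <arm_neon.h>
#include <stdint.h>
#include <math.h>

/* Hypothetical sketch: scale by 32768, round, then saturate-narrow
   four lanes at a time; the scalar tail handles the remainder. */
static void float2int16_neon_sketch(const float *in, int16_t *out, int cnt)
{
   int i;
   const float32x4_t scale = vdupq_n_f32(32768.f);
   for (i = 0; i + 4 <= cnt; i += 4) {
      /* vcvtaq_s32_f32 rounds to nearest, ties away from zero */
      int32x4_t v = vcvtaq_s32_f32(vmulq_f32(vld1q_f32(&in[i]), scale));
      /* vqmovn_s32 narrows to int16 with saturation, giving the clamp */
      vst1_s16(&out[i], vqmovn_s32(v));
   }
   for (; i < cnt; i++) {
      float x = in[i]*32768.f;
      x = x < -32768.f ? -32768.f : (x > 32767.f ? 32767.f : x);
      out[i] = (int16_t)lrintf(x);
   }
}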

jmvalin (Member) commented Dec 10, 2024

Haven't yet had time to look at all the details, but generally looks like a good thing to add. Do you have any figures for the actual speedup this change provides?

agosdahu (Author) commented Dec 12, 2024

Running the same sequence of opus_demo calls on all twelve .bit files, in both mono and stereo setups at a 48 kHz sampling rate as in the provided run_vectors.sh script, on a single Cortex-A55 core @ 1.8 GHz, gave us a 14% performance uplift when measured with the Linux perf tool.

Since testvector09.bit and testvector10.bit are the largest test inputs, we also ran 100 consecutive stereo decodings of each file at a 48 kHz sampling rate.
Measurements for this run gave us an 11% performance uplift in the same test environment as above.

agosdahu (Author) commented:

I should also mention that we built Opus with the "-O2 -march=armv8-a" flags on Clang 18.1.8 to get those numbers, and that the measurements were conducted (as described above) on the small core of a Google Tensor G2.

I also cross-checked with a build using the same flags on GCC 14.2; there the uplift from the patch was 7% when measured with the /usr/bin/time tool for both test cases (consecutive decode runs in both mono and stereo output mode for all 12 test vectors, and 100 stereo decodings each for test vectors 9 and 10, all at a 48 kHz sampling rate).

In general, Clang builds performed ~13% better on average in absolute runtime in our cases.
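
For context, the kind of measurement described above might look like the following (paths and file names hypothetical; perf's -r flag repeats the run and averages the counters):

   # decode-only run of opus_demo on one test vector, repeated 5 times
   perf stat -r 5 ./opus_demo -d 48000 2 testvector09.bit decoded.pcm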

   for (i = 0; i < cnt; i++) {   /* loop header reconstructed for context */
      out[i] = FLOAT2INT16(in[i]);
   }
}
jmvalin (Member):

Would be good to add an OPUS_CHECK_ASM block to verify that the results match the C code. You can grep for OPUS_CHECK_ASM to see how it's done in other parts of the code
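
For readers unfamiliar with the mechanism, the pattern used elsewhere in the tree looks roughly like the sketch below (hypothetical variable names, not code from this PR): the optimized result is compared element-wise against the plain C path.

#ifdef OPUS_CHECK_ASM
   {
      /* Recompute with the reference C conversion and assert the
         optimized path produced identical results. */
      int j;
      for (j = 0; j < cnt; j++)
         celt_assert(out[j] == FLOAT2INT16(in[j]));
   }
#endif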

agosdahu (Author):

I would be happy to add this check; however, I'm fairly certain that in the corner case where we convert values exactly halfway between two integers, it will round differently, as the existing variants already do among themselves.

As I've seen:

  • The Intel SIMD variants round towards zero (via truncation)
  • The MSVC x86 assembly variant depends on the FPU rounding mode
  • The fallback manual rounding method rounds towards +∞
  • Most other variants of float2int use round-to-nearest, ties to even

Using the vcvtaq_s32_f32 intrinsic on AArch64 systems rounds to nearest, ties away from zero.
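
A small standalone demo (not from the patch) of that tie-breaking difference:

#include <stdio.h>
#include <math.h>

int main(void)
{
   /* rintf() uses the current rounding mode; the default is
      round-to-nearest, ties to even. */
   printf("rintf(2.5f)   = %.0f\n", rintf(2.5f));   /* prints 2 */
   /* lroundf() rounds ties away from zero, matching AArch64
      FCVTAS / vcvtaq_s32_f32. */
   printf("lroundf(2.5f) = %ld\n", lroundf(2.5f));  /* prints 3 */
   return 0;
}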

One solution could be to further extend the unit tests already added, to check the correctness of the conversions.
Alternatively, we could aim for the more ubiquitous behaviour at the cost of a small performance penalty.

As far as I understand, however, for digital signal processing the performance uplift of this solution outweighs an occasional off-by-one on the output, and so it could be acceptable.

Please advise me how to proceed / what would be an acceptable solution for you.

jmvalin (Member):

Yeah, I see the problem. Indeed we don't really care how ties get rounded. Maybe a simple way to do the check would just be to verify that the integer value differs from the input float by less than 0.501 or so?
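
One possible reading of that suggestion, as a hypothetical sketch (the 32768 scale, the clamp, and the variable names are assumptions; the 0.501 bound accepts either tie-breaking direction):

#ifdef OPUS_CHECK_ASM
   {
      int j;
      for (j = 0; j < cnt; j++) {
         /* compare against the scaled, clamped input */
         float ref = in[j]*32768.f;
         ref = ref < -32768.f ? -32768.f : (ref > 32767.f ? 32767.f : ref);
         celt_assert(fabsf((float)out[j] - ref) < 0.501f);
      }
   }
#endif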

agosdahu (Author):

I pushed a modification containing a check that verifies the intrinsic implementation is off by at most 1.

jmvalin (Member) commented Feb 5, 2025

So the patch looks good. See my comment about OPUS_CHECK_ASM. Otherwise, it would be good if you could run the automated tests to see that nothing breaks.
To do that, you'll want to go to the tests directory and do:
./opus_build_test.sh <tarball> <count> <opus_testvectors> <opus_newvectors>
where:
<tarball> is the output of a "make dist" with your code
<count> is the number of random builds to run (e.g. 1000)
<opus_testvectors> points to the old testvector directory (https://www.opus-codec.org/docs/opus_testvectors.tar.gz)
<opus_newvectors> points to the new testvector directory (https://opus-codec.org/docs/opus_testvectors-rfc8251.tar.gz)

To test on ARM, you'll just need to change this line in the random_config.sh script:
arch=`echo -e "\n-march=core2\n-march=sandybridge\n-march=broadwell" | shuf -n1`
to use -march options suitable for your compiler.

If you run into issues with this test, I can run it myself in a few days, but the fastest ARM setup I have is a RPi5 so I won't be able to run that many tests.

agosdahu (Author) commented:

I ran the tests on my Mac M2 (within an Ubuntu VM), but many tests were failing.
(I hit the 10-job failure limit, and the longest run I could get, out of multiple attempts, fell just a bit short of 100 runs.)
I cross-checked with the main branch HEAD and found that most tests were failing there as well, without our modifications.

I found no obvious pattern regarding which configuration could have caused the problem.
While developing the patch, only the run_vectors.sh script was invoked, on the test vector set referenced in the README, and there were no failing cases.

I tested with the following sequence:

  • Checked out the repo at the right branch and modified the tests/random_config.sh script by replacing the line containing the -march flags with the following line:
    arch=`echo -e "\n-march=armv8-a\n-march=native" | shuf -n1`
  • Navigated to the root folder in a terminal to execute the following commands in order:
    % ./autogen.sh
    % ./configure
    % make
    % make dist
  • Copied the distribution tarball (opus-1.5.2-39-<somehash?>-dirty.tar.gz in the case of the main branch, opus-unknown.tar.gz otherwise) to the tests directory.
  • Downloaded and extracted test vectors into tests/testvecs/<old|new> respectively.
  • Navigated to the tests folder in the terminal to execute the recommended command:
    % ./opus_build_test.sh opus-1.5.2-39-g734aed05-dirty.tar.gz 1000 testvecs/old/ testvecs/new/
  • Examined the testvectors_output.txt, logs_<mono|stereo>[2].txt, and random_config.txt files in the resulting test run outputs, and saw that the cause of failure was an internal weighted error that was too high

Unfortunately I can't say much about the correctness of the patch with unsuccessful test runs.

Could you please check if you experience the same phenomenon or correct me if my method of running the tests is wrong?

jmvalin (Member) commented Feb 12, 2025

Strange. Assuming the new and old testvectors weren't reversed, can you try running just the testvectors manually? You can do so with:
./tests/run_vectors.sh <path> <testvectors> 48000
where:
<path> is the directory where opus_demo and opus_compare are located
<testvectors> is the directory that has the new testvectors (unless you built with --disable-rfc8251 in which case it would be the old testvectors)
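
A concrete invocation might look like this (paths hypothetical, assuming opus_demo and opus_compare were built in the repository root and the vectors were extracted locally):

   ./tests/run_vectors.sh . ./opus_newvectors 48000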

agosdahu (Author) commented:

I ran the run_vectors.sh script for both the old and new vector sets, and both returned "All tests have passed successfully".
Side note: when inspecting the logs_<mono|stereo>[2] files, there were often cases where interim results read "Test vector FAILS" and the internal weighted error was a relatively large number (compared to the others).

However, when I reconfigured my build with --disable-rfc8251, every run_vectors.sh run failed (with the old and new vector sets alike).

jmvalin (Member) commented Feb 13, 2025

Oh, I think I gave you the wrong URL for the old testvectors. Try this: https://www.opus-codec.org/docs/opus_testvectors.tar.gz
Sorry about that

agosdahu (Author) commented:

The new URL looks promising; run_vectors.sh now finishes successfully. Thanks!
I will run the tests in the upcoming days.

jmvalin (Member) commented Feb 14, 2025

So I just landed some changes that appear to break your patch, but it shouldn't be too hard to update. In your patch, you had:

-      for (i=0;i<ret*st->channels;i++)
-         pcm[i] = FLOAT2INT16(out[i]);
+      celt_float2int16(out, pcm, ret*st->channels, st->arch);

and it now turns out that the code reads:

          for (i=0;i<ret*st->channels;i++)
             pcm[i] = RES2INT16(out[i]);

The same function is now used for both fixed-point and float, which means that I believe your patch should update the code to read something like:

#ifdef FIXED_POINT
          for (i=0;i<ret*st->channels;i++)
             pcm[i] = RES2INT16(out[i]);
#else
          celt_float2int16(out, pcm, ret*st->channels, st->arch);
#endif

Among the changes that just landed is a new 24-bit integer API, including an opus_decode24() call that decodes to int32 instead of int16. So if you have any bandwidth, it may be interesting to have a separate celt_float2int32() patch to optimize that.
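
As a starting point, such a celt_float2int32() path might look like the sketch below; this is purely illustrative, and in particular the 2^23 scale factor and the 24-bit clamp bounds are assumptions, not taken from the new API:

#include <arm_neon.h>
#include <stdint.h>
#include <math.h>

static void float2int24_neon_sketch(const float *in, int32_t *out, int cnt)
{
   int i;
   const float32x4_t scale = vdupq_n_f32(8388608.f);   /* assumed 2^23 */
   const int32x4_t lo = vdupq_n_s32(-8388608);
   const int32x4_t hi = vdupq_n_s32(8388607);
   for (i = 0; i + 4 <= cnt; i += 4) {
      int32x4_t v = vcvtaq_s32_f32(vmulq_f32(vld1q_f32(&in[i]), scale));
      v = vmaxq_s32(lo, vminq_s32(hi, v));              /* clamp to 24 bits */
      vst1q_s32(&out[i], v);
   }
   for (; i < cnt; i++) {
      float x = in[i]*8388608.f;
      x = x < -8388608.f ? -8388608.f : (x > 8388607.f ? 8388607.f : x);
      out[i] = (int32_t)lrintf(x);
   }
}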

Commit: Using Neon for float to int conversion, and introducing platform-specific function for converting an array of float values to int16. Also adding appropriate unit test.
agosdahu (Author) commented Mar 5, 2025

Rebased on top of your modifications and applied the suggested change.

jmvalin (Member) commented Mar 5, 2025

Were you able to run the opus_build_test.sh script with your updated patch?

agosdahu (Author) commented Mar 6, 2025

> Were you able to run the opus_build_test.sh script with your updated patch?

I was able to, but unfortunately both the patched and main branches produced failing tests.

I used a 32-core Ampere eMAG workstation to run the tests via the opus_build_test.sh script, with the line changed as shown below, in order to get outputs from as many builds as possible:

-seq -w "$nb_tests" | parallel --halt now,fail=10 -j +2 -q ../random_config.sh "build_tests/run_{}" "$configure_dir" "$oldvectors" "$newvectors"
+seq -w "$nb_tests" | parallel -j "$(nproc)" -q ../random_config.sh "build_tests/run_{}" "$configure_dir" "$oldvectors" "$newvectors"

I executed 1000 runs each for both the top of the main branch (at commit c79a9bd1, which I call the "vanilla" run) and the patched branch.
I deemed a run successful when the generated random_config.txt ended in "all tests PASS", and failed when it ended with "check FAIL".
I also randomly re-ran a few of the runs to ascertain whether the outcome remained unchanged (it did).

441/1000 vanilla runs passed without any problems
454/1000 patched runs passed without any problems

I did not run a thorough analysis of all the outputs in search of the causes of failure, but there didn't seem to be a single root cause, or at least it wasn't obvious to me.
I did check, however, for any OPUS_CHECK_ASM or celt_assert related errors and found none.

Please advise if there is anything more I can do to move this PR forward.

I have a collection of the build outputs, which I initially used for my analysis.
Do you want me to share it with you?

I have collected the following four files for each of the runs on both branches and prefixed them with their run ID. (e.g.: 0001_<filename.txt>):

  • configure_output.txt
  • make_output.txt
  • makecheck_output.txt
  • random_config.txt

I've also collected the cflags, the unique configure parameters, and the test vector sets used for each failing run, and listed them, along with the run IDs, in three separate files.

I still have the full build outputs, so if you need any other logs, I'd be happy to collect those for you as well.

jmvalin (Member) commented Mar 10, 2025

Can you upload the data for the failures you get? I'd need the random_config.txt file and the relevant _output.txt for where the error occurs.
