Arm: Speed up FLOAT2INT16 conversion with Neon #379
Conversation
Haven't yet had time to look at all the details, but generally looks like a good thing to add. Do you have any figures for the actual speedup this change provides?
Running the same sequence of opus_demo calls on all twelve .bit files, for both the mono and stereo setups at a 48k sampling rate as in the provided run_vectors.sh script, on a single Cortex-A55 core @ 1.8 GHz gave us a 14% performance uplift when measured with the Linux perf tool. Since testvector09.bit and testvector10.bit are the largest test inputs, we also ran 100 consecutive stereo decodings of each file at a 48k sampling rate.
I must also mention that we built Opus with the "-O2 -march=armv8-a" flags on Clang 18.1.8 to get those numbers, and the measurements were conducted (as described above) on the small core of a Google Tensor G2. I also cross-checked with a build using the same flags on GCC 14.2, where the uplift from the patch was 7% when measured with the /usr/bin/time tool for both test cases (consecutive decode runs in both mono and stereo output mode for all 12 test vectors, and 100 stereo decodings each of test vectors 9 and 10, all at a 48k sampling rate). In general, Clang builds performed ~13% better on average in absolute runtime in our cases.
   {
      out[i] = FLOAT2INT16(in[i]);
   }
}
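For readers of the thread, a minimal sketch of the kind of Neon conversion being discussed, assuming FLOAT2INT16 semantics of scaling by 32768, rounding, and saturating to the int16 range; this is not the actual patch, and the function name is made up:

```c
#include <arm_neon.h>
#include <math.h>
#include <stdint.h>

/* Sketch only: vcvtaq_s32_f32 rounds to nearest, ties away from zero
   (AArch64), and vqmovn_s32 saturates the 32-bit results to int16. */
static void float2int16_neon_sketch(const float *in, int16_t *out, int N)
{
   int i = 0;
   const float32x4_t scale = vdupq_n_f32(32768.f);
   for (; i + 4 <= N; i += 4)
   {
      float32x4_t x = vmulq_f32(vld1q_f32(in + i), scale);
      int32x4_t   y = vcvtaq_s32_f32(x);
      vst1_s16(out + i, vqmovn_s32(y));
   }
   for (; i < N; i++)                  /* scalar tail for leftover samples */
   {
      float x = in[i]*32768.f;
      x = x < -32768.f ? -32768.f : (x > 32767.f ? 32767.f : x);
      out[i] = (int16_t)lrintf(x);
   }
}
```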
Would be good to add an OPUS_CHECK_ASM block to verify that the results match the C code. You can grep for OPUS_CHECK_ASM to see how it's done in other parts of the code
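For reference, the OPUS_CHECK_ASM blocks elsewhere in the tree follow roughly this shape: compute the plain C result alongside the optimized one and assert they match. A sketch only; the scratch-buffer name and size are made up:

```c
#ifdef OPUS_CHECK_ASM
   {
      int16_t out_c[MAX_FRAME_SIZE];   /* hypothetical scratch buffer */
      int k;
      for (k = 0; k < N; k++)
         out_c[k] = FLOAT2INT16(in[k]);            /* reference C path */
      celt_assert(memcmp(out, out_c, N*sizeof(*out)) == 0);  /* needs <string.h> */
   }
#endif
```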
I would be happy to add this check; however, I'm fairly certain that in the corner case where we convert numbers exactly halfway between two integers it will round differently, as the existing variants already do among themselves.
As I've seen:
- Intel SIMD variants round towards zero (via truncation)
- the MSVC x86 assembly variant depends on the FPU rounding mode
- the fallback manual rounding method rounds towards +∞
- most other variants of float2int use round to nearest, ties to even

Using the vcvtaq_s32_f32 intrinsic on AArch64 systems rounds to nearest, ties away from zero.
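To make the difference concrete, here is a small standalone C illustration (not from the Opus sources) of how an exact .5 tie comes out under the schemes listed above; lrintf follows the current FPU rounding mode, which defaults to round-to-nearest-even, while roundf rounds ties away from zero like vcvtaq_s32_f32:

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    printf("%ld %ld\n", lrintf(2.5f), lrintf(3.5f));   /* 2 4 : ties to even        */
    printf("%.0f %.0f\n", roundf(2.5f), roundf(3.5f)); /* 3 4 : ties away from zero */
    printf("%d %d\n", (int)2.5f, (int)3.5f);           /* 2 3 : truncation          */
    return 0;
}
```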
One solution could be to further extend the unit tests already added so that they check the correctness of the conversions.
Alternatively, we could aim for the more ubiquitous behaviour at the cost of a small performance penalty.
As far as I understand, however, for digital signal processing the performance uplift of this solution outweighs the occasional off-by-one mismatch on the output and could be acceptable.
Please advise me how to proceed / what would be an acceptable solution for you.
Yeah, I see the problem. Indeed we don't really care how ties get rounded. Maybe a simple way to do the check would just be to verify that the integer value differs from the input float by less than 0.501 or so?
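A minimal sketch of what such a tolerance check could look like, assuming the conversion scales by CELT_SIG_SCALE (32768.f) and skipping samples that were clamped to the int16 range; names and placement are illustrative, not the final code:

```c
#ifdef OPUS_CHECK_ASM
   {
      int k;
      for (k = 0; k < N; k++)
      {
         float ref = in[k]*CELT_SIG_SCALE;   /* value before rounding */
         /* accept any tie-breaking rule, but catch real mismatches */
         celt_assert(out[k] == 32767 || out[k] == -32768
                     || fabsf((float)out[k] - ref) < 0.501f);  /* needs <math.h> */
      }
   }
#endif
```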
I pushed a modification containing a check that verifies the intrinsic implementation is off by at most 1.
So the patch looks good. See my comment about OPUS_CHECK_ASM. Otherwise, it would be good if you could run the automated tests to see that nothing breaks. To test on ARM, you'll just need to change this line from the random_config.sh script:

If you run into issues with this test, I can run it myself in a few days, but the fastest ARM setup I have is an RPi5, so I won't be able to run that many tests.
(force-pushed from 825c482 to 2efa9d3)
I ran the tests on my Mac M2 (within an Ubuntu VM), but many tests were failing. I found no obvious pattern as to which configuration could have caused the problem. I tested with the following sequence:

Unfortunately, I can't say much about the correctness of the patch with unsuccessful test runs. Could you please check whether you experience the same phenomenon, or correct me if my method of running the tests is wrong?
Strange. Assuming the new and old testvectors weren't reversed, can you try running just the testvectors manually? You can do so with:
I ran the […]. I reconfigured, however, my build with […].
Oh, I think I gave you the wrong URL for the old testvectors. Try this: https://www.opus-codec.org/docs/opus_testvectors.tar.gz
The new URL looks promising, […]
So I just landed some changes that appear to break your patch, but it shouldn't be too hard to update. In your patch, you had:

and it now turns out that the code reads:

The same function is now used for both fixed-point and float, which means that I believe your patch should update the code to read something like:

Among the changes that just landed are a new 24-bit integer API, including an opus_decode24() call that decodes to int32 instead of int16. So if you have any bandwidth, it may be interesting to have a separate celt_float2int32() patch to optimize that.
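If that turns out to be worth doing, a possible shape for such a helper is sketched below. The 2^23 scale and the clamp to the 24-bit range are assumptions about the new API, not taken from the Opus sources, and the function name is hypothetical:

```c
#include <arm_neon.h>
#include <math.h>
#include <stdint.h>

/* Hypothetical sketch of a celt_float2int32()-style Neon helper. */
static void float2int32_neon_sketch(const float *in, int32_t *out, int N)
{
   int i = 0;
   const float32x4_t scale = vdupq_n_f32(8388608.f);   /* 2^23, assumed   */
   const int32x4_t   lo    = vdupq_n_s32(-8388608);    /* 24-bit minimum  */
   const int32x4_t   hi    = vdupq_n_s32( 8388607);    /* 24-bit maximum  */
   for (; i + 4 <= N; i += 4)
   {
      int32x4_t y = vcvtaq_s32_f32(vmulq_f32(vld1q_f32(in + i), scale));
      y = vmaxq_s32(lo, vminq_s32(hi, y));             /* clamp the range */
      vst1q_s32(out + i, y);
   }
   for (; i < N; i++)                                  /* scalar tail     */
   {
      float x = in[i]*8388608.f;
      x = x < -8388608.f ? -8388608.f : (x > 8388607.f ? 8388607.f : x);
      out[i] = (int32_t)lrintf(x);   /* tie handling may differ, as discussed */
   }
}
```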
Using Neon for float to int conversion, and introducing a platform-specific function for converting an array of float values to int16. Also adding an appropriate unit test.

(force-pushed from 2efa9d3 to 35b5feb)
Rebased on top of your modifications and applied the suggested change.
Were you able to run the opus_build_test.sh script with your updated patch?
I was able to, but unfortunately both the patched and main branches produced failing tests. I used a 32-core Ampere eMAG workstation to run the tests via the "opus_build_tests.sh" script with the changed line as presented below, in order to get outputs from as many builds as possible:

I executed 1000 runs each for both the top of the main branch (at commit […]) and the patched branch. 441/1000 vanilla runs passed without any problems. I did not run a thorough analysis on all the outputs in search of all the causes of failure, but there didn't seem to be a single root cause, or at least it wasn't obvious to me. Please advise me if there is anything more I can do to advance the fate of this PR. I have a collection of the build outputs, which I initially used for analysis. I have collected the following four files for each of the runs on both branches and prefixed them with their run ID (e.g. […]):

I've also collected the cflags, the unique config parameters, and the testvector sets used for each failing run and listed them, together with their run IDs, in three separate files. I still have the full build outputs; if you need some other logs, I'd be happy to get those for you as well.
Can you upload the data for the failures you get? I'd need the random_config.txt file and the relevant _output.txt for where the error occurs.
Using Neon for float to int conversion and introducing a platform-specific function for converting an array of float values to int16. Also adding an appropriate unit test.