Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

third_party/utf8_range: support arm neon #18126

Closed
wants to merge 1 commit into from
Closed

third_party/utf8_range: support arm neon #18126

wants to merge 1 commit into from

Conversation

cyb70289
Copy link
Contributor

@cyb70289 cyb70289 commented Sep 5, 2024

Protobuf uses utf8_range library for utf8 string validation.
Currently, only SSE implementation is integrated.
This patch adapts utf8_range Neon implementation to protobuf.

@cyb70289 cyb70289 requested a review from a team as a code owner September 5, 2024 05:55
@cyb70289 cyb70289 requested review from acozzette and removed request for a team September 5, 2024 05:55
@cyb70289
Copy link
Contributor Author

cyb70289 commented Sep 5, 2024

hi, I'm author of utf8_range, glad to see my lib adopted by protobuf.
This patch adapts utf8_range Neon implementation to protobuf. Please review. Thanks.

@cyb70289
Copy link
Contributor Author

@acozzette will you have a look at this pr? or someone else can help?

@tonyliaoss
Copy link
Member

Hello Yibo,

I'll reassign this to @danlark1 who is our ARM SIMD expert. Thanks for making this contribution.

In the meantime, I'll approve this for integration testing.

@tonyliaoss tonyliaoss added the 🅰️ safe for tests Mark a commit as safe to run presubmits over label Sep 18, 2024
@github-actions github-actions bot removed the 🅰️ safe for tests Mark a commit as safe to run presubmits over label Sep 18, 2024
@tonyliaoss tonyliaoss requested review from tonyliaoss and removed request for tonyliaoss September 18, 2024 18:18
Copy link
Member

@tonyliaoss tonyliaoss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve for integration testing.

(Do not submit until another approval from @danlark1)

@cyb70289
Copy link
Contributor Author

Hmm..., looks this pr leads to java and ruby linux aarch64 job failure, will check.

@cyb70289
Copy link
Contributor Author

Should have fixed ruby error. Please help start CI jobs.
Not sure of java failure. Does java use this c++ utf8 validation at all?

@tonyliaoss tonyliaoss added the 🅰️ safe for tests Mark a commit as safe to run presubmits over label Sep 19, 2024
@github-actions github-actions bot removed the 🅰️ safe for tests Mark a commit as safe to run presubmits over label Sep 19, 2024
Copy link

@danlark1 danlark1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@cyb70289 cyb70289 requested a review from a team as a code owner September 20, 2024 02:08
@cyb70289 cyb70289 requested review from JasonLunn and removed request for a team September 20, 2024 02:08
@tonyliaoss tonyliaoss added the 🅰️ safe for tests Mark a commit as safe to run presubmits over label Sep 20, 2024
@github-actions github-actions bot removed the 🅰️ safe for tests Mark a commit as safe to run presubmits over label Sep 20, 2024
@cyb70289
Copy link
Contributor Author

cyb70289 commented Sep 20, 2024

A bit struggling about the jave linux aarch64 job failure. Looks it's related to this PR. Any suggestion is welcomed.
https://github.com/protocolbuffers/protobuf/actions/runs/10951974066/job/30409907883?pr=18126

EDIT: managed to reproduce it locally, debugging... Fixed.

@cyb70289
Copy link
Contributor Author

This is the key changes to the original utf8_range.c sse validation code after moving arch dependent code to utf8_range_sse.inc. Might be useful for review.
The major difference is to use "end" pointer instead of "len". end = data + len.

diff --git a/utf8_range.c b/utf8_range_sse.inc
index 57a2a9b..b2d3d18 100644
--- a/utf8_range.c
+++ b/utf8_range_sse.inc
@@ -1,5 +1,5 @@
 static FORCE_INLINE_ATTR inline size_t utf8_range_Validate(
-    const char* data, size_t len, int return_position) {
+    const char* data, const char* end, int return_position) {
   /* This code checks that utf-8 ranges are structurally valid 16 bytes at once
    * using superscalar instructions.
    * The mapping between ranges of codepoint and their corresponding utf-8
@@ -149,6 +149,9 @@ static FORCE_INLINE_ATTR inline size_t utf8_range_Validate(
   __m128i prev_input = _mm_set1_epi8(0);
   __m128i prev_first_len = _mm_set1_epi8(0);
   __m128i error = _mm_set1_epi8(0);
+
+  // Save buffer start address for later use
+  const char* const data_original = data;
   while (end - data >= 16) {
     const __m128i input =
         _mm_loadu_si128((const __m128i*)(data));
@@ -249,13 +252,13 @@ static FORCE_INLINE_ATTR inline size_t utf8_range_Validate(
     data += 16;
   }
   /* If we got to the end, we don't need to skip any bytes backwards */
-  if (return_position && (data - (end - len)) == 0) {
+  if (return_position && data == data_original) {
     return utf8_range_ValidateUTF8Naive(data, end, return_position);
   }
   /* Find previous codepoint (not 80~BF) */
   data -= utf8_range_CodepointSkipBackwards(_mm_extract_epi32(prev_input, 3));
   if (return_position) {
-    return (data - (end - len)) +
+    return (data - data_original) +
            utf8_range_ValidateUTF8Naive(data, end, return_position);
   }
   /* Test if there was any error */

@cyb70289
Copy link
Contributor Author

Hopefully all issues are fixed. Please help trigger CI.

@danlark1
Copy link

Approval for code. Thank you a lot!

@tonyliaoss tonyliaoss added the 🅰️ safe for tests Mark a commit as safe to run presubmits over label Sep 20, 2024
@github-actions github-actions bot removed the 🅰️ safe for tests Mark a commit as safe to run presubmits over label Sep 20, 2024
Copy link
Member

@tonyliaoss tonyliaoss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving again -- thanks for sending us this PR!

@tonyliaoss tonyliaoss added the 🅰️ safe for tests Mark a commit as safe to run presubmits over label Sep 23, 2024
@github-actions github-actions bot removed the 🅰️ safe for tests Mark a commit as safe to run presubmits over label Sep 23, 2024
@cyb70289
Copy link
Contributor Author

How to check CI error feedback/copybara - google internal checks FAILED?

@tonyliaoss
Copy link
Member

Hi Yibo --

This error message about Copybara failures is saying that it's pending internal integration tests to be run before we can pull it into Google. This CL looks good so far, we just need to get approvals internally to get it integrated into our monorepo (and then we can close this PR).

There is no action needed on your part. We've been a bit busy these few days but hopefully we can get this merged soon.

copybara-service bot pushed a commit that referenced this pull request Sep 26, 2024
Protobuf uses utf8_range library for utf8 string validation.
Currently, only SSE implementation is integrated.
This patch adapts utf8_range Neon implementation to protobuf.

Closes #18126

COPYBARA_INTEGRATE_REVIEW=#18126 from cyb70289:utf8-neon 5edbcc2
FUTURE_COPYBARA_INTEGRATE_REVIEW=#18126 from cyb70289:utf8-neon 5edbcc2
PiperOrigin-RevId: 679316668
@tonyliaoss
Copy link
Member

This is failing our internal integration tests.

I haven't fully debugged what's going on, but I can tell a behavior change happened in the SSE (non-neon) codepath, due to the changes that you mentioned in a previous comment, in these two places:

Screenshot 2024-09-27 at 9 12 47 PM Screenshot 2024-09-27 at 9 12 59 PM

If I revert these two changes, the regression disappears.

Specifically it seems like the change on this line is somewhat problematic:

   /* Find previous codepoint (not 80~BF) */
   data -= utf8_range_CodepointSkipBackwards(_mm_extract_epi32(prev_input, 3));
   if (return_position) {
-    return (data - (end - len)) +
+    return (data - data_original) +
            utf8_range_ValidateUTF8Naive(data, end, return_position);
   }
   /* Test if there was any error */

end - len is not always equal to data_original. I can't quite figure out why it might be unequal though.

@tonyliaoss
Copy link
Member

Oh I think I know what the problem is. data might be skipped forward due to line 182: https://github.com/protocolbuffers/protobuf/pull/18126/files#diff-4f84906404b1aa9c995fb03b21950c498c4a4b86381887686ec7c7de66fb9834L182

static FORCE_INLINE_ATTR inline size_t utf8_range_Validate(
    const char* data, size_t len, int return_position) {
  if (len == 0) return 1 - return_position;
  const char* const end = data + len;      //// <---- END IS SET HERE
  data = utf8_range_SkipAscii(data, end);  //// <---- DATA IS SKIPPED FORWARD
  /* SIMD algorithm always outperforms the naive version for any data of
     length >=16.
   */
  if (end - data < 16) {
    return (return_position ? (data - (end - len)) : 0) +
           utf8_range_ValidateUTF8Naive(data, end, return_position);
  }
#if defined(__SSE4_1__) || (defined(__ARM_NEON) && defined(__ARM_64BIT_STATE))
  return utf8_range_ValidateUTF8Simd(data, end, return_position);
#else
  return (return_position ? (data - (end - len)) : 0) +
         utf8_range_ValidateUTF8Naive(data, end, return_position);
#endif
}

If I assign const char* const data_original = data; in the first line of utf8_range_Validate, everything works as intended.

If, instead, data_original is assigned within utf8_range_ValidateUTF8Simd, and if data is skipped forward due to utf8_range_SkipAscii(data, end), then the following statement

data_original == end - len;

is not true.

@tonyliaoss
Copy link
Member

I'm going to rerun tests with the following patchset and see what happens:

utf8_range.c
@@ -178,19 +178,22 @@
 static FORCE_INLINE_ATTR inline size_t utf8_range_Validate(
     const char* data, size_t len, int return_position) {
   if (len == 0) return 1 - return_position;
+  // Save buffer start address for later use
+  const char* const data_original = data;
   const char* const end = data + len;
   data = utf8_range_SkipAscii(data, end);
   /* SIMD algorithm always outperforms the naive version for any data of
      length >=16.
    */
   if (end - data < 16) {
-    return (return_position ? (data - (end - len)) : 0) +
+    return (return_position ? (data - data_original) : 0) +
            utf8_range_ValidateUTF8Naive(data, end, return_position);
   }
 #if defined(__SSE4_1__) || (defined(__ARM_NEON) && defined(__ARM_64BIT_STATE))
-  return utf8_range_ValidateUTF8Simd(data, end, return_position);
+  return utf8_range_ValidateUTF8Simd(
+      data_original, data, end, return_position);
 #else
-  return (return_position ? (data - (end - len)) : 0) +
+  return (return_position ? (data - data_original) : 0) +
          utf8_range_ValidateUTF8Naive(data, end, return_position);
 #endif
 }
utf8_range_neon.inc
@@ -7,7 +7,8 @@
  */
 
 static FORCE_INLINE_ATTR inline size_t utf8_range_ValidateUTF8Simd(
-    const char* data, const char* end, int return_position) {
+    const char* const data_original, const char* data,
+    const char* end, int return_position) {
   const uint8x16_t first_len_tbl = {
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 3,
   };
@@ -57,7 +58,6 @@
   uint8x16_t prev_first_len = vdupq_n_u8(0);
   uint8x16_t error = vdupq_n_u8(0);
 
-  const char* const data_original = data;
   while (end - data >= 16) {
     const uint8x16_t input = vld1q_u8((const uint8_t*)data);
utf8_range_sse.inc
@@ -3,7 +3,8 @@
 #include <tmmintrin.h>
 
 static FORCE_INLINE_ATTR inline size_t utf8_range_ValidateUTF8Simd(
-    const char* data, const char* end, int return_position) {
+    const char* const data_original, const char* data,
+    const char* end, int return_position) {
   /* This code checks that utf-8 ranges are structurally valid 16 bytes at once
    * using superscalar instructions.
    * The mapping between ranges of codepoint and their corresponding utf-8
@@ -154,8 +155,6 @@
   __m128i prev_first_len = _mm_set1_epi8(0);
   __m128i error = _mm_set1_epi8(0);
 
-  // Save buffer start address for later use
-  const char* const data_original = data;
   while (end - data >= 16) {
     const __m128i input = _mm_loadu_si128((const __m128i*)(data));

@cyb70289
Copy link
Contributor Author

Ouch! It's very lucky that internal test catches this bug.😓
Thank you for the debugging.

copybara-service bot pushed a commit that referenced this pull request Sep 30, 2024
Protobuf uses utf8_range library for utf8 string validation.
Currently, only SSE implementation is integrated.
This patch adapts utf8_range Neon implementation to protobuf.

Closes #18126

COPYBARA_INTEGRATE_REVIEW=#18126 from cyb70289:utf8-neon 5edbcc2
FUTURE_COPYBARA_INTEGRATE_REVIEW=#18126 from cyb70289:utf8-neon 5edbcc2
PiperOrigin-RevId: 679316668
copybara-service bot pushed a commit that referenced this pull request Sep 30, 2024
Protobuf uses utf8_range library for utf8 string validation.
Currently, only SSE implementation is integrated.
This patch adapts utf8_range Neon implementation to protobuf.

Closes #18126

COPYBARA_INTEGRATE_REVIEW=#18126 from cyb70289:utf8-neon 5edbcc2
FUTURE_COPYBARA_INTEGRATE_REVIEW=#18126 from cyb70289:utf8-neon 5edbcc2
PiperOrigin-RevId: 679316668
copybara-service bot pushed a commit that referenced this pull request Sep 30, 2024
I debugged this previously in PR #18126. There must've been some hiccup in
Copybara ingestion because this patch didn't end up getting picked up.
#18126 (comment)

This is a fix-forward.

PiperOrigin-RevId: 680734793
copybara-service bot pushed a commit that referenced this pull request Sep 30, 2024
I debugged this previously in PR #18126. There must've been some hiccup in
Copybara ingestion because this patch didn't end up getting picked up.
#18126 (comment)

This is a fix-forward.

PiperOrigin-RevId: 680734793
copybara-service bot pushed a commit that referenced this pull request Sep 30, 2024
I debugged this previously in PR #18126. There must've been some hiccup in
Copybara ingestion because this patch didn't end up getting picked up.
#18126 (comment)

This is a fix-forward.

PiperOrigin-RevId: 680757652
@cyb70289 cyb70289 deleted the utf8-neon branch October 1, 2024 07:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants