third_party/utf8_range: support arm neon #18126

cyb70289 · 2024-09-05T05:55:33Z

Protobuf uses the utf8_range library for UTF-8 string validation.
Currently, only the SSE implementation is integrated.
This patch adapts the utf8_range Neon implementation to protobuf.

cyb70289 · 2024-09-05T05:57:23Z

Hi, I'm the author of utf8_range, and I'm glad to see my library adopted by protobuf.
This patch adapts the utf8_range Neon implementation to protobuf. Please review. Thanks.

cyb70289 · 2024-09-18T06:10:52Z

@acozzette, could you have a look at this PR? Or can someone else help?

tonyliaoss · 2024-09-18T18:12:58Z

Hello Yibo,

I'll reassign this to @danlark1 who is our ARM SIMD expert. Thanks for making this contribution.

In the meantime, I'll approve this for integration testing.

tonyliaoss

Approve for integration testing.

(Do not submit until another approval from @danlark1)

cyb70289 · 2024-09-19T01:32:06Z

Hmm, it looks like this PR causes the Java and Ruby Linux aarch64 jobs to fail. I will check.

cyb70289 · 2024-09-19T02:24:13Z

The Ruby error should be fixed now. Please help start the CI jobs.
I'm not sure about the Java failure. Does Java use this C++ UTF-8 validation at all?

danlark1

Thanks!

third_party/utf8_range/utf8_range.c

third_party/utf8_range/utf8_range_neon.inc

cyb70289 · 2024-09-20T02:14:38Z

I'm struggling a bit with the Java Linux aarch64 job failure. It looks like it's related to this PR. Any suggestions are welcome.
https://github.com/protocolbuffers/protobuf/actions/runs/10951974066/job/30409907883?pr=18126

EDIT: managed to reproduce it locally, debugging... Fixed.

cyb70289 · 2024-09-20T05:00:36Z

These are the key changes to the original utf8_range.c SSE validation code after moving the architecture-dependent code to utf8_range_sse.inc. They might be useful for review.
The major difference is the use of an "end" pointer instead of "len", where end = data + len.

diff --git a/utf8_range.c b/utf8_range_sse.inc
index 57a2a9b..b2d3d18 100644
--- a/utf8_range.c
+++ b/utf8_range_sse.inc
@@ -1,5 +1,5 @@
 static FORCE_INLINE_ATTR inline size_t utf8_range_Validate(
-    const char* data, size_t len, int return_position) {
+    const char* data, const char* end, int return_position) {
   /* This code checks that utf-8 ranges are structurally valid 16 bytes at once
    * using superscalar instructions.
    * The mapping between ranges of codepoint and their corresponding utf-8
@@ -149,6 +149,9 @@ static FORCE_INLINE_ATTR inline size_t utf8_range_Validate(
   __m128i prev_input = _mm_set1_epi8(0);
   __m128i prev_first_len = _mm_set1_epi8(0);
   __m128i error = _mm_set1_epi8(0);
+
+  // Save buffer start address for later use
+  const char* const data_original = data;
   while (end - data >= 16) {
     const __m128i input =
         _mm_loadu_si128((const __m128i*)(data));
@@ -249,13 +252,13 @@ static FORCE_INLINE_ATTR inline size_t utf8_range_Validate(
     data += 16;
   }
   /* If we got to the end, we don't need to skip any bytes backwards */
-  if (return_position && (data - (end - len)) == 0) {
+  if (return_position && data == data_original) {
     return utf8_range_ValidateUTF8Naive(data, end, return_position);
   }
   /* Find previous codepoint (not 80~BF) */
   data -= utf8_range_CodepointSkipBackwards(_mm_extract_epi32(prev_input, 3));
   if (return_position) {
-    return (data - (end - len)) +
+    return (data - data_original) +
            utf8_range_ValidateUTF8Naive(data, end, return_position);
   }
   /* Test if there was any error */

cyb70289 · 2024-09-20T06:08:17Z

Hopefully all issues are fixed. Please help trigger CI.

danlark1 · 2024-09-20T07:46:57Z

Approval for code. Thank you a lot!

tonyliaoss

Approving again -- thanks for sending us this PR!

cyb70289 · 2024-09-24T01:49:25Z

How do I check the CI error "feedback/copybara - google internal checks FAILED"?

tonyliaoss · 2024-09-25T22:15:26Z

Hi Yibo --

This Copybara failure message means that internal integration tests still need to run before we can pull the change into Google. This CL looks good so far; we just need to get approvals internally to get it integrated into our monorepo (and then we can close this PR).

There is no action needed on your part. We've been a bit busy these few days but hopefully we can get this merged soon.

Protobuf uses utf8_range library for utf8 string validation. Currently, only SSE implementation is integrated. This patch adapts utf8_range Neon implementation to protobuf.

Closes #18126

COPYBARA_INTEGRATE_REVIEW=#18126 from cyb70289:utf8-neon 5edbcc2
FUTURE_COPYBARA_INTEGRATE_REVIEW=#18126 from cyb70289:utf8-neon 5edbcc2
PiperOrigin-RevId: 679316668

tonyliaoss · 2024-09-28T01:59:17Z

This is failing our internal integration tests.

I haven't fully debugged what's going on, but I can tell a behavior change happened in the SSE (non-neon) codepath, due to the changes that you mentioned in a previous comment, in these two places:

If I revert these two changes, the regression disappears.

Specifically it seems like the change on this line is somewhat problematic:

   /* Find previous codepoint (not 80~BF) */
   data -= utf8_range_CodepointSkipBackwards(_mm_extract_epi32(prev_input, 3));
   if (return_position) {
-    return (data - (end - len)) +
+    return (data - data_original) +
            utf8_range_ValidateUTF8Naive(data, end, return_position);
   }
   /* Test if there was any error */

end - len is not always equal to data_original. I can't quite figure out why it might be unequal though.

tonyliaoss · 2024-09-28T05:38:25Z

Oh I think I know what the problem is. data might be skipped forward due to line 182: https://github.com/protocolbuffers/protobuf/pull/18126/files#diff-4f84906404b1aa9c995fb03b21950c498c4a4b86381887686ec7c7de66fb9834L182

static FORCE_INLINE_ATTR inline size_t utf8_range_Validate(
    const char* data, size_t len, int return_position) {
  if (len == 0) return 1 - return_position;
  const char* const end = data + len;      //// <---- END IS SET HERE
  data = utf8_range_SkipAscii(data, end);  //// <---- DATA IS SKIPPED FORWARD
  /* SIMD algorithm always outperforms the naive version for any data of
     length >=16.
   */
  if (end - data < 16) {
    return (return_position ? (data - (end - len)) : 0) +
           utf8_range_ValidateUTF8Naive(data, end, return_position);
  }
#if defined(__SSE4_1__) || (defined(__ARM_NEON) && defined(__ARM_64BIT_STATE))
  return utf8_range_ValidateUTF8Simd(data, end, return_position);
#else
  return (return_position ? (data - (end - len)) : 0) +
         utf8_range_ValidateUTF8Naive(data, end, return_position);
#endif
}

If I assign const char* const data_original = data; in the first line of utf8_range_Validate, everything works as intended.

If, instead, data_original is assigned within utf8_range_ValidateUTF8Simd, and if data is skipped forward due to utf8_range_SkipAscii(data, end), then the following statement

data_original == end - len;

is not true.
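
The mismatch can be reproduced in isolation. The following is only a minimal sketch, not the library code: `skip_ascii`, `scan_simd_like`, `prefix_buggy`, and `skipped_bytes` are hypothetical stand-ins, the "validator" merely looks for the byte 0xFF (which never occurs in valid UTF-8), and the 16-byte threshold is ignored. It assumes the position-returning mode, where the result is the length of the valid prefix.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for utf8_range_SkipAscii: skip the leading ASCII run. */
static const char* skip_ascii(const char* data, const char* end) {
  assert(data <= end);
  while (data < end && (unsigned char)*data < 0x80) data++;
  return data;
}

/* Hypothetical stand-in for the SIMD routine in the first revision of the PR.
 * It records its own "start", but `data` has already been advanced by the
 * caller, so the returned offset is relative to the wrong base. */
static size_t scan_simd_like(const char* data, const char* end) {
  const char* const data_original = data; /* captured too late */
  while (data < end && (unsigned char)*data != 0xFF) data++;
  return (size_t)(data - data_original);
}

/* Same call shape as the first revision: skip ASCII, then hand off. */
static size_t prefix_buggy(const char* data, size_t len) {
  const char* const end = data + len;
  data = skip_ascii(data, end);
  return scan_simd_like(data, end);
}

/* Number of bytes the ASCII skip consumed, i.e. the size of the error. */
static size_t skipped_bytes(const char* data, size_t len) {
  const char* const end = data + len;
  return (size_t)(skip_ascii(data, end) - data);
}
```

For the 18-byte input `"abcdefgh\xC3\xA9xyz\xFFtail"`, the first 0xFF sits at offset 13. `prefix_buggy` returns 5 because the 8 ASCII bytes skipped up front are lost, and `skipped_bytes` returns exactly that missing 8.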

tonyliaoss · 2024-09-28T18:19:04Z

I'm going to rerun tests with the following patchset and see what happens:

`utf8_range.c`

@@ -178,19 +178,22 @@
 static FORCE_INLINE_ATTR inline size_t utf8_range_Validate(
     const char* data, size_t len, int return_position) {
   if (len == 0) return 1 - return_position;
+  // Save buffer start address for later use
+  const char* const data_original = data;
   const char* const end = data + len;
   data = utf8_range_SkipAscii(data, end);
   /* SIMD algorithm always outperforms the naive version for any data of
      length >=16.
    */
   if (end - data < 16) {
-    return (return_position ? (data - (end - len)) : 0) +
+    return (return_position ? (data - data_original) : 0) +
            utf8_range_ValidateUTF8Naive(data, end, return_position);
   }
 #if defined(__SSE4_1__) || (defined(__ARM_NEON) && defined(__ARM_64BIT_STATE))
-  return utf8_range_ValidateUTF8Simd(data, end, return_position);
+  return utf8_range_ValidateUTF8Simd(
+      data_original, data, end, return_position);
 #else
-  return (return_position ? (data - (end - len)) : 0) +
+  return (return_position ? (data - data_original) : 0) +
          utf8_range_ValidateUTF8Naive(data, end, return_position);
 #endif
 }

`utf8_range_neon.inc`

@@ -7,7 +7,8 @@
  */
 
 static FORCE_INLINE_ATTR inline size_t utf8_range_ValidateUTF8Simd(
-    const char* data, const char* end, int return_position) {
+    const char* const data_original, const char* data,
+    const char* end, int return_position) {
   const uint8x16_t first_len_tbl = {
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 3,
   };
@@ -57,7 +58,6 @@
   uint8x16_t prev_first_len = vdupq_n_u8(0);
   uint8x16_t error = vdupq_n_u8(0);
 
-  const char* const data_original = data;
   while (end - data >= 16) {
     const uint8x16_t input = vld1q_u8((const uint8_t*)data);

`utf8_range_sse.inc`

@@ -3,7 +3,8 @@
 #include <tmmintrin.h>
 
 static FORCE_INLINE_ATTR inline size_t utf8_range_ValidateUTF8Simd(
-    const char* data, const char* end, int return_position) {
+    const char* const data_original, const char* data,
+    const char* end, int return_position) {
   /* This code checks that utf-8 ranges are structurally valid 16 bytes at once
    * using superscalar instructions.
    * The mapping between ranges of codepoint and their corresponding utf-8
@@ -154,8 +155,6 @@
   __m128i prev_first_len = _mm_set1_epi8(0);
   __m128i error = _mm_set1_epi8(0);
 
-  // Save buffer start address for later use
-  const char* const data_original = data;
   while (end - data >= 16) {
     const __m128i input = _mm_loadu_si128((const __m128i*)(data));
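
A differential check along the following lines might guard the patched call shape against this class of regression. It is a sketch under stated assumptions, not part of the patch: every function name here is hypothetical, the stand-in scanner only looks for the byte 0xFF instead of doing real UTF-8 validation, and `scan_with_start` merely imitates the new parameter order (`data_original`, `data`, `end`).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical reference: offset of the first 0xFF byte, or len if none. */
static size_t reference_prefix(const char* data, size_t len) {
  size_t i = 0;
  while (i < len && (unsigned char)data[i] != 0xFF) i++;
  return i;
}

/* Hypothetical stand-in for utf8_range_SkipAscii. */
static const char* skip_ascii(const char* data, const char* end) {
  while (data < end && (unsigned char)*data < 0x80) data++;
  return data;
}

/* Imitates the patched signature: the buffer start is passed in by the
 * caller instead of being re-derived inside the callee. */
static size_t scan_with_start(const char* const data_original,
                              const char* data, const char* end) {
  assert(data_original <= data && data <= end);
  while (data < end && (unsigned char)*data != 0xFF) data++;
  return (size_t)(data - data_original);
}

/* Caller in the patched shape: save the start before skipping ASCII. */
static size_t patched_prefix(const char* data, size_t len) {
  const char* const data_original = data;
  const char* const end = data + len;
  data = skip_ascii(data, end);
  return scan_with_start(data_original, data, end);
}

/* Compare both paths for every ASCII prefix length in 0..max_prefix and
 * return the number of disagreements. */
static int count_mismatches(size_t max_prefix) {
  char buf[128];
  int mismatches = 0;
  assert(max_prefix <= 100); /* keep the 27 trailing bytes inside buf */
  for (size_t prefix = 0; prefix <= max_prefix; prefix++) {
    size_t n = prefix;
    memset(buf, 'a', prefix);      /* ASCII run that gets skipped */
    buf[n++] = (char)0xC3;         /* two-byte sequence stops the skip */
    buf[n++] = (char)0xA9;
    memset(buf + n, 'b', 20);
    n += 20;
    buf[n++] = (char)0xFF;         /* the byte both scanners look for */
    memset(buf + n, 'c', 4);
    n += 4;
    if (patched_prefix(buf, n) != reference_prefix(buf, n)) mismatches++;
  }
  return mismatches;
}
```

Varying the length of the ASCII prefix is the important part: with a prefix of length zero the two bases coincide, so a test that never puts ASCII in front of the first multi-byte sequence would not notice a wrong base.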

cyb70289 · 2024-09-29T01:17:47Z

Ouch! We were very lucky that the internal tests caught this bug. 😓
Thank you for the debugging.

I debugged this previously in PR #18126. There must've been some hiccup in Copybara ingestion because this patch didn't end up getting picked up. #18126 (comment) This is a fix-forward. PiperOrigin-RevId: 680734793

I debugged this previously in PR #18126. There must've been some hiccup in Copybara ingestion because this patch didn't end up getting picked up. #18126 (comment) This is a fix-forward. PiperOrigin-RevId: 680757652

cyb70289 requested a review from a team as a code owner September 5, 2024 05:55

cyb70289 requested review from acozzette and removed request for a team September 5, 2024 05:55

tonyliaoss approved these changes Sep 18, 2024

danlark1 reviewed Sep 19, 2024

cyb70289 requested a review from a team as a code owner September 20, 2024 02:08

cyb70289 requested review from JasonLunn and removed request for a team September 20, 2024 02:08

third_party/utf8_range: support arm neon

5edbcc2

tonyliaoss approved these changes Sep 23, 2024

copybara-service bot mentioned this pull request Sep 26, 2024

third_party/utf8_range: support arm neon (#18126) #18514

Closed

copybara-service bot closed this in d83ad15 Sep 30, 2024

copybara-service bot mentioned this pull request Sep 30, 2024

Fix validation bug in UTF-8 range SIMD subroutine. #18566

Merged

cyb70289 deleted the utf8-neon branch October 1, 2024 07:40
