Support Unicode codepoint length in length() function #80

Open
@Danialova

Description

The current length() function, when applied to a string, returns the number of bytes in its UTF-8 encoding rather than the number of Unicode code points. This makes it hard to work with non-ASCII text. For example:

json := `{"text": "Hello 世界"}`
// Current behavior:
// length($.text) returns 12 (byte count)
// Desired behavior:
// length($.text) returns 8 (code point count: "Hello " = 6, "世界" = 2)
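
To make the gap between the two counts concrete, here is a small standalone Go sketch using only the standard library. The helper name byteAndRuneLen is made up for illustration and is not part of this library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneLen is a hypothetical helper for illustration only.
// It returns the UTF-8 byte length and the code point count of s.
func byteAndRuneLen(s string) (bytes int, runes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	b, r := byteAndRuneLen("Hello 世界")
	// Each of the two CJK characters takes 3 bytes in UTF-8,
	// so the byte count is 6 + 3 + 3 while the rune count is 6 + 2.
	fmt.Printf("bytes=%d runes=%d\n", b, r)
}
```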

Suggested Implementation

Either add a new function such as strlen(), or change the existing length() so that, for string values, it counts Unicode code points by calling utf8.RuneCountInString() from the standard library's unicode/utf8 package.

Example implementation approach:

// For string nodes, count code points (runes) instead of bytes.
if node.IsString() {
    return NumericNode("length", float64(utf8.RuneCountInString(node.MustString())))
}
// existing array/object length logic...
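
For anyone who wants to try the idea outside the library, the same dispatch can be sketched against values decoded by encoding/json. This is only an illustration under my own assumptions: lengthOf is a hypothetical helper, it is not the library's length() implementation, and the library's node types are deliberately not used here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"unicode/utf8"
)

// lengthOf is a hypothetical stand-in for a length() function.
// Strings are measured in code points; arrays and objects keep
// their usual element/key counts.
func lengthOf(v any) (int, error) {
	switch x := v.(type) {
	case string:
		return utf8.RuneCountInString(x), nil
	case []any:
		return len(x), nil
	case map[string]any:
		return len(x), nil
	default:
		return 0, fmt.Errorf("length: unsupported type %T", v)
	}
}

func main() {
	var doc map[string]any
	src := `{"text": "Hello 世界", "tags": ["a", "b", "c"]}`
	if err := json.Unmarshal([]byte(src), &doc); err != nil {
		panic(err)
	}
	for _, key := range []string{"text", "tags"} {
		n, err := lengthOf(doc[key])
		fmt.Println(key, n, err)
	}
}
```

Keeping the string case as a separate branch leaves the array and object behavior unchanged, which is the same shape as the snippet above.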

This would make the library more useful for international text processing and JSONPath queries involving non-ASCII strings.

Benefits

  • More intuitive behavior for string length calculations
  • Better support for international text
  • Consistency with languages that measure string length in code points rather than bytes (for example, Python 3's len())

Let me know if you would like me to provide additional examples or test cases.

Related Go documentation: https://pkg.go.dev/unicode/utf8#RuneCountInString
