Description
The current `length()` function for strings counts bytes rather than Unicode codepoints, which makes it difficult to work with non-ASCII text. For example:
```go
json := `{"text": "Hello 世界"}`

// Current behavior:
//   length($.text) returns 12 (byte count)
// Desired behavior:
//   length($.text) returns 8 (character count: "Hello " = 6, "世界" = 2)
```
Suggested Implementation
Add a new function like `strlen()`, or enhance the existing `length()` to handle Unicode properly by using `utf8.RuneCountInString()` from the standard library when operating on string values.
Example implementation approach (inside the existing length handler, with `unicode/utf8` imported):

```go
if node.IsString() {
	// Count Unicode codepoints instead of bytes for string values.
	return NumericNode("length", float64(utf8.RuneCountInString(node.MustString())))
}
// existing array/object length logic...
```
This would make the library more useful for international text processing and JSONPath queries involving non-ASCII strings.
Benefits
- More intuitive behavior for string length calculations
- Better support for international text
- Consistency with how most programming languages handle string lengths
Let me know if you would like me to provide additional examples or test cases.
Related Go documentation: https://pkg.go.dev/unicode/utf8#RuneCountInString