kaizen: add equals-ignore-case pattern (#340)
* kaizen: add equals-ignore-case pattern

addresses: #186
Signed-off-by: Tim Bray <[email protected]>

* reset CI/CD to not fail on slowdown

Signed-off-by: Tim Bray <[email protected]>

* enable github token for benchmarker

Signed-off-by: Tim Bray <[email protected]>

---------

Signed-off-by: Tim Bray <[email protected]>
timbray authored Jul 22, 2024
1 parent 8da5067 commit 6d34146
Showing 18 changed files with 959 additions and 9 deletions.
3 changes: 2 additions & 1 deletion .github/workflows/benchmarks.yml
@@ -42,6 +42,7 @@ jobs:
with:
name: Go Benchmark
tool: "go"
github-token: ${{ secrets.GITHUB_TOKEN }}

# Compare results against json from cache
output-file-path: output.txt
@@ -53,7 +54,7 @@ jobs:
# Alert on regression
alert-threshold: "120%"
fail-on-alert: false
- comment-on-alert: false
+ comment-on-alert: true

# Disable github pages, for now.
auto-push: false
2 changes: 1 addition & 1 deletion .github/workflows/go-unit-tests.yaml
@@ -63,7 +63,7 @@ jobs:
env:
COVER_OPTS: ${{ matrix.coveropts }}
GOFLAGS: ${{ matrix.goflags }}
- run: go test $COVER_OPTS ./... | tparse -all -notests -format markdown >> $GITHUB_STEP_SUMMARY
+ run: go test $COVER_OPTS | tparse -all -notests -format markdown >> $GITHUB_STEP_SUMMARY

- if: steps.codecov-enabled.outputs.files_exists == 'true'
name: Upload Codecov Report
20 changes: 19 additions & 1 deletion CONTRIBUTING.md
@@ -86,7 +86,7 @@ all the available tests with race-detection enabled, and
is an essential step before submitting any changes:

```shell
- go test -race -v -count 1 ./...
+ go test -race -v -count 1
```

The following command runs the Go linter; submissions
@@ -101,6 +101,24 @@ in all the Quamina subdirectories so you’ll have to do
this by hand. `golangci-lint` has a home page with
instructions for installing it.

### Rebuilding the Case-folding Table

Quamina's `ignore-case` patterns rely on mappings found
in the generated source file `case_folding.go`. Quamina
includes a program called `code_gen` in the `code_gen/`
directory. There is a `Makefile` whose only function is
to check the mapping file and rebuild it if it is older
than three months, because a Unicode version release may
have added mappings.

As a result, it is good practice, at some point while building
and submitting a PR, to type `make`, which will rebuild and
re-run `code_gen`; that program will display a message saying
whether or not it rebuilt the case-folding mappings. If it did
rebuild those mappings, please include the generated
`case_folding.go` source in your commit
and PR.

## Reporting Bugs and Creating Issues

When opening a new issue, try to roughly follow the commit message format
5 changes: 5 additions & 0 deletions Makefile
@@ -0,0 +1,5 @@
# The only purpose of this makefile is to run code_gen/code_gen, which will rebuild the case_folding.go file if
# it is more than three months out of date
casefold:
@ cd code_gen && go build && cd ..
@ code_gen/code_gen
12 changes: 12 additions & 0 deletions PATTERNS.md
@@ -179,6 +179,18 @@ The following Shellstyle Patterns would match it:
{"img": [ {"shellstyle": "https://example.com/*.jpg"} ] }
{"img": [ {"shellstyle": "https://example.*/*.jpg"} ] }
```
### Equals-Ignore-Case Pattern

The Pattern Type of an Equals-Ignore-Case pattern is `equals-ignore-case`
and its value **MUST** be a string. Quamina attempts to match with
case folding in effect, as discussed in Section 3.13 of the Unicode
Standard. Quamina uses the case-folding mappings provided in the file
CaseFolding.txt in the Unicode Character Database to generate its mappings.
Note that case-folding is highly dependent on the specifics of the language
in use, and in certain locales this default mapping may not produce satisfactory
results, although results are good for ASCII and "simple" characters from
other alphabets.
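
To make the matching semantics above concrete, here is a small illustration using Go's standard `strings.EqualFold`, which compares strings under simple Unicode case folding. This is only an analogy: Quamina compiles the pattern into an automaton using its own generated table rather than calling `EqualFold`, so edge cases may differ. `matchesIgnoreCase` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesIgnoreCase is a hypothetical stand-in for the semantics of an
// equals-ignore-case pattern; it is not part of Quamina's API.
func matchesIgnoreCase(pattern, value string) bool {
	return strings.EqualFold(pattern, value)
}

func main() {
	// ASCII letters fold as expected.
	fmt.Println(matchesIgnoreCase("VIEW FROM 15th FLOOR", "View from 15th Floor"))
	// "Simple" one-to-one mappings in other alphabets also work.
	fmt.Println(matchesIgnoreCase("МОСКВА", "москва"))
	// One-to-many foldings (German sharp s vs. "SS") are outside simple folding.
	fmt.Println(matchesIgnoreCase("STRASSE", "straße"))
}
```

The last comparison is the kind of language-specific case where a default one-to-one mapping may not give the result a reader of that language expects.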

## EventBridge Patterns

Quamina’s Patterns are inspired by those offered by
7 changes: 7 additions & 0 deletions README.md
@@ -123,6 +123,13 @@ The following Patterns would match it:
}
}
```
```json
{
"Image": {
"Title": [ { "equals-ignore-case": "VIEW FROM 15th FLOOR" } ]
}
}
```
The syntax and semantics of Patterns are fully specified
in [Patterns in Quamina](PATTERNS.md).

477 changes: 477 additions & 0 deletions case_folding.go

Large diffs are not rendered by default.

34 changes: 33 additions & 1 deletion cl2_test.go
@@ -129,7 +129,35 @@ var (
" }\n" +
"}",
}
shellstyleMatches = []int{490, 713, 43, 2540, 1}
shellstyleMatches = []int{490, 713, 43, 2540, 1}
equalsIgnoreCaseRules = []string{
"{\n" +
" \"properties\": {\n" +
" \"STREET\": [ { \"equals-ignore-case\": \"jefferson\" } ]\n" +
" }\n" +
"}",
"{\n" +
" \"properties\": {\n" +
" \"STREET\": [ { \"equals-ignore-case\": \"bEaCh\" } ]\n" +
" }\n" +
"}",
"{\n" +
" \"properties\": {\n" +
" \"STREET\": [ { \"equals-ignore-case\": \"HyDe\" } ]\n" +
" }\n" +
"}",
"{\n" +
" \"properties\": {\n" +
" \"STREET\": [ { \"equals-ignore-case\": \"CHESTNUT\" } ]\n" +
" }\n" +
"}",
"{\n" +
" \"properties\": {\n" +
" \"ST_TYPE\": [ { \"equals-ignore-case\": \"st\" } ]\n" +
" }\n" +
"}",
}
equalsIgnoreCaseMatches = []int{131, 211, 1758, 825, 116386}
/* will add when we have numeric
complexArraysRules := []string{
"{\n" +
@@ -235,6 +263,10 @@ func TestRulerCl2(t *testing.T) {
bm = newBenchmarker()
bm.addRules(shellstyleRules, shellstyleMatches, true)
fmt.Printf("SHELLSTYLE events/sec: %.1f\n", bm.run(t, lines))

bm = newBenchmarker()
bm.addRules(equalsIgnoreCaseRules, equalsIgnoreCaseMatches, true)
fmt.Printf("EQUALS_IGNORE-CASE events/sec: %.1f\n", bm.run(t, lines))
}

type benchmarker struct {
122 changes: 122 additions & 0 deletions code_gen/build_casefolding_table.go

Some generated files are not rendered by default.
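
Since the generator's source is not shown, the following is a rough sketch of how lines in the Unicode `CaseFolding.txt` format (`code; status; mapping; # name`) could be reduced to rune pairs, keeping only status `C` entries as the comment in `monocase.go` describes. The function name `parseCommonFoldings` and the one-directional map are assumptions; this is not the actual `build_casefolding_table.go`.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseCommonFoldings reads text in CaseFolding.txt format and returns a map
// from each code point with status "C" to its folded code point.
// (Hypothetical helper; the real generator may work differently.)
func parseCommonFoldings(text string) (map[rune]rune, error) {
	pairs := make(map[rune]rune)
	scanner := bufio.NewScanner(strings.NewReader(text))
	for scanner.Scan() {
		line := scanner.Text()
		// Drop trailing comments; whole-line comments become empty.
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i]
		}
		fields := strings.Split(line, ";")
		if len(fields) < 3 || strings.TrimSpace(fields[1]) != "C" {
			continue // blank line, comment, or a non-"C" status
		}
		from, err := strconv.ParseUint(strings.TrimSpace(fields[0]), 16, 32)
		if err != nil {
			return nil, fmt.Errorf("bad code point %q: %w", fields[0], err)
		}
		to, err := strconv.ParseUint(strings.TrimSpace(fields[2]), 16, 32)
		if err != nil {
			return nil, fmt.Errorf("bad mapping %q: %w", fields[2], err)
		}
		pairs[rune(from)] = rune(to)
	}
	return pairs, scanner.Err()
}

func main() {
	sample := "# sample lines in CaseFolding.txt format\n" +
		"0041; C; 0061; # LATIN CAPITAL LETTER A\n" +
		"00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S\n" +
		"0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE\n"
	pairs, err := parseCommonFoldings(sample)
	if err != nil {
		panic(err)
	}
	for from, to := range pairs {
		fmt.Printf("%U -> %U\n", from, to)
	}
}
```

Because `makeMonocaseFA` looks up whichever case appears in the pattern, a real table presumably needs entries in both directions; this sketch records only the direction given in the file.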

2 changes: 1 addition & 1 deletion core_matcher.go
@@ -213,7 +213,7 @@ func (m *coreMatcher) matchesForFields(fields []Field) ([]X, error) {
// over-equipped M2 MBPro, but probably not on some miserable cloud event-handling worker.
// Conclusion: I dunno. I left the analyze() func in but for now, don't use its results in
// production.
- var bufs *bufpair = &bufpair{}
+ var bufs = &bufpair{}
/*
if cmFields.nfaMeta.maxOutDegree < 2 {
bufs = &bufpair{}
2 changes: 2 additions & 0 deletions core_matcher_test.go
@@ -219,6 +219,7 @@ func TestExerciseMatching(t *testing.T) {
`{"Image": { "Title": [ {"anything-but": ["Pikachu", "Eevee"] } ] } }`,
`{"Image": { "Thumbnail": { "Url": [ { "prefix": "https:" } ] } } }`,
`{"Image": { "Thumbnail": { "Url": [ "a", { "prefix": "https:" } ] } } }`,
`{"Image": { "Title": [ { "equals-ignore-case": "VIEW FROM 15th FLOOR" } ] } }`,
}

var err error
@@ -276,6 +277,7 @@ func TestExerciseMatching(t *testing.T) {
t.Error("add one of many: " + err.Error())
}
}
fmt.Println("MS: " + matcherStats(m))
matches, err := m.matchesForJSONEvent([]byte(j))
if err != nil {
t.Error("m4J on all: " + err.Error())
89 changes: 89 additions & 0 deletions monocase.go
@@ -0,0 +1,89 @@
package quamina

import (
"errors"
"fmt"
"unicode/utf8"
)

func readMonocaseSpecial(pb *patternBuild, valsIn []typedVal) (pathVals []typedVal, err error) {
t, err := pb.jd.Token()
if err != nil {
return
}
pathVals = valsIn

monocaseString, ok := t.(string)
if !ok {
err = errors.New("value for 'equals-ignore-case' must be a string")
return
}
val := typedVal{
vType: monocaseType,
val: `"` + monocaseString + `"`,
}
pathVals = append(pathVals, val)

// has to be } or tokenizer will throw error
_, err = pb.jd.Token()
return
}

// makeMonocaseFA builds a FA to match "ignore-case" patterns. The Unicode Standard specifies algorithm 3.13,
// relying on the file CaseFolding.txt in the Unicode Character Database. This function uses the "Simple" flavor
// of casefolding, i.e. the lines in CaseFolding.txt that are marked with "C". The discussion in the Unicode
// standard doesn't mention this, but the algorithm essentially replaces upper-case characters with lower-case
// equivalents.
// We need to exercise caution to keep from creating states wastefully. For "CAT", after matching '"',
// you transition on either 'c' or 'C' but in this particular case you want to transition to the same
// next state. Note that there are many characters in Unicode where the upper and lower case forms are
// multi-byte and in fact not even the same number of bytes. So in that case you need two paths forward that step
// through the bytes of each form and then rejoin to arrive at a state. Also note
// that in many cases the upper/lower case versions of a rune have leading bytes in common
func makeMonocaseFA(val []byte, pp printer) (*smallTable, *fieldMatcher) {
fm := newFieldMatcher()
index := 0
table := newSmallTable() // start state
startTable := table
var nextStep *faNext
for index < len(val) {
var orig, alt []byte
r, width := utf8.DecodeRune(val[index:])
orig = val[index : index+width]
altRune, ok := caseFoldingPairs[r]
if ok {
alt = make([]byte, utf8.RuneLen(altRune))
utf8.EncodeRune(alt, altRune)
}
nextStep = &faNext{states: []*faState{{table: newSmallTable()}}}
pp.labelTable(nextStep.states[0].table, fmt.Sprintf("On %d, alt=%v", val[index], alt))
if alt == nil {
// easy case, no casefolding issues. We should maybe try to coalesce these
// no-casefolding sections and only call makeFAFragment once for all of them
origFA := makeFAFragment(orig, nextStep, pp)
table.addByteStep(orig[0], origFA)
} else {
// two paths to next state
// but they might have a common prefix
var commonPrefix int
for commonPrefix = 0; orig[commonPrefix] == alt[commonPrefix]; commonPrefix++ {
prefixNext := &faNext{states: []*faState{{table: newSmallTable()}}}
table.addByteStep(orig[commonPrefix], prefixNext)
table = prefixNext.states[0].table
pp.labelTable(table, fmt.Sprintf("common prologue on %v", orig[commonPrefix]))
}
// now build automata for the orig and alt versions of the char
// TODO: make sure that makeFAFragment works with length == 1
origFA := makeFAFragment(orig[commonPrefix:], nextStep, pp)
altFA := makeFAFragment(alt[commonPrefix:], nextStep, pp)
table.addByteStep(orig[commonPrefix], origFA)
table.addByteStep(alt[commonPrefix], altFA)
}
table = nextStep.states[0].table
index += width
}
laststate := &faState{table: newSmallTable(), fieldTransitions: []*fieldMatcher{fm}}
lastStep := &faNext{states: []*faState{laststate}}
nextStep.states[0].table.addByteStep(valueTerminator, lastStep)
return startTable, fm
}
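
The comment on `makeMonocaseFA` notes that the two case forms of a rune may share leading UTF-8 bytes and may even differ in length. The standalone sketch below shows that byte-level situation. It uses `unicode.SimpleFold` from the standard library as a stand-in for the generated `caseFoldingPairs` table (an assumption; the two may disagree for some runes), and `encode` and `commonPrefixLen` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// encode returns the UTF-8 bytes of r.
func encode(r rune) []byte {
	buf := make([]byte, utf8.RuneLen(r))
	utf8.EncodeRune(buf, r)
	return buf
}

// commonPrefixLen counts the leading bytes shared by a and b, stopping at
// the end of the shorter slice.
func commonPrefixLen(a, b []byte) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

func main() {
	// 'C' is single-byte in both cases; U+00C9 and its lower case share a
	// lead byte; U+023A and its lower case differ in encoded length.
	for _, r := range []rune{'C', '\u00C9', '\u023A'} {
		alt := unicode.SimpleFold(r) // stand-in for caseFoldingPairs[r]
		orig, folded := encode(r), encode(alt)
		fmt.Printf("%c [% x]  %c [% x]  shared leading bytes: %d\n",
			r, orig, alt, folded, commonPrefixLen(orig, folded))
	}
}
```

The shared bytes correspond to the "common prologue" states in the function above; after them the automaton forks into one branch per case form and rejoins at the next state.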

1 comment on commit 6d34146

@github-actions

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Go Benchmark'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.20.

| Benchmark suite | Current: 6d34146 | Previous: 78e2ec8 | Ratio |
|-|-|-|-|
| BenchmarkCityLots | 6836 ns/op 823 B/op 33 allocs/op | 5592 ns/op 773 B/op 31 allocs/op | 1.22 |
| BenchmarkCityLots - ns/op | 6836 ns/op | 5592 ns/op | 1.22 |
| Benchmark_JsonFlattner_Evaluate_ContextFields | 1217 ns/op 96 B/op 8 allocs/op | 726.2 ns/op 56 B/op 4 allocs/op | 1.68 |
| Benchmark_JsonFlattner_Evaluate_ContextFields - ns/op | 1217 ns/op | 726.2 ns/op | 1.68 |
| Benchmark_JsonFlattner_Evaluate_ContextFields - B/op | 96 B/op | 56 B/op | 1.71 |
| Benchmark_JsonFlattner_Evaluate_ContextFields - allocs/op | 8 allocs/op | 4 allocs/op | 2 |

This comment was automatically generated by workflow using github-action-benchmark.
