kaizen: add equals-ignore-case pattern (#340)

* kaizen: add equals-ignore-case pattern
  addresses: #186
  Signed-off-by: Tim Bray <[email protected]>
* reset CI/CD to not fail on slowdown
  Signed-off-by: Tim Bray <[email protected]>
* enable github token for benchmarker
  Signed-off-by: Tim Bray <[email protected]>

---------

Signed-off-by: Tim Bray <[email protected]>
timbray · Jul 22, 2024 · 6d34146 · 6d34146 · github-actions · Jul 22, 2024
1 parent 8da5067
commit 6d34146

Showing 18 changed files with 959 additions and 9 deletions.
diff --git a/.github/workflows/benchmarks.yml b/.github/workflows/benchmarks.yml
@@ -42,6 +42,7 @@ jobs:
         with:
           name: Go Benchmark
           tool: "go"
+          github-token: ${{ secrets.GITHUB_TOKEN }}
 
           # Compare results against json from cache
           output-file-path: output.txt
@@ -53,7 +54,7 @@ jobs:
           # Alert on regression
           alert-threshold: "120%"
           fail-on-alert: false
-          comment-on-alert: false
+          comment-on-alert: true
 
           # Disable github pages, for now.
           auto-push: false
diff --git a/.github/workflows/go-unit-tests.yaml b/.github/workflows/go-unit-tests.yaml
@@ -63,7 +63,7 @@ jobs:
         env:
           COVER_OPTS: ${{ matrix.coveropts }}
           GOFLAGS: ${{ matrix.goflags }}
-        run: go test $COVER_OPTS ./... | tparse -all -notests -format markdown >> $GITHUB_STEP_SUMMARY
+        run: go test $COVER_OPTS | tparse -all -notests -format markdown >> $GITHUB_STEP_SUMMARY
 
       - if: steps.codecov-enabled.outputs.files_exists == 'true'
         name: Upload Codecov Report

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -86,7 +86,7 @@ all the available tests with race-detection enabled, and
 is an essential step before submitting any changes:
 
 ```shell
-go test -race -v -count 1 ./...
+go test -race -v -count 1
 ```
 
 The following command runs the Go linter; submissions 
@@ -101,6 +101,24 @@ in all the Quamina subdirectories so you’ll have to do
 this by hand.  `golangci-lint` has a home page with
 instructions for installing it.
 
+### Rebuilding the Case-folding Table
+
+Quamina's `ignore-case` patterns rely on mappings found
+in the generated source file `case_folding.go`. Quamina
+includes a program called `code_gen` in the `code_gen/`
+directory. There is a `Makefile` whose only function is
+to check the mapping file and rebuild it if it is older
+than three months, because a Unicode version release may
+have added mappings.
+
+As a result, it is good practice to type `make` at some
+point while building and submitting a PR. This rebuilds
+and re-runs `code_gen`, which displays a message saying
+whether or not it rebuilt the case-folding mappings. If
+it did rebuild those mappings, please include the
+generated `case_folding.go` source in your commit and
+PR.
+
 ## Reporting Bugs and Creating Issues
 
 When opening a new issue, try to roughly follow the commit message format

diff --git a/Makefile b/Makefile
@@ -0,0 +1,5 @@
+# The only purpose of this makefile is to run code_gen/code_gen, which will rebuild the case_folding.go file if
+# it is more than three months out of date
+casefold:
+	@ cd code_gen && go build && cd ..
+	@ code_gen/code_gen
diff --git a/PATTERNS.md b/PATTERNS.md
@@ -179,6 +179,18 @@ The following Shellstyle Patterns would match it:
 {"img": [ {"shellstyle": "https://example.com/*.jpg"} ] }
 {"img": [ {"shellstyle": "https://example.*/*.jpg"} ] }
 ```
+### Equals-Ignore-Case Pattern
+
+The Pattern Type of an Equals-Ignore-Case pattern is `equals-ignore-case`
+and its value **MUST** be a string. Quamina attempts to match with
+case folding in effect, as discussed in Section 3.13 of the Unicode
+Standard. Quamina uses the case-folding mappings provided in the file
+CaseFolding.txt in the Unicode Character Database to generate its mappings.
+Note that case folding is highly dependent on the specifics of the language
+in use. In certain locales, this default mapping may not produce satisfactory
+results, although results are good for ASCII and "simple" characters from
+other alphabets.
+
 ## EventBridge Patterns
 
 Quamina’s Patterns are inspired by those offered by
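
To make the semantics of the `equals-ignore-case` section above concrete, here is a small sketch that uses Go's standard `strings.EqualFold`, which compares strings under Unicode simple case folding. It is not Quamina's implementation (Quamina builds a finite automaton instead), and `equalsIgnoreCase` is a hypothetical helper introduced only for this illustration. The last call shows the kind of locale-dependent case where one-to-one folding falls short.

```go
package main

import (
	"fmt"
	"strings"
)

// equalsIgnoreCase is a hypothetical helper, not part of Quamina. It compares
// a pattern value and an event value under simple case folding, as provided
// by the standard library.
func equalsIgnoreCase(pattern, value string) bool {
	return strings.EqualFold(pattern, value)
}

func main() {
	fmt.Println(equalsIgnoreCase("VIEW FROM 15th FLOOR", "View from 15th Floor")) // true
	fmt.Println(equalsIgnoreCase("jefferson", "JEFFERSON"))                       // true
	// Simple folding maps one character to one character, so the German
	// sharp s does not match the two-letter sequence "SS".
	fmt.Println(equalsIgnoreCase("Straße", "STRASSE")) // false
}
```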

diff --git a/README.md b/README.md
@@ -123,6 +123,13 @@ The following Patterns would match it:
   }
 } 
 ```
+```json
+{
+  "Image": {
+    "Title": [ { "equals-ignore-case": "VIEW FROM 15th FLOOR" } ] 
+  }
+}
+```
 The syntax and semantics of Patterns are fully specified
 in [Patterns in Quamina](PATTERNS.md).
 

diff --git a/case_folding.go b/case_folding.go
diff --git a/cl2_test.go b/cl2_test.go
@@ -129,7 +129,35 @@ var (
 			"  }\n" +
 			"}",
 	}
-	shellstyleMatches = []int{490, 713, 43, 2540, 1}
+	shellstyleMatches     = []int{490, 713, 43, 2540, 1}
+	equalsIgnoreCaseRules = []string{
+		"{\n" +
+			"  \"properties\": {\n" +
+			"    \"STREET\": [ { \"equals-ignore-case\": \"jefferson\" } ]\n" +
+			"  }\n" +
+			"}",
+		"{\n" +
+			"  \"properties\": {\n" +
+			"    \"STREET\": [ { \"equals-ignore-case\": \"bEaCh\" } ]\n" +
+			"  }\n" +
+			"}",
+		"{\n" +
+			"  \"properties\": {\n" +
+			"    \"STREET\": [ { \"equals-ignore-case\": \"HyDe\" } ]\n" +
+			"  }\n" +
+			"}",
+		"{\n" +
+			"  \"properties\": {\n" +
+			"    \"STREET\": [ { \"equals-ignore-case\": \"CHESTNUT\" } ]\n" +
+			"  }\n" +
+			"}",
+		"{\n" +
+			"  \"properties\": {\n" +
+			"    \"ST_TYPE\": [ { \"equals-ignore-case\": \"st\" } ]\n" +
+			"  }\n" +
+			"}",
+	}
+	equalsIgnoreCaseMatches = []int{131, 211, 1758, 825, 116386}
 	/* will add when we have numeric
 	complexArraysRules := []string{
 		"{\n" +
@@ -235,6 +263,10 @@ func TestRulerCl2(t *testing.T) {
 	bm = newBenchmarker()
 	bm.addRules(shellstyleRules, shellstyleMatches, true)
 	fmt.Printf("SHELLSTYLE events/sec: %.1f\n", bm.run(t, lines))
+
+	bm = newBenchmarker()
+	bm.addRules(equalsIgnoreCaseRules, equalsIgnoreCaseMatches, true)
+	fmt.Printf("EQUALS_IGNORE-CASE events/sec: %.1f\n", bm.run(t, lines))
 }
 
 type benchmarker struct {

diff --git a/code_gen/build_casefolding_table.go b/code_gen/build_casefolding_table.go
diff --git a/core_matcher.go b/core_matcher.go
@@ -213,7 +213,7 @@ func (m *coreMatcher) matchesForFields(fields []Field) ([]X, error) {
 	// over-equipped M2 MBPro, but probably not on some miserable cloud event-handling worker.
 	// Conclusion: I dunno. I left the analyze() func in but for now, don't use its results in
 	// production.
-	var bufs *bufpair = &bufpair{}
+	var bufs = &bufpair{}
 	/*
 		if cmFields.nfaMeta.maxOutDegree < 2 {
 			bufs = &bufpair{}

diff --git a/core_matcher_test.go b/core_matcher_test.go
@@ -219,6 +219,7 @@ func TestExerciseMatching(t *testing.T) {
 		`{"Image": { "Title": [ {"anything-but":  ["Pikachu", "Eevee"] } ]  } }`,
 		`{"Image": { "Thumbnail": { "Url": [ { "prefix": "https:" } ] } } }`,
 		`{"Image": { "Thumbnail": { "Url": [ "a", { "prefix": "https:" } ] } } }`,
+		`{"Image": { "Title": [ { "equals-ignore-case": "VIEW FROM 15th FLOOR" } ] } }`,
 	}
 
 	var err error
@@ -276,6 +277,7 @@ func TestExerciseMatching(t *testing.T) {
 			t.Error("add one of many: " + err.Error())
 		}
 	}
+	fmt.Println("MS: " + matcherStats(m))
 	matches, err := m.matchesForJSONEvent([]byte(j))
 	if err != nil {
 		t.Error("m4J on all: " + err.Error())

diff --git a/monocase.go b/monocase.go
@@ -0,0 +1,89 @@
+package quamina
+
+import (
+	"errors"
+	"fmt"
+	"unicode/utf8"
+)
+
+func readMonocaseSpecial(pb *patternBuild, valsIn []typedVal) (pathVals []typedVal, err error) {
+	t, err := pb.jd.Token()
+	if err != nil {
+		return
+	}
+	pathVals = valsIn
+
+	monocaseString, ok := t.(string)
+	if !ok {
+		err = errors.New("value for 'equals-ignore-case' must be a string")
+		return
+	}
+	val := typedVal{
+		vType: monocaseType,
+		val:   `"` + monocaseString + `"`,
+	}
+	pathVals = append(pathVals, val)
+
+	// has to be } or tokenizer will throw error
+	_, err = pb.jd.Token()
+	return
+}
+
+// makeMonocaseFA builds an FA to match "ignore-case" patterns. Section 3.13 of the Unicode Standard specifies
+// case folding, relying on the file CaseFolding.txt in the Unicode Character Database. This function uses the
+// "Simple" flavor of casefolding, i.e. the lines in CaseFolding.txt that are marked with "C". The discussion in
+// the Unicode standard doesn't mention this, but the algorithm essentially replaces upper-case characters with
+// lower-case equivalents.
+// We need to exercise caution to keep from creating states wastefully. For "CAT", after matching '"',
+// you transition on either 'c' or 'C' but in this particular case you want to transition to the same
+// next state. Note that there are many characters in Unicode where the upper and lower case forms are
+// multi-byte and in fact not even the same number of bytes. So in that case you need two paths forward that step
+// through the bytes of each form and then rejoin to arrive at a state. Also note
+// that in many cases the upper/lower case versions of a rune have leading bytes in common.
+func makeMonocaseFA(val []byte, pp printer) (*smallTable, *fieldMatcher) {
+	fm := newFieldMatcher()
+	index := 0
+	table := newSmallTable() // start state
+	startTable := table
+	var nextStep *faNext
+	for index < len(val) {
+		var orig, alt []byte
+		r, width := utf8.DecodeRune(val[index:])
+		orig = val[index : index+width]
+		altRune, ok := caseFoldingPairs[r]
+		if ok {
+			alt = make([]byte, utf8.RuneLen(altRune))
+			utf8.EncodeRune(alt, altRune)
+		}
+		nextStep = &faNext{states: []*faState{{table: newSmallTable()}}}
+		pp.labelTable(nextStep.states[0].table, fmt.Sprintf("On %d, alt=%v", val[index], alt))
+		if alt == nil {
+			// easy case, no casefolding issues.  We should maybe try to coalesce these
+			// no-casefolding sections and only call makeFAFragment once for all of them
+			origFA := makeFAFragment(orig, nextStep, pp)
+			table.addByteStep(orig[0], origFA)
+		} else {
+			// two paths to next state
+			// but they might have a common prefix
+			var commonPrefix int
+			for commonPrefix = 0; orig[commonPrefix] == alt[commonPrefix]; commonPrefix++ {
+				prefixNext := &faNext{states: []*faState{{table: newSmallTable()}}}
+				table.addByteStep(orig[commonPrefix], prefixNext)
+				table = prefixNext.states[0].table
+				pp.labelTable(table, fmt.Sprintf("common prologue on %v", orig[commonPrefix]))
+			}
+			// now build automata for the orig and alt versions of the char
+			// TODO: make sure that makeFAFragment works with length == 1
+			origFA := makeFAFragment(orig[commonPrefix:], nextStep, pp)
+			altFA := makeFAFragment(alt[commonPrefix:], nextStep, pp)
+			table.addByteStep(orig[commonPrefix], origFA)
+			table.addByteStep(alt[commonPrefix], altFA)
+		}
+		table = nextStep.states[0].table
+		index += width
+	}
+	laststate := &faState{table: newSmallTable(), fieldTransitions: []*fieldMatcher{fm}}
+	lastStep := &faNext{states: []*faState{laststate}}
+	nextStep.states[0].table.addByteStep(valueTerminator, lastStep)
+	return startTable, fm
+}
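
The comment on `makeMonocaseFA` notes that the two case forms of a rune may have different UTF-8 lengths and may share leading bytes. The sketch below illustrates only that byte-level observation, without building any automaton. `demoFolds` is a tiny hypothetical stand-in for the generated `caseFoldingPairs` table, and `bytePaths` is a helper invented for this example.

```go
package main

import "fmt"

// demoFolds is a hypothetical, hand-written stand-in for the generated
// caseFoldingPairs table; the real table is derived from CaseFolding.txt.
var demoFolds = map[rune]rune{
	'C':      'c',
	'\u00C9': '\u00E9', // É -> é, both two bytes in UTF-8
	'\u023A': '\u2C65', // Ⱥ -> ⱥ, two bytes versus three bytes
}

// bytePaths returns the UTF-8 bytes of r, the UTF-8 bytes of its folding
// partner (nil if it has none), and how many leading bytes the two share.
func bytePaths(r rune, folds map[rune]rune) (orig, alt []byte, common int) {
	orig = []byte(string(r))
	altRune, ok := folds[r]
	if !ok {
		return orig, nil, 0
	}
	alt = []byte(string(altRune))
	for common < len(orig) && common < len(alt) && orig[common] == alt[common] {
		common++
	}
	return orig, alt, common
}

func main() {
	for _, r := range "C\u00C9\u023A1" {
		orig, alt, common := bytePaths(r, demoFolds)
		fmt.Printf("%q: orig=% X alt=% X common prefix=%d\n", r, orig, alt, common)
	}
}
```

For `É` the two forms share the lead byte `C3`, so an automaton can take one shared step before forking. For `Ⱥ` the forms share nothing and differ in length, so two separate paths must rejoin at the next state.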
| Benchmark suite | Current: `6d34146` | Previous: `78e2ec8` | Ratio |
|-|-|-|-|
| `BenchmarkCityLots` | `6836` ns/op 823 B/op 33 allocs/op | `5592` ns/op 773 B/op 31 allocs/op | `1.22` |
| `BenchmarkCityLots - ns/op` | `6836` ns/op | `5592` ns/op | `1.22` |
| `Benchmark_JsonFlattner_Evaluate_ContextFields` | `1217` ns/op 96 B/op 8 allocs/op | `726.2` ns/op 56 B/op 4 allocs/op | `1.68` |
| `Benchmark_JsonFlattner_Evaluate_ContextFields - ns/op` | `1217` ns/op | `726.2` ns/op | `1.68` |
| `Benchmark_JsonFlattner_Evaluate_ContextFields - B/op` | `96` B/op | `56` B/op | `1.71` |
| `Benchmark_JsonFlattner_Evaluate_ContextFields - allocs/op` | `8` allocs/op | `4` allocs/op | `2` |