regexp: confusing behavior on invalid utf-8 sequences

The following program:

``` go
package main

import "regexp"

func main() {
    re := regexp.MustCompile(".")
    println(re.MatchString("\xd1"))
    println(re.MatchString("\xd1\x84"))
    println(re.MatchString("\xd1\xd1"))
    re = regexp.MustCompile("..")
    println(re.MatchString("\xd1"))
    println(re.MatchString("\xd1\x84"))
    println(re.MatchString("\xd1\xd1"))
}
```

prints:

```
true
true
true
false
false
true
```

While the following C++ program:

``` cpp
#include <stdio.h>
#include <re2/re2.h>

int main() {
    RE2 re1(".");
    printf("%d\n", RE2::PartialMatch("\xd1", re1));
    printf("%d\n", RE2::PartialMatch("\xd1\x84", re1));
    printf("%d\n", RE2::PartialMatch("\xd1\xd1", re1));
    RE2 re2(".");
    printf("%d\n", RE2::PartialMatch("\xd1", re2));
    printf("%d\n", RE2::PartialMatch("\xd1\x84", re2));
    printf("%d\n", RE2::PartialMatch("\xd1\xd1", re2));
}
```

prints:

```
0
1
0
0
1
0
```

This raises two questions:
1. Why does the behavior differ between Go's regexp and RE2? (RE2 seems to be more consistent.)
2. Why does "\xd1\xd1" match both "." and ".."? I could understand it matching one or the other, but not both. Is it one character or two?

go version devel +b0532a9 Mon Jun 8 05:13:15 2015 +0000 linux/amd64

