regexp: confusing behavior on invalid utf-8 sequences #11185

Closed
@dvyukov

Description

The following program:

package main

import "regexp"

func main() {
    re := regexp.MustCompile(".")
    println(re.MatchString("\xd1"))
    println(re.MatchString("\xd1\x84"))
    println(re.MatchString("\xd1\xd1"))
    re = regexp.MustCompile("..")
    println(re.MatchString("\xd1"))
    println(re.MatchString("\xd1\x84"))
    println(re.MatchString("\xd1\xd1"))
}

prints:

true
true
true
false
false
true

While the following C++ program:

#include <stdio.h>
#include <re2/re2.h>

int main() {
    RE2 re1(".");
    printf("%d\n", RE2::PartialMatch("\xd1", re1));
    printf("%d\n", RE2::PartialMatch("\xd1\x84", re1));
    printf("%d\n", RE2::PartialMatch("\xd1\xd1", re1));
    RE2 re2(".");
    printf("%d\n", RE2::PartialMatch("\xd1", re2));
    printf("%d\n", RE2::PartialMatch("\xd1\x84", re2));
    printf("%d\n", RE2::PartialMatch("\xd1\xd1", re2));
}

prints:

0
1
0
0
1
0

This raises two questions:

  1. Why does the behavior differ between regexp and re2 (re2 seems to be more consistent)?
  2. Why does "\xd1\xd1" match both "." and ".."? I could understand it matching one or the other, but not both; is it one character or two?

go version devel +b0532a9 Mon Jun 8 05:13:15 2015 +0000 linux/amd64
