Enable regex flags manipulation #1193

Merged
merged 12 commits into main from omri/regex_flags
Oct 26, 2023
Conversation

Contributor

@omri374 omri374 commented Oct 24, 2023

Change Description

As a user I would like to customize the regex flags used by Presidio, to be able to pass flags not defined initially.

Regex flags can be set on the PatternRecognizer's constructor, or in the RecognizerRegistry constructor to support all existing PatternRecognizers:

from presidio_analyzer import AnalyzerEngine, RecognizerRegistry

import regex as re

registry = RecognizerRegistry(global_regex_flags=re.DOTALL | re.MULTILINE | re.IGNORECASE)
engine = AnalyzerEngine(registry=registry)
engine.analyze(...)
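
The flags passed to `global_regex_flags` above are ordinary integer bitmasks combined with `|`. As a minimal sketch of what each of the three flags contributes, the snippet below uses the standard-library `re` module (not the third-party `regex` package imported above, and not Presidio itself) so that it runs without extra dependencies. The sample text and patterns are made up for illustration.

```python
import re

# Flags are integer bitmasks, so they combine with the bitwise OR operator.
flags = re.DOTALL | re.MULTILINE | re.IGNORECASE

# Made-up sample text, for illustration only.
text = "Name: ALICE\nid: 42"

# IGNORECASE: the lowercase pattern matches the uppercase text.
ignorecase_hit = re.search(r"alice", text, flags) is not None

# MULTILINE: ^ anchors at the start of every line, not only the start of the string.
multiline_hit = re.search(r"^id", text, flags) is not None

# DOTALL: . also matches a newline, so one match can span both lines.
dotall_hit = re.search(r"ALICE.id", text, flags) is not None

# Individual flags can be tested for, or removed, with bitwise operators.
has_ignorecase = bool(flags & re.IGNORECASE)
without_ignorecase = flags & ~re.IGNORECASE
```

Removing a flag with `& ~` is one way to derive a custom combination from an existing one before passing it on.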

Issue reference

This PR fixes issue #1029

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

# Conflicts:
#	mkdocs.yml
#	presidio-analyzer/Pipfile
#	presidio-analyzer/conf/default.yaml
#	presidio-analyzer/conf/spacy.yaml
#	presidio-analyzer/conf/spacy_multilingual.yaml
#	presidio-analyzer/conf/stanza.yaml
#	presidio-analyzer/conf/stanza_multilingual.yaml
#	presidio-analyzer/conf/transformers.yaml
#	presidio-analyzer/tests/conf/default.yaml
#	presidio-analyzer/tests/test_stanza_recognizer.py
@omri374
Contributor Author

omri374 commented Oct 24, 2023

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@omri374 omri374 marked this pull request as ready for review October 24, 2023 16:55
SharonHart previously approved these changes Oct 25, 2023
@omri374 omri374 requested a review from navalev October 25, 2023 12:26
navalev previously approved these changes Oct 25, 2023
SharonHart previously approved these changes Oct 25, 2023
@omri374 omri374 dismissed stale reviews from SharonHart and navalev via f334566 October 25, 2023 12:57
@omri374 omri374 merged commit b756c17 into main Oct 26, 2023
@SharonHart SharonHart deleted the omri/regex_flags branch October 31, 2023 11:17
@@ -37,9 +39,9 @@ def __init__(
deny_list: List[str] = None,
context: List[str] = None,
deny_list_score: float = 1.0,
global_regex_flags: Optional[int] = re.DOTALL | re.MULTILINE | re.IGNORECASE,
Contributor


This is pretty bad. You just changed the default value from re.DOTALL | re.MULTILINE to re.DOTALL | re.MULTILINE | re.IGNORECASE.

This affects us as a client of this library.

Contributor Author


@ducquangkstn thanks for the feedback. This change gives you more control over the regex flags. Is this blocking you in any way?

Contributor


Is this blocking you in any way?

Actually, no. It just took me a while to figure out why the behavior changed when bumping the Presidio version.

We (my company) are lucky that we have some unit tests. Not sure about other people.
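
The behavior change discussed in this thread can be reproduced in isolation. The sketch below compares the two default flag sets quoted in the review comment against a hypothetical pattern (`PATTERN` and `find_ids` are invented for illustration and are not Presidio recognizers). It uses the standard-library `re` module rather than Presidio, so it only approximates what a pattern recognizer would do.

```python
import re

# Defaults before and after this PR, as quoted in the review comment above.
OLD_DEFAULT_FLAGS = re.DOTALL | re.MULTILINE
NEW_DEFAULT_FLAGS = re.DOTALL | re.MULTILINE | re.IGNORECASE

# Hypothetical pattern: two uppercase letters followed by six digits.
PATTERN = r"\b[A-Z]{2}\d{6}\b"


def find_ids(text, flags):
    """Return every substring of text that matches PATTERN under the given flags."""
    return [m.group() for m in re.finditer(PATTERN, text, flags)]


sample = "valid: AB123456, lowercase: ab123456"

old_matches = find_ids(sample, OLD_DEFAULT_FLAGS)  # case-sensitive
new_matches = find_ids(sample, NEW_DEFAULT_FLAGS)  # case-insensitive

print(old_matches)
print(new_matches)
```

With the old flags only the uppercase identifier matches; with the new flags the lowercase one matches as well. Going by the PR description, callers who rely on case-sensitive matching could presumably pass `global_regex_flags=re.DOTALL | re.MULTILINE` to the `RecognizerRegistry` constructor to restore the previous behavior.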

4 participants