added regex functionality for allow lists in the analyzer #1357
Conversation
@microsoft-github-policy-service agree |
Thanks! How would you suggest differentiating this from the original allow list capabilities? When should users use which one? |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
The regex functionality would make the original allow list obsolete, since the regex allow list includes the functionality of the regular allow list.
It also makes the allow list more useful. For example, if I want to allow a certain company to pass, then I would like all of its variants to pass: company, COMPANY, Company, Company group, Company LLC. This is not possible with the current setup.
If we included this regex_allow_list, we could in essence delete the regular allow list.
Anyone who wanted to keep the regular behavior would have to change every entry from "company" to "\bcompany$" with no flags.
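To make the difference concrete, here is a minimal sketch in plain Python. It does not call Presidio; `is_allowed_exact` and `is_allowed_regex` are hypothetical helper names used only for illustration, and the sketch assumes each regex entry is applied with `re.search`.

```python
import re

variants = ["company", "COMPANY", "Company", "Company group", "Company LLC"]

def is_allowed_exact(word, allow_list):
    # Current behavior as described above: the word must equal an entry.
    return word in allow_list

def is_allowed_regex(word, patterns, flags=0):
    # Proposed behavior (assumed): the word passes if any pattern is found in it.
    return any(re.search(p, word, flags) for p in patterns)

exact_hits = [v for v in variants if is_allowed_exact(v, ["company"])]
regex_hits = [v for v in variants if is_allowed_regex(v, ["company"], re.IGNORECASE)]

print(exact_hits)
print(regex_hits)
```

With a single entry, the exact list only lets the lowercase spelling through, while the case-insensitive regex entry covers every variant listed above.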
|
Is it not possible because of regex flags? Or something else? |
If you could give an example where you call the new regex_allow_list, that'd be great. |
Let's take an example:
I love travelling to New York City.
I love new york.
I love London!
The NER would flag New York City, new york and London as locations. Let's say we want to allow every mention of New York and still have London flagged. In the current setup we would have to list every possible way New York can be written:
"New York", "new york", "New york", "NEW YORK CITY"... This is because the allow list is currently based on exact matches. This is not practical: my allow lists currently have 10+ versions of a certain word. It would be much better to allow entries based on regex. Then simply "New York", matched case-insensitively, would cover all these cases.
|
If we made the deny list case insensitive, would that support your use case? |
A deny list applies to the pattern recognizer, which is used to detect entities. In my use case I want to remove entities that have already been detected.
The goal is to have:
I love travelling to New York City.
I love new york.
I love <LOCATION>!
London and new york are detected as LOCATION by the spaCy NER, but only London will be masked, because I will pass regex_allow_list = ["New york"] and regex_flags = re.IGNORECASE.
Without the regex_allow_list I would have to pass ["New York City", "new york"] and many more variations for future use cases.
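The intended result can be sketched without Presidio as follows. This is only an illustration under stated assumptions: the spans are hard-coded stand-ins for what the NER would return, `filter_and_mask` is a hypothetical helper, and matching with `re.search` is an assumption about how the regex entries would be applied.

```python
import re

def filter_and_mask(text, spans, regex_allow_list, regex_flags=0):
    """Drop spans whose text matches an allow-list pattern, then mask the rest."""
    kept = [
        (start, end, label)
        for start, end, label in spans
        if not any(re.search(p, text[start:end], regex_flags) for p in regex_allow_list)
    ]
    # Replace from right to left so earlier offsets stay valid.
    for start, end, label in sorted(kept, reverse=True):
        text = text[:start] + f"<{label}>" + text[end:]
    return text

text = "I love travelling to New York City. I love new york. I love London!"

# Stand-in for NER output: (start, end, entity_type) for each detected location.
spans = [
    (text.index(w), text.index(w) + len(w), "LOCATION")
    for w in ("New York City", "new york", "London")
]

masked = filter_and_mask(text, spans, ["New york"], re.IGNORECASE)
print(masked)
```

Passing the same entry without `re.IGNORECASE` would mask all three locations, which is the situation that forces long exact-match lists today.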
|
Sorry, I meant allow list :) if we make the existing allow list case insensitive, would that be ok? I would rather not add more parameters or capabilities as it's confusing to users, and I'd rather not remove parameters as it's not backward compatible |
This would improve the allow list a lot, but it would not cover variants of a word, which are sometimes unpredictable. New York can be written in at least 5 different ways, each with different possibilities for capitalization.
I understand that we cannot remove the allow list as is, since that would break applications. That's why I left the regular allow_list as is.
What we could do is include all the functionality in allow_list. We could add a boolean parameter that signifies whether the allow list should be interpreted as regex strings or as exact matches, and an additional parameter to modify the flags. Both can be turned off by default, which means the functionality stays the same as it is now.
I do believe this is a beneficial update: in my use case I do not use the allow list from Presidio at all, and I modify the results of the analyzer myself with regex.
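A rough sketch of that single-list design, in plain Python rather than Presidio code. The names `is_in_allow_list`, `use_regex` and `regex_flags` are placeholders for illustration and are not taken from the PR; the point is only that the defaults reproduce today's exact matching.

```python
import re
from typing import List

def is_in_allow_list(
    word: str,
    allow_list: List[str],
    use_regex: bool = False,  # hypothetical switch; off by default
    regex_flags: int = 0,     # hypothetical flags; only used when use_regex is True
) -> bool:
    if not use_regex:
        # Default path: exact matching, same as the existing behavior.
        return word in allow_list
    # Opt-in path: treat every entry as a regular expression.
    return any(re.search(pattern, word, regex_flags) for pattern in allow_list)

# Existing callers are unaffected by the new parameters.
print(is_in_allow_list("New York", ["New York"]))
print(is_in_allow_list("new york city", ["New York"]))

# Callers who opt in get regex matching with their chosen flags.
print(is_in_allow_list("new york city", ["New York"], use_regex=True, regex_flags=re.IGNORECASE))
```

Keeping one list with an opt-in switch avoids a second allow-list parameter while leaving existing calls untouched.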
|
Enhancing the existing capability with parameters that control how it is processed sounds like a great improvement! |
Hi, this is a great addition! Left a few comments mainly around being more explicit on what this does and how.
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
Thanks! Approved
Great, thanks for reviewing! Indeed, a big improvement :) |
Change Description
Although the allow_list functionality is useful, it is not very practical because it does not support regex. I added regex support while still allowing the regular allow list to exist.
Issue reference
This PR fixes issue NONE
Checklist