PCRE lookaheads and case sensitivity
Throughout my career as a developer I've been a big fan of regular expressions. I've read books and articles on the topic, and I even created a sandbox of common and popular challenges to practice on. Until recently, I thought I had seen almost every kind of task and challenge related to regular expressions. However, a few days ago, I came across an interesting problem involving case sensitivity that was entirely new to me.
What is this case? While improving the RAKE PHP library, I found myself in the following situation. Imagine there is a piece of text and the requirement is to remove a bunch of words from it. One of the possible solutions is to use a regular expression.
$str = 'My new home is ...';
preg_replace('#\bnew\b#i', '', $str); // "My  home is ..." (the spaces around the removed word remain)
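Since the requirement is to remove a bunch of words rather than just one, the same idea can be extended to a word list. The following is only a minimal sketch: the `$stopWords` list is hypothetical and is not taken from the RAKE PHP library, and the final whitespace clean-up is an assumption about what the caller wants.

```php
<?php
// Hypothetical stop-word list, used for illustration only.
$stopWords = ['new', 'is'];

// Escape every word so that regex metacharacters are treated literally.
$escaped = array_map(function (string $word): string {
    return preg_quote($word, '#');
}, $stopWords);

// Builds #\b(?:new|is)\b#i: whole words only, in any letter case.
$pattern = '#\b(?:' . implode('|', $escaped) . ')\b#i';

$result = preg_replace($pattern, '', 'My new home is ...');

// Removing a word leaves its surrounding spaces behind, so collapse them.
$result = preg_replace('#\s+#', ' ', $result);

echo $result; // "My home ..."
```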
Pretty easy so far. But what if we need to keep a word when it appears in one specific form (by "form" I mean a particular combination of upper- and lower-case letters)? For example, "my new home is New York". Obviously, I don't want to remove "New" from "New York"; that wouldn't make any sense. We can try to improve the regular expression by using a negative lookahead assertion.
$str = 'My new home is New York City.';
preg_replace('#\b(?!New)new\b#i', '', $str); // "My new home is New York City."
But wait, this does not work: nothing gets removed at all. Because the whole expression is case-insensitive, the lookahead (?!New) also rejects "new", so the pattern can never match. This is where things start to get interesting.
In PCRE there is the possibility to enable or disable the case-insensitive behavior for a part of an expression.
- (?i) the case-insensitive behavior is turned on
- (?-i) the case-insensitive behavior is turned off
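To see the two switches in isolation, here is a small sketch using `preg_match`. The one-letter patterns and the `$r1` to `$r4` variable names are made up purely for the demonstration.

```php
<?php
// (?i) turns case-insensitivity on from that point onward.
$r1 = preg_match('#a(?i)b#', 'aB'); // 1: "b" is matched case-insensitively
$r2 = preg_match('#a(?i)b#', 'AB'); // 0: "a" precedes (?i), so it stays case-sensitive

// (?-i) turns it off again, overriding the trailing i flag for what follows.
$r3 = preg_match('#a(?-i)b#i', 'Ab'); // 1: "a" is case-insensitive, "b" matches exactly
$r4 = preg_match('#a(?-i)b#i', 'AB'); // 0: "b" must now be lower-case

var_dump($r1, $r2, $r3, $r4); // int(1) int(0) int(1) int(0)
```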
Knowing this, we can adjust the expression - turn the case-insensitivity off before the negative lookahead, then turn it on again.
$str = 'My new home is New York City.';
preg_replace('#\b(?-i)(?!New)(?i)new\b#i', '', $str); // "My  home is New York City." (the spaces around the removed word remain)
Now, the regular expression works as expected.
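A possible variation: in PCRE an option change made inside a group, including a lookahead, only lasts until the end of that group, so (?-i) can be placed inside the lookahead and there is no need to switch case-insensitivity back on afterwards. The sketch below wraps this in a helper. The function name `removeWordExcept` and its parameters are hypothetical and exist only for this example.

```php
<?php
/**
 * Hypothetical helper: removes $word in any letter case, except where it is
 * spelled exactly like one of $protectedForms.
 */
function removeWordExcept(string $text, string $word, array $protectedForms): string
{
    $quote = function (string $s): string {
        return preg_quote($s, '#');
    };

    // Without protected forms no lookahead is needed.
    $guard = '';
    if ($protectedForms !== []) {
        // (?-i) is scoped to the lookahead group, so only the protected
        // forms are compared case-sensitively.
        $guard = '(?!(?-i)(?:' . implode('|', array_map($quote, $protectedForms)) . ')\b)';
    }

    $pattern = '#\b' . $guard . $quote($word) . '\b#i';

    return preg_replace($pattern, '', $text);
}

// Prints "My  home is New York City." (the spaces around the removed word remain).
echo removeWordExcept('My new home is New York City.', 'new', ['New']);
```

Passing several protected forms, for example `['New', 'NEW']`, keeps each of those exact spellings while every other casing of the word is still removed.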
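P.S. The most fascinating part is that these inline modifiers are not limited to PHP. They are a feature of PCRE, so the pattern can be used in any language that relies on this library, and several other regex engines support the same syntax as well. For example, Java's java.util.regex (a separate engine, not PCRE) accepts an almost identical pattern: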
public class RegexNegativeLookahead {
    public static void main(String[] args) {
        String str = "My new home is New York City.";
        // Java has no trailing flag syntax, so the pattern starts with (?i).
        String result = str.replaceAll("(?i)\\b(?-i)(?!New)(?i)new\\b", "");
        System.out.println(result); // "My  home is New York City." (the spaces around the removed word remain)
    }
}