A PCRE lookahead and case sensitivity

Throughout my career as a developer, I've been a big fan of regular expressions. I have read books and articles on the topic, and I even created a sandbox of common and popular challenges to play around with. Until recently, I thought I had seen almost every kind of task and challenge related to regular expressions. A few days ago, however, I came across an interesting problem involving case sensitivity that was completely new to me.

What is this case? While improving the RAKE PHP library, I found myself in the following situation. Imagine there is a piece of text, and the requirement is to remove a bunch of words from it. One possible solution is to use a regular expression.

$str = 'My new home is ...';
$result = preg_replace('#\bnew\b#i', '', $str); // "My  home is ..."

Pretty easy so far. But what if we need to keep a word when it appears in a specific form (by "form" I mean a specific capitalization)? For example: my new home is New York. Obviously, I don't want to remove "New" from "New York"; that wouldn't make any sense. A natural first attempt is to add a negative lookahead assertion that rejects the capitalized form.

$str = 'My new home is New York City.';
$result = preg_replace('#\b(?!New)new\b#i', '', $str); // "My new home is New York City."

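But wait! This doesn't work. Because the whole expression is case-insensitive, the lookahead (?!New) also matches "new", so it rejects every position where new could match and nothing is removed. This is where things get interesting.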

PCRE makes it possible to enable or disable case-insensitive matching for part of an expression by using inline modifiers:

  • (?i) the case-insensitive behavior is turned on
  • (?-i) the case-insensitive behavior is turned off
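To see the two modifiers in isolation, here is a minimal sketch. The sample strings are made up for illustration; each modifier applies from the point where it appears until the end of the pattern or the next modifier.

```php
<?php
// The trailing "i" flag makes the whole pattern case-insensitive by default.
$default = preg_match('#new york#i', 'NEW YORK');

// (?-i) switches case-insensitivity off from that point onwards.
$off = preg_match('#(?-i)new york#i', 'NEW YORK');

// (?i) switches it back on, so only "new" has to match exactly.
$mixed = preg_match('#(?-i)new(?i) york#i', 'new YORK');
$mixedWrongCase = preg_match('#(?-i)new(?i) york#i', 'New YORK');

var_dump($default, $off, $mixed, $mixedWrongCase);
// int(1) int(0) int(1) int(0)
```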

Knowing this, we can adjust the expression: turn case-insensitivity off before the negative lookahead, then turn it back on.

$str = 'My new home is New York City.';
$result = preg_replace('#\b(?-i)(?!New)(?i)new\b#i', '', $str); // "My  home is New York City."

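Now the regular expression works as expected: the lookahead rejects only the exact form "New", while "new" in any other capitalization is still matched and removed.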
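PCRE also offers a scoped form of the modifier, (?-i:...), which applies only inside its own group, so there is no need to switch case-insensitivity back on afterwards. As a rough sketch, the idea could be wrapped in a small helper. The name removeWordExcept and its parameters are hypothetical, used here only for illustration; they are not part of RAKE PHP.

```php
<?php
/**
 * Hypothetical helper (not part of RAKE PHP): removes $word regardless of
 * case, but keeps occurrences spelled exactly like one of $protectedForms.
 */
function removeWordExcept(string $text, string $word, array $protectedForms): string
{
    // Escape regex metacharacters and the "#" delimiter.
    $quote = function (string $s): string {
        return preg_quote($s, '#');
    };

    $lookahead = '';
    if ($protectedForms !== []) {
        // (?-i:...) makes only the lookahead case-sensitive.
        $alternatives = implode('|', array_map($quote, $protectedForms));
        $lookahead = '(?!(?-i:' . $alternatives . '))';
    }

    $pattern = '#\b' . $lookahead . $quote($word) . '\b#i';

    // preg_replace() returns null on error; fall back to the input then.
    return preg_replace($pattern, '', $text) ?? $text;
}

echo removeWordExcept('My new home is New York City.', 'new', ['New']);
// My  home is New York City.
```

For the example above, the generated pattern is #\b(?!(?-i:New))new\b#i, which behaves the same way as the version with separate (?-i) and (?i) toggles.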

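P.S. The most fascinating part is that this functionality is not specific to PHP. Inline modifiers are a feature of PCRE, so the pattern can be used in any language that relies on this library. Other regex engines with Perl-style syntax support them as well. For example, Java's java.util.regex is not PCRE, but it accepts the same (?i) and (?-i) flags: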

public class RegexNegativeLookahead {
    public static void main(String[] args) {
        String str = "My new home is New York City.";
        String result = str.replaceAll("(?i)\\b(?-i)(?!New)(?i)new\\b", "");

        System.out.println(result); // "My  home is New York City."
    }
}
