Java RegEx mini tutorial: replacing tricky characters

Recently I have been looking into a DSL for expressing order promotion eligibility. The backing engine for this was Groovy, which allows runtime scripts to be produced from user input. This makes it very powerful, since the full potential of your application can be used in creating promotion conditions. However, with this power comes the danger that users may misuse the language provided and mutate objects. Therefore a decision was made to replace all assignment (=) operators with equality (==) when preprocessing user input, thus disallowing any updates to the objects in the evaluation context.

So basically we needed to replace = with == in a multiline Groovy script snippet.

In order to do this we need to consider the following cases:

  • line beginning with =
  • line ending with =
  • line that contains =
  • must not replace ==, >= and <= as these are valid read-only comparison operators

Now we could do this by writing handlers for each of those cases. However, with the Java RegEx negative lookahead (?!) and negative lookbehind (?<!) operators we can easily do this in one operation.

Although I knew exactly what to do, RegEx always has a knock-out effect on me. So let's go through this mini tutorial to see exactly what is going on.

So we can rethink our requirement as:

  • We are looking for the = character
  • It must not be preceded by =, < or >
  • It must not be followed by =, < or >

The RegEx looks like this: (?<![=><])=(?![=><])

= - the middle part, looking for the = character, which may be preceded or followed by a new line.

(?<![=><]) - negative lookbehind (i.e. ?<!) for =, < or > to make sure this is really an assignment operator and not ==, >= or <=.

(?![=><]) - negative lookahead (i.e. ?!) for =, < or > to make sure this is really an assignment operator and not ==, => or =<.
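To see the lookbehind and lookahead working on their own, here is a minimal sketch that applies only this one pattern. The class name LookaroundDemo and the method name assignToEquals are made up for illustration and are not part of the original sources.

```java
import java.util.regex.Pattern;

class LookaroundDemo {

    // A lone '=': not preceded by '=', '<' or '>' and not followed by '=', '<' or '>'.
    private static final Pattern ASSIGNMENT = Pattern.compile("(?<![=><])=(?![=><])");

    // Hypothetical helper: turns every lone '=' into '=='.
    static String assignToEquals(final String expression) {
        return ASSIGNMENT.matcher(expression).replaceAll("==");
    }

    public static void main(final String[] args) {
        System.out.println(assignToEquals("X=Y"));   // X==Y
        System.out.println(assignToEquals("X==Y"));  // X==Y (unchanged)
        System.out.println(assignToEquals("X>=Y"));  // X>=Y (unchanged)
        System.out.println(assignToEquals("X=>Y"));  // X=>Y (unchanged)
    }
}
```

Note that the lookarounds only inspect the neighbouring characters and do not consume them, so the replacement touches nothing but the = itself.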

There are a few more issues that we need to fix here:

  • We should also clean up new lines around the assignment, as these can cause issues
  • => and =< are not actually valid Java operators, so there should be a replace call to turn them into >= and <=.

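To overcome the first problem we need the following RegEx: (\s*)([=><]+)(\s*). The code below compiles it with the Pattern.MULTILINE flag, although that flag only changes how ^ and $ match, so this pattern behaves the same without it, because \s already matches line breaks.

(\s*) - Any number of whitespace characters (\s is short for [ \t\n\x0b\r\f]) preceding or following the operator.

([=><]+) - Any combination of =, < or > with at least one character, which accounts for all kinds of comparisons including assignment (e.g. =>, >, =, ==, === and so on). Note that the + is inside the capturing group: we want to capture all of the operator characters, so that $2 in the replacement holds the whole operator.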
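As a small illustration of this step in isolation, the sketch below strips the whitespace (including line breaks) around any run of =, < or >. The names OperatorWhitespaceDemo and tighten are hypothetical, and the pattern is compiled without any flags here.

```java
import java.util.regex.Pattern;

class OperatorWhitespaceDemo {

    // Group 2 is the operator itself; groups 1 and 3 are the whitespace around it.
    private static final Pattern OPERATOR = Pattern.compile("(\\s*)([=><]+)(\\s*)");

    // Hypothetical helper: keeps only the operator, dropping the surrounding whitespace.
    static String tighten(final String expression) {
        return OPERATOR.matcher(expression).replaceAll("$2");
    }

    public static void main(final String[] args) {
        System.out.println(tighten("X\n=\nY")); // X=Y
        System.out.println(tighten("X >= Y"));  // X>=Y
        System.out.println(tighten("A && B"));  // A && B (no operator from the set, unchanged)
    }
}
```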

Replacing => and =< does not need a RegEx since it can be done using a simple String.replace().

Let's see this code in action in JUnit:
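A minimal sketch of that last step, assuming the whitespace around the operators has already been removed. The names OperatorOrderDemo and normalise are made up for this example.

```java
class OperatorOrderDemo {

    // Hypothetical helper: swaps the mistyped operators into their valid form.
    static String normalise(final String expression) {
        return expression.replace("=<", "<=").replace("=>", ">=");
    }

    public static void main(final String[] args) {
        System.out.println(normalise("X=>Y")); // X>=Y
        System.out.println(normalise("X=<Y")); // X<=Y
        System.out.println(normalise("X>=Y")); // X>=Y (unchanged)
    }
}
```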

package denispavlov.sandbox.regex;

import org.junit.Test;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import static org.junit.Assert.assertEquals;

public class RegExTest {

    String ensureNoAssignmentsUsed(final String expression) {

        // Step 1: drop whitespace (including line breaks) around any run of =, < or >
        final Pattern noNewLinesPattern = Pattern.compile("(\\s*)([=><]+)(\\s*)", Pattern.MULTILINE);
        final Matcher noNewLinesMatcher = noNewLinesPattern.matcher(expression);
        String noNewLinesExpression = noNewLinesMatcher.replaceAll("$2");

        // Step 2: replace a lone = (not next to =, < or >) with ==
        final Pattern noAssignPattern = Pattern.compile("(?<![=><])=(?![=><])");
        final Matcher noAssignMatcher = noAssignPattern.matcher(noNewLinesExpression);
        String noAssignExpression = noAssignMatcher.replaceAll("==");

        // Step 3: turn the invalid =< and => into <= and >=
        noAssignExpression = noAssignExpression.replace("=<", "<=");
        noAssignExpression = noAssignExpression.replace("=>", ">=");

        return noAssignExpression;
    }


    @Test
    public void testRegExLookaround() throws Exception {


        // Preceding
        assertEquals("X==Y", ensureNoAssignmentsUsed("X\n=Y"));
        assertEquals("X>=Y", ensureNoAssignmentsUsed("X\n>=Y"));
        assertEquals("X<=Y", ensureNoAssignmentsUsed("X\n<=Y"));
        assertEquals("X>=Y", ensureNoAssignmentsUsed("X\n=>Y"));
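        assertEquals("X<=Y", ensureNoAssignmentsUsed("X\n=<Y"));
        assertEquals("X==Y", ensureNoAssignmentsUsed("X\n==Y"));

        // Following
        assertEquals("X==Y", ensureNoAssignmentsUsed("X=\nY"));
        assertEquals("X>=Y", ensureNoAssignmentsUsed("X>=\nY"));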
        assertEquals("X<=Y", ensureNoAssignmentsUsed("X<=\nY"));
        assertEquals("X>=Y", ensureNoAssignmentsUsed("X=>\nY"));
        assertEquals("X<=Y", ensureNoAssignmentsUsed("X=<\nY"));
        assertEquals("X==Y", ensureNoAssignmentsUsed("X==\nY"));

        // Own line
        assertEquals("X==Y", ensureNoAssignmentsUsed("X\n=\nY"));
        assertEquals("X>=Y", ensureNoAssignmentsUsed("X\n>=\nY"));
        assertEquals("X<=Y", ensureNoAssignmentsUsed("X\n<=\nY"));
        assertEquals("X>=Y", ensureNoAssignmentsUsed("X\n=>\nY"));
        assertEquals("X<=Y", ensureNoAssignmentsUsed("X\n=<\nY"));
        assertEquals("X==Y", ensureNoAssignmentsUsed("X\n==\nY"));

        // Middle
        assertEquals("X==Y", ensureNoAssignmentsUsed("X = Y"));
        assertEquals("X>=Y", ensureNoAssignmentsUsed("X >= Y"));
        assertEquals("X<=Y", ensureNoAssignmentsUsed("X <= Y"));
        assertEquals("X>=Y", ensureNoAssignmentsUsed("X => Y"));
        assertEquals("X<=Y", ensureNoAssignmentsUsed("X =< Y"));
        assertEquals("X==Y", ensureNoAssignmentsUsed("X == Y"));

        // Complex
        assertEquals("X==Y+Z && Y>=Z", ensureNoAssignmentsUsed("X=Y+Z && Y=>Z"));

    }
}

 

I hope this mini tutorial gave a bit more insight into what the lookaround functions of Java RegEx do. By no means is the above code complete; there are a few more improvements that could be made. The intention was to give you some insight into how to solve these kinds of problems in Java.
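As one possible example of such an improvement (a suggestion, not something taken from the original sources): the pattern as written treats the = in != as a lone assignment, so X!=Y would come out as X!==Y. A sketch of one way to handle that is to add ! to the lookbehind set. The class and method names below are hypothetical.

```java
import java.util.regex.Pattern;

class NotEqualsDemo {

    // The pattern from the tutorial.
    private static final Pattern ORIGINAL = Pattern.compile("(?<![=><])=(?![=><])");

    // Assumed variant: '!' is added to the lookbehind so that != is left alone.
    private static final Pattern WITH_NOT_EQUALS = Pattern.compile("(?<![=><!])=(?![=><])");

    static String original(final String expression) {
        return ORIGINAL.matcher(expression).replaceAll("==");
    }

    static String keepingNotEquals(final String expression) {
        return WITH_NOT_EQUALS.matcher(expression).replaceAll("==");
    }

    public static void main(final String[] args) {
        System.out.println(original("X!=Y"));         // X!==Y
        System.out.println(keepingNotEquals("X!=Y")); // X!=Y
        System.out.println(keepingNotEquals("X=Y"));  // X==Y
    }
}
```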

Let me know if you have some suggestions.

 

P.S. There is an awesome online RegEx tester

Latest sources are on GitHub

This page was last updated on: 27/03/2014 14:02