Java Regular Expressions

A regular expression defines a pattern that strings can be matched against.

Regular expressions can be used to search, edit or process text.

Regular expressions are not limited to a single language, but there are subtle differences in each language.

Regular expression example

A plain string is itself a simple regular expression. For example, the regular expression Hello World matches the string "Hello World".

The . (dot) is also a regular expression: it matches any single character, such as "a" or "1".

The following table lists some examples and descriptions of regular expressions:

Regular expression	Description
this is text	Matches the string "this is text" exactly.
this\s+is\s+text	Note the \s+ in the pattern. The \s+ after the word "this" matches one or more whitespace characters, then "is" is matched, then \s+ again matches one or more whitespace characters, followed by "text". Example that matches: "this is text".
^\d+(\.\d+)?	^ anchors the match at the start of the input. \d+ matches one or more digits. The ? makes the group in parentheses optional. \. matches a literal ".". Examples that match: "5", "1.5" and "2.21".
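
The patterns in the table can be tried out directly in Java. The following is a minimal sketch, not part of the original examples: the class name PatternTableDemo and both helper methods are made up for illustration. Note that every backslash is doubled inside the Java string literals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternTableDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Hypothetical helper: the text matched by ^\d+(\.\d+)? at the start
    // of the input, or null if the input does not start with a number.
    static String leadingNumber(String input) {
        Matcher m = Pattern.compile("^\\d+(\\.\\d+)?").matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("this\\s+is\\s+text", "this   is text")); // true
        System.out.println(matchesWhole("this\\s+is\\s+text", "thisistext"));     // false
        System.out.println(leadingNumber("2.21 kg"));                             // 2.21
        System.out.println(leadingNumber("kg 5"));                                // null
    }
}
```

The last call returns null because ^ only allows the match to begin at the start of the input.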

Java regular expressions are the most similar to Perl's.

The java.util.regex package mainly includes the following three classes:

Pattern class:
A Pattern object is a compiled representation of a regular expression. The Pattern class has no public constructor. To create a Pattern object, call one of its public static compile methods, which return a Pattern object. These methods take a regular expression as their first argument.
Matcher class:
A Matcher object is the engine that interprets the pattern and performs match operations against an input string. Like the Pattern class, Matcher has no public constructor. You obtain a Matcher object by calling the matcher method of a Pattern object.
PatternSyntaxException:
PatternSyntaxException is an unchecked exception that indicates a syntax error in a regular expression pattern.

The following example uses the regular expression .*bubble.* to check whether a string contains the substring "bubble":

Examples

import java.util.regex.*;

class RegexExample1 {
   public static void main(String[] args) {
      String content = "I am bubble " +
        "from interviewbubble.com.";

      // The .* on both sides lets "bubble" appear anywhere in the string
      String pattern = ".*bubble.*";

      // Pattern.matches requires the whole string to match the pattern
      boolean isMatch = Pattern.matches(pattern, content);
      System.out.println("Does the string contain the 'bubble' substring? " + isMatch);
   }
}

The output of the example is:

Does the string contain the 'bubble' substring? true

Capture group

A capture group is a way of treating multiple characters as a single unit. It is created by placing the characters to be grouped inside a pair of parentheses.

For example, a regular expression (dog) creates a single grouping containing "d", "o", and "g".

Capture groups are numbered by counting their opening parentheses from left to right. For example, in the expression ((A)(B(C))), there are four such groups:

((A)(B(C)))
(A)
(B(C))
(C)

You can find out how many groups an expression has by calling the groupCount method of the Matcher object. The groupCount method returns an int giving the number of capture groups in the matcher's pattern.

There is also a special group, group(0), which always represents the entire match. This group is not included in the value returned by groupCount.
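
The numbering rule can be checked with a short program. This is an illustrative sketch only; the class name GroupCountDemo and the helper countGroups are assumptions introduced here, not part of the original examples.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountDemo {
    // Hypothetical helper: number of capture groups in the given regex.
    static int countGroups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("((A)(B(C)))").matcher("ABC");
        System.out.println(m.groupCount()); // 4 (group 0 is not counted)
        if (m.matches()) {
            // Prints 0: ABC, 1: ABC, 2: A, 3: BC, 4: C (one per line)
            for (int i = 0; i <= m.groupCount(); i++) {
                System.out.println(i + ": " + m.group(i));
            }
        }
    }
}
```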

Examples

The following example shows how to find a run of digits in a given string:

RegexMatches.java file code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
 
public class RegexMatches
{
    public static void main( String args[] ){
 
      // Search the input string for the given pattern
      String line = "This order was placed for QT3000! OK?";
      String pattern = "(\\D*)(\\d+)(.*)";
 
      // Create a Pattern object
      Pattern r = Pattern.compile(pattern);
 
      // Now create a Matcher object
      Matcher m = r.matcher(line);
      if (m.find( )) {
         System.out.println("Found value: " + m.group(0) );
         System.out.println("Found value: " + m.group(1) );
         System.out.println("Found value: " + m.group(2) );
         System.out.println("Found value: " + m.group(3) ); 
      } else {
         System.out.println("NO MATCH");
      }
   }
}

Compiling and running the example above produces the following output:

Found value: This order was placed for QT3000! OK?
Found value: This order was placed for QT
Found value: 3000
Found value: ! OK?

Regular expression syntax

In other languages, \\ inside a regular expression means: insert an ordinary (literal) backslash here, and do not give it any special meaning.

In a Java string literal, \\ means: insert a single regular expression backslash here, so that the character after it has a special meaning.

So in other languages (such as Perl) a single backslash \ is enough to start an escape, while in Java source code you need two backslashes to get the same effect. Put simply, two backslashes \\ in a Java regular expression string stand for one \ in other languages. That is why the regular expression for a digit is written \\d, and an ordinary backslash is written \\\\.
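
The doubling rule can be seen by inspecting the string literals themselves. The sketch below is illustrative; the class name BackslashDemo and its two constants are assumptions introduced for this example.

```java
import java.util.regex.Pattern;

class BackslashDemo {
    // The Java literal "\\d" holds two characters, \ and d,
    // which the regex engine reads as "any digit".
    static final String DIGIT = "\\d";

    // The Java literal "\\\\" holds two backslash characters,
    // which the regex engine reads as one literal backslash.
    static final String BACKSLASH = "\\\\";

    public static void main(String[] args) {
        System.out.println(DIGIT.length());                   // 2
        System.out.println(Pattern.matches(DIGIT, "7"));      // true
        System.out.println(BACKSLASH.length());               // 2
        System.out.println(Pattern.matches(BACKSLASH, "\\")); // true: the input is one backslash
    }
}
```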

Character	Description
\	Marks the next character as a special character, a literal, a backreference, or an octal escape. For example, "n" matches the character "n", while "\n" matches a newline character. The sequence "\\" matches "\", and "\(" matches "(".
^	Matches the beginning of the input string. If the pattern is compiled with the Pattern.MULTILINE flag, ^ also matches the position just after a line terminator such as "\n" or "\r".
$	Matches the end of the input string. If the pattern is compiled with the Pattern.MULTILINE flag, $ also matches the position just before a line terminator such as "\n" or "\r".
*	Matches the preceding character or subexpression zero or more times. For example, "zo*" matches "z" and "zoo". * is equivalent to {0,}.
+	Matches the preceding character or subexpression one or more times. For example, "zo+" matches "zo" and "zoo" but does not match "z". + is equivalent to {1,}.
?	Matches the preceding character or subexpression zero or one time. For example, "do(es)?" matches "do" or "does". ? is equivalent to {0,1}.
{n}	n is a non-negative integer. Matches exactly n times. For example, "o{2}" does not match the "o" in "Bob" but matches the two "o"s in "food".
{n,}	n is a non-negative integer. Matches at least n times. For example, "o{2,}" does not match the "o" in "Bob" but matches all the "o"s in "foooood". "o{1,}" is equivalent to "o+". "o{0,}" is equivalent to "o*".
{n,m}	m and n are non-negative integers, where n <= m. Matches at least n times and at most m times. For example, "o{1,3}" matches the first three "o"s in "fooooood". "o{0,1}" is equivalent to "o?". Note: no spaces are allowed between the comma and the numbers.
?	When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the match is "non-greedy". A non-greedy quantifier matches as little of the searched string as possible, while the default greedy quantifier matches as much as possible. For example, in the string "oooo", "o+?" matches a single "o", while "o+" matches all the "o"s.
.	Matches any single character except a line terminator such as "\n" or "\r". To match any character including line terminators, compile the pattern with the Pattern.DOTALL flag or use a pattern such as "[\s\S]".
(pattern)	Matches pattern and captures the matched subexpression. In Java, the captured text is retrieved with Matcher.group(n) and can be referenced in replacement strings as $1, $2, and so on. To match a literal parenthesis character, use "\(" or "\)".
(?:pattern)	Matches pattern without capturing the matched subexpression, that is, it is a non-capturing match and the result is not stored for later use. This is useful for combining parts of a pattern with the "or" character (|). For example, "industr(?:y|ies)" is a more economical expression than "industry|industries".
(?=pattern)	Positive lookahead. Matches at a position where the text that follows matches pattern. It is a non-capturing match, that is, the result is not stored for later use. For example, "Windows (?=95|98|NT|2000)" matches "Windows " in "Windows 2000" but not in "Windows 3.1". Lookahead does not consume characters: after a match, the search for the next match starts immediately after the last matched character, not after the characters that made up the lookahead.
(?!pattern)	Negative lookahead. Matches at a position where the text that follows does not match pattern. It is a non-capturing match, that is, the result is not stored for later use. For example, "Windows (?!95|98|NT|2000)" matches "Windows " in "Windows 3.1" but not in "Windows 2000". Lookahead does not consume characters: after a match, the search for the next match starts immediately after the last matched character, not after the characters that made up the lookahead.
x|y	Matches x or y. For example, "z|food" matches "z" or "food". "(z|f)ood" matches "zood" or "food".
[xyz]	Character set. Matches any one of the enclosed characters. For example, "[abc]" matches the "a" in "plain".
[^xyz]	Negated character set. Matches any character that is not enclosed. For example, "[^abc]" matches "p", "l", "i" and "n" in "plain".
[a-z]	Character range. Matches any character within the specified range. For example, "[a-z]" matches any lowercase letter from "a" to "z".
[^a-z]	Negated character range. Matches any character that is not in the specified range. For example, "[^a-z]" matches any character that is not in the range from "a" to "z".
\b	Matches a word boundary, that is, the position between a word character and a non-word character such as a space, or the start or end of the input. For example, "er\b" matches the "er" in "never" but not the "er" in "verb".
\B	Matches a non-word boundary. "er\B" matches the "er" in "verb" but not the "er" in "never".
\cx	Matches the control character indicated by x. For example, \cM matches Control-M, the carriage return character. x should be an uppercase letter in the range A-Z; the handling of other values differs between regex engines.
\d	Matches a digit. Equivalent to [0-9].
\D	Matches a non-digit. Equivalent to [^0-9].
\f	Matches a form feed. Equivalent to \x0c and \cL.
\n	Matches a newline (line feed). Equivalent to \x0a and \cJ.
\r	Matches a carriage return. Equivalent to \x0d and \cM.
\s	Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \t\n\x0B\f\r].
\S	Matches any non-whitespace character. Equivalent to [^ \t\n\x0B\f\r].
\t	Matches a tab. Equivalent to \x09 and \cI.
\v	In Java 8 and later, matches any vertical whitespace character, equivalent to [\n\x0B\f\r\x85\u2028\u2029]. To match only the vertical tab, use \x0B or \cK.
\w	Matches any word character, including the underscore. Equivalent to "[A-Za-z0-9_]".
\W	Matches any non-word character. Equivalent to "[^A-Za-z0-9_]".
\xn	Matches n, where n is a hexadecimal escape code that must be exactly two digits long. For example, "\x41" matches "A". "\x041" is equivalent to "\x04" followed by "1". This allows ASCII codes to be used in regular expressions.
\num	Matches num, where num is a positive integer: a backreference to the text captured by group num. For example, "(.)\1" matches two consecutive identical characters.
\n	A backreference. In Java, \1 through \9 are always interpreted as backreferences to a capture group.
\nm	A backreference to group nm if at least nm capture groups precede it in the pattern. Otherwise, Java drops trailing digits until the number refers to an existing group or a single digit remains, so \nm is read as the backreference \n followed by the literal character m.
\0nml	An octal escape. In Java, octal escapes must start with a zero: \0n, \0nm or \0nml, where n, m and l are octal digits (0-7) and, in the three-digit form, n is in the range 0-3.
\un	Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
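
A few rows of the table can be exercised with a small helper. This is a hedged sketch: the class name SyntaxTableDemo and the helper firstMatch are invented for illustration, and the inputs are taken from the examples in the table.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SyntaxTableDemo {
    // Hypothetical helper: text of the first match of regex in input,
    // or null if there is no match.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("o+", "oooo"));      // oooo (greedy)
        System.out.println(firstMatch("o+?", "oooo"));     // o (non-greedy)
        System.out.println(firstMatch("(.)\\1", "hello")); // ll (backreference)
        System.out.println(firstMatch("er\\b", "never"));  // er
        System.out.println(firstMatch("er\\b", "verb"));   // null
        // Positive lookahead: "2000" is checked but not part of the match.
        System.out.println(firstMatch("Windows (?=95|98|NT|2000)", "Windows 2000"));
        // Negative lookahead: no match, because "2000" follows.
        System.out.println(firstMatch("Windows (?!95|98|NT|2000)", "Windows 2000")); // null
    }
}
```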

As required by the Java Language Specification, backslashes in string literals in Java source code are interpreted as Unicode escapes or other character escapes. You must therefore double each backslash in a string literal that represents a regular expression, to protect it from interpretation by the Java compiler. For example, when interpreted as a regular expression, the string literal "\b" matches a single backspace character, while "\\b" matches a word boundary. The string literal "\(hello\)" is illegal and causes a compile-time error; to match the string (hello) you must use the string literal "\\(hello\\)".
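
The difference between the two literals can be observed directly. The following sketch is illustrative only; the class name WordBoundaryLiteralDemo and the helper found are assumptions made for this example.

```java
import java.util.regex.Pattern;

class WordBoundaryLiteralDemo {
    // Hypothetical helper: true if regex matches anywhere in input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // "\\bcat\\b" reaches the regex engine as \bcat\b (word boundaries).
        System.out.println(found("\\bcat\\b", "a cat sat"));     // true
        // "\bcat" reaches the regex engine as a backspace character plus "cat".
        System.out.println(found("\bcat", "a cat sat"));         // false
        System.out.println(found("\bcat", "a \bcat"));           // true
        // "\\(hello\\)" matches the literal text (hello).
        System.out.println(found("\\(hello\\)", "say (hello)")); // true
    }
}
```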

Matcher methods

Index methods

The index methods provide index values that show exactly where in the input string a match was found:

No.	Method and description
1	public int start() Returns the start index of the previous match.
2	public int start(int group) Returns the start index of the subsequence captured by the given group during the previous match operation.
3	public int end() Returns the offset after the last character matched.
4	public int end(int group) Returns the offset after the last character of the subsequence captured by the given group during the previous match operation.
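
The group-specific variants can be sketched as follows. The class name GroupIndexDemo, the helper groupSpan, and the sample input are assumptions chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupIndexDemo {
    // Hypothetical helper: {start, end} of the given group in the
    // first match of regex in input, or null if there is no match.
    static int[] groupSpan(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? new int[] { m.start(group), m.end(group) } : null;
    }

    public static void main(String[] args) {
        String input = "order QT3000";
        String regex = "([A-Z]+)(\\d+)";
        int[] whole = groupSpan(regex, input, 0);  // group 0: the entire match
        int[] digits = groupSpan(regex, input, 2); // group 2: the digits
        System.out.println(whole[0] + "-" + whole[1]);             // 6-12
        System.out.println(digits[0] + "-" + digits[1]);           // 8-12
        System.out.println(input.substring(digits[0], digits[1])); // 3000
    }
}
```

Because end returns the offset after the last matched character, the pair can be passed straight to String.substring.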

Search methods

The search methods examine the input string and return a boolean indicating whether the pattern was found:

No.	Method and description
1	public boolean lookingAt() Attempts to match the input sequence, starting at the beginning of the region, against the pattern.
2	public boolean find() Attempts to find the next subsequence of the input sequence that matches the pattern.
3	public boolean find(int start) Resets this matcher and then attempts to find the next subsequence of the input sequence that matches the pattern, starting at the specified index.
4	public boolean matches() Attempts to match the entire region against the pattern.
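
As a small illustration of find(int start), the sketch below searches from different starting offsets. The class name FindFromDemo and the helper indexOfMatch are hypothetical names introduced here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindFromDemo {
    // Hypothetical helper: start index of the first match at or after
    // 'from', or -1 if there is none. 'from' must lie within the input.
    static int indexOfMatch(String regex, String input, int from) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find(from) ? m.start() : -1;
    }

    public static void main(String[] args) {
        String input = "cat cat cat";
        System.out.println(indexOfMatch("cat", input, 0)); // 0
        System.out.println(indexOfMatch("cat", input, 1)); // 4
        System.out.println(indexOfMatch("cat", input, 9)); // -1
    }
}
```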

Replacement methods

The replacement methods replace text in the input string:

No.	Method and description
1	public Matcher appendReplacement(StringBuffer sb, String replacement) Implements a non-terminal append-and-replace step.
2	public StringBuffer appendTail(StringBuffer sb) Implements a terminal append-and-replace step.
3	public String replaceAll(String replacement) Replaces every subsequence of the input sequence that matches the pattern with the given replacement string.
4	public String replaceFirst(String replacement) Replaces the first subsequence of the input sequence that matches the pattern with the given replacement string.
5	public static String quoteReplacement(String s) Returns a literal replacement string for the specified string. The returned string works as a literal replacement when passed to the appendReplacement method of the Matcher class.
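
The role of quoteReplacement becomes clearer when the replacement text contains a dollar sign, which normally introduces a group reference. This is a minimal sketch; the class name QuoteReplacementDemo, the helper replaceLiterally, and the sample strings are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteReplacementDemo {
    // Hypothetical helper: replaces every match of regex with the
    // replacement taken literally ($ and \ lose their special meaning).
    static String replaceLiterally(String regex, String input, String replacement) {
        return Pattern.compile(regex).matcher(input)
                      .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // In a replacement string, $1 refers to capture group 1 and \$ is a literal dollar sign.
        String viaGroup = Pattern.compile("(\\d+) USD").matcher("cost: 5 USD").replaceAll("\\$$1");
        System.out.println(viaGroup); // cost: $5

        // Quoted, "$1.50" is inserted as plain text instead of being read as a group reference.
        System.out.println(replaceLiterally("PRICE", "cost: PRICE", "$1.50")); // cost: $1.50
    }
}
```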

start and end methods

Here is an example that counts the occurrences of the word "cat" in the input string:

RegexMatches.java file code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
 
public class RegexMatches
{
    private static final String REGEX = "\\bcat\\b";
    private static final String INPUT =
                                    "cat cat cat cattie cat";
 
    public static void main( String args[] ){
       Pattern p = Pattern.compile(REGEX);
       Matcher m = p.matcher(INPUT); // Get matcher object
       int count = 0;
 
       while(m.find()) {
         count++;
         System.out.println("Match number "+count);
         System.out.println("start(): "+m.start());
         System.out.println("end(): "+m.end());
      }
   }
}

Compiling and running the example above produces the following output:

Match number 1
start(): 0
end(): 3
Match number 2
start(): 4
end(): 7
Match number 3
start(): 8
end(): 11
Match number 4
start(): 19
end(): 22

You can see that this example uses word boundaries to ensure that the letters "c", "a", "t" are not merely a substring of a longer word. It also reports where in the input string each match occurred.

The start method returns the start index of the previous match (or, when given a group number, of the subsequence captured by that group), and the end method returns the index of the last matched character plus one.

matches and lookingAt methods

Both the matches and lookingAt methods attempt to match an input sequence against a pattern. The difference is that matches requires the entire input sequence to match, while lookingAt does not.

Although lookingAt does not need to match the whole input, it must match starting from the first character.

Both methods always start at the beginning of the input string.

The following example illustrates the difference:

RegexMatches.java file code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
 
public class RegexMatches
{
    private static final String REGEX = "foo";
    private static final String INPUT = "fooooooooooooooooo";
    private static final String INPUT2 = "ooooofoooooooooooo";
    private static Pattern pattern;
    private static Matcher matcher;
    private static Matcher matcher2;
 
    public static void main( String args[] ){
       pattern = Pattern.compile(REGEX);
       matcher = pattern.matcher(INPUT);
       matcher2 = pattern.matcher(INPUT2);
 
       System.out.println("Current REGEX is: "+REGEX);
       System.out.println("Current INPUT is: "+INPUT);
       System.out.println("Current INPUT2 is: "+INPUT2);
 
 
       System.out.println("lookingAt(): "+matcher.lookingAt());
       System.out.println("matches(): "+matcher.matches());
       System.out.println("lookingAt(): "+matcher2.lookingAt());
   }
}

Compiling and running the example above produces the following output:

Current REGEX is: foo
Current INPUT is: fooooooooooooooooo
Current INPUT2 is: ooooofoooooooooooo
lookingAt(): true
matches(): false
lookingAt(): false

replaceFirst and replaceAll methods

The replaceFirst and replaceAll methods are used to replace the text that matches the regular expression. The difference is that replaceFirst replaces the first match and replaceAll replaces all matches.

The following example explains this function:

RegexMatches.java file code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
 
public class RegexMatches
{
    private static String REGEX = "dog";
    private static String INPUT = "The dog says meow. " +
                                    "All dogs say meow.";
    private static String REPLACE = "cat";
 
    public static void main(String[] args) {
       Pattern p = Pattern.compile(REGEX);
       // get a matcher object
       Matcher m = p.matcher(INPUT); 
       INPUT = m.replaceAll(REPLACE);
       System.out.println(INPUT);
   }
}

Compiling and running the example above produces the following output:

The cat says meow. All cats say meow.
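
For comparison, here is a sketch of the same replacement done with replaceFirst, which changes only the first match. The class name ReplaceFirstDemo and the helper replaceFirstDog are hypothetical names used for illustration.

```java
import java.util.regex.Pattern;

class ReplaceFirstDemo {
    // Hypothetical helper: replaces only the first "dog" with "cat".
    static String replaceFirstDog(String input) {
        return Pattern.compile("dog").matcher(input).replaceFirst("cat");
    }

    public static void main(String[] args) {
        String input = "The dog says meow. All dogs say meow.";
        // Prints: The cat says meow. All dogs say meow.
        System.out.println(replaceFirstDog(input));
    }
}
```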

appendReplacement and appendTail methods

The Matcher class also provides the appendReplacement and appendTail methods for text replacement.

The following example shows how they work together:

RegexMatches.java file code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;
 
public class RegexMatches
{
   private static String REGEX = "a*b";
   private static String INPUT = "aabfooaabfooabfoobkkk";
   private static String REPLACE = "-";
   public static void main(String[] args) {
      Pattern p = Pattern.compile(REGEX);
       // Get the matcher object     
      Matcher m = p.matcher(INPUT);
      StringBuffer sb = new StringBuffer();
      while(m.find()){
         m.appendReplacement(sb,REPLACE);
      }
      m.appendTail(sb);
      System.out.println(sb.toString());
   }
}

Compiling and running the example above produces the following output:

-foo-foo-foo-kkk
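
Because each replacement is appended inside the loop, it can also be computed from the text of the current match. The sketch below doubles every number in a string; the class name AppendReplacementDemo, the helper doubleNumbers, and the sample input are assumptions made for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendReplacementDemo {
    // Hypothetical helper: doubles every run of digits found in the input.
    static String doubleNumbers(String input) {
        Matcher m = Pattern.compile("\\d+").matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int doubled = Integer.parseInt(m.group()) * 2;
            // Appends the text before the match, then the computed replacement.
            m.appendReplacement(sb, Integer.toString(doubled));
        }
        // Appends whatever follows the last match.
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleNumbers("3 apples and 12 pears")); // 6 apples and 24 pears
    }
}
```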

PatternSyntaxException methods

PatternSyntaxException is an unchecked exception that indicates a syntax error in a regular expression pattern.

The PatternSyntaxException class provides the following methods to help determine what went wrong:

No.	Method and description
1	public String getDescription() Returns the description of the error.
2	public int getIndex() Returns the error index.
3	public String getPattern() Returns the erroneous regular expression pattern.
4	public String getMessage() Returns a multi-line string containing the description of the syntax error and its index, the erroneous regular expression pattern, and a visual indication of the error index within the pattern.
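
A minimal sketch of catching the exception is shown below. The class name SyntaxErrorDemo, the helper isValid, and the deliberately broken pattern "(unclosed" are assumptions chosen for illustration; the exact description text and index depend on the Java runtime, so no output is shown.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SyntaxErrorDemo {
    // Hypothetical helper: true if the regex compiles without a syntax error.
    static boolean isValid(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        try {
            Pattern.compile("(unclosed"); // missing closing parenthesis
        } catch (PatternSyntaxException e) {
            System.out.println("Description: " + e.getDescription());
            System.out.println("Index: " + e.getIndex());
            System.out.println("Pattern: " + e.getPattern());
        }
        System.out.println(isValid("a*b")); // true
    }
}
```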
