Text Processing with Regular Expressions explained
A string can be formatted or parsed based on a pattern that is searched for in the string. To format or parse data (text or data types), you want to be able to tell your code: pick up that data item from there, and do this to it. You do this with a search based on pattern matching. The search pattern is described by what is called a Regular Expression (regex for short). In other words, Regular Expressions can be used to process text. A piece of text (a sequence of characters) that corresponds to the search pattern is called a match.

For example, you may want to say: find a dot (.) in a string and split the string around each dot you find. If there are two dots in the string, it will be split into three pieces. Or you may want to validate user input such as an email address.

Java provides support for Regular Expressions (to define search patterns) and for matching the patterns (to the text in the string) by providing the following elements:
The java.util.regex.Pattern class
The regular expression constructs
The java.util.regex.Matcher class

The simplest form of a regular expression is a search for a string literal such as "Hello". You may want to look for certain words in a user's input. However, you can also build very sophisticated expressions using what are called Regular Expression Constructs. Consider the following expression:
[A-Za-z0-9]
Any character in the range A through Z, a through z, or 0 through 9 (any letter or digit) will match this pattern. Any other character will match the pattern described by the following expression:
[^A-Za-z0-9]
Notice the ^ character. Inside the brackets, ^ negates the expression.

Here are some important points about constructs:
Use backslash (\) as an escape character. For example, \. matches a period, whereas \\ matches a backslash. In a Java string literal the backslash itself must be doubled, so the regex \. is written "\\.".
Use | for logical OR, ^ to match the beginning of a line, and $ to match the end of a line.
Remember that ^ inside [ ] means negation.
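The two ideas above (splitting a string around each dot, and using a negated character class to find anything that is not a letter or digit) can be sketched as follows. This is only an illustration: the class name RegexBasics and the helper methods splitOnDots and hasNonAlphanumeric are hypothetical names chosen for this example, not part of the Java API.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are assumptions made for this sketch.
class RegexBasics {
    // Splits the input around each literal dot. The dot is escaped because
    // an unescaped dot is a regex construct that matches any character.
    static String[] splitOnDots(String input) {
        return input.split("\\.");
    }

    // Returns true if the input contains at least one character that is
    // neither a letter nor a digit (negated character class).
    static boolean hasNonAlphanumeric(String input) {
        Matcher m = Pattern.compile("[^A-Za-z0-9]").matcher(input);
        return m.find();
    }

    public static void main(String[] args) {
        // Two dots, so the string is split into three pieces.
        System.out.println(Arrays.toString(splitOnDots("www.example.com")));
        System.out.println(hasNonAlphanumeric("Hello123"));     // false
        System.out.println(hasNonAlphanumeric("Hello, world")); // true
    }
}
```

String.split takes a regular expression, which is why the dot has to be written as "\\." here.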
A character class is a set of character alternatives enclosed in brackets; for example, [abc] means a, b, or c. The character - denotes a range, and the character ^ inside [ ] denotes a negation (that is, all the characters except those specified). For example, the character class [^a-zA-Z] means all the characters that are not in the ranges a through z and A through Z. There are several character classes that are already defined for you; for example, \d means any digit.

Character Classes (brackets used as a grouping mechanism)
[ABC]        Any of the characters A, B, or C
[^ABC]       Any character except A, B, or C (negation)
[a-zA-Z]     a through z or A through Z (range)
[...&&...]   Intersection of two sets (AND)

Predefined Character Classes
. (dot)      Any character if the DOTALL flag is set, else any character except the line terminators
\d           A digit: [0-9]
\D           A non-digit: [^0-9]
\s           A whitespace character
\S           A non-whitespace character
\w           A word character: [a-zA-Z_0-9]
\W           A non-word character: [^\w]

If you want a regular expression, say X, to repeat a number of times, you can say so in the pattern by placing a quantifier immediately after X. For example, X+ means one or more X.

Greedy Quantifiers (X represents a regular expression)
X?           X, zero or one time
X*           X, zero or more times
X+           X, one or more times
X{n}         X, exactly n times

Some Other Constructs
^            The beginning of a line
XY           X followed by Y
X|Y          Either X or Y
(?:X)        X, as a non-capturing group

The following is a typical process for pattern matching:
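Java Code
// Step 1: compile the regular expression into a Pattern.
Pattern p = Pattern.compile("[^a-zA-Z0-9]");
Java Code
// Step 2: create a Matcher for the text to be searched.
Matcher m = p.matcher("myemail@emailaddress.com");
Java Code
// Step 3: search the text; true here, because @ is not a letter or digit.
boolean b = m.find();

The following is a code example to validate email addresses:
Java Code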
import java.util.regex.*;

public class EmailValidator
{
    public static void main(String[] args)
    {
        String email = "";
        if (args.length < 1)
        {
            System.out.println("Command syntax: java EmailValidator <emailAddress>");
            System.exit(0);
        }
        else
        {
            email = args[0];
        }

        //Look for email addresses starting with
        //invalid symbols: dots or @ signs.
        Pattern p = Pattern.compile("^\\.+|^\\@+");
        Matcher m = p.matcher(email);
        if (m.find())
        {
            System.err.println("Invalid email address: starts with a dot or an @ sign.");
            System.exit(0);
        }

        //Look for email addresses that start with www.
        p = Pattern.compile("^www\\.");
        m = p.matcher(email);
        if (m.find())
        {
            System.out.println("Invalid email address: starts with www.");
            System.exit(0);
        }

        //Look for any character other than a letter, digit, @, dot, or underscore.
        p = Pattern.compile("[^A-Za-z0-9\\@\\.\\_]");
        m = p.matcher(email);
        if (m.find())
        {
            System.out.println("Invalid email address: contains invalid characters");
        }
        else
        {
            System.out.println(args[0] + " is a valid email address.");
        }
    }
}
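This code takes the user input as a command-line argument, so you need to pass the email address as a string when you run it. Try a few combinations of email addresses and see what you get. The program rejects addresses that start with a dot, an @ sign, or www., and addresses that contain any character other than letters, digits, @, dot, and underscore; everything else is reported as valid. Note that it does not check the overall structure of the address, so a string with no @ sign at all, such as abc, is still reported as valid.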
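A check of the overall shape of the address can be sketched with a single pattern built from the character classes and quantifiers listed earlier. This is only a minimal sketch under a simplified assumption about what an address looks like (word characters and dots, one @ sign, then at least two dot-separated labels); it is not a complete email validator, and the class name StricterEmailCheck, the method looksLikeEmail, and the pattern itself are hypothetical and not part of the original example.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; the pattern is a simplified assumption, not a full
// description of valid email addresses.
class StricterEmailCheck {
    //   \w                     first character is a word character
    //   [\w.]*                 then zero or more word characters or dots
    //   @                      a single @ sign
    //   [A-Za-z0-9]+           first domain label
    //   (?:\.[A-Za-z0-9]+)+    one or more further labels, each after a dot
    private static final Pattern EMAIL =
        Pattern.compile("\\w[\\w.]*@[A-Za-z0-9]+(?:\\.[A-Za-z0-9]+)+");

    static boolean looksLikeEmail(String input) {
        // matches() requires the whole input to match the pattern,
        // whereas find() looks for a match anywhere in the input.
        return EMAIL.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail("myemail@emailaddress.com")); // true
        System.out.println(looksLikeEmail("abc"));                      // false
        System.out.println(looksLikeEmail(".user@example.com"));        // false
        System.out.println(looksLikeEmail("user@@example.com"));        // false
    }
}
```

Using matches() instead of find() means the ^ and $ anchors are not needed, because the entire input has to fit the pattern.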