finding unescaped XML characters
Hi,
I want to find the unescaped characters in some XML input so that my code can replace them with their escape sequences.
For instance, my XML input may contain both escaped and unescaped & characters. What I want to achieve is to find only the unescaped & characters and replace them with '&amp;', i.e. their escape sequence.
I have thought of using pattern matching in Java to solve this problem, and have found that the pattern &[a-z]+; helps find all the escape sequences in the XML input.
However, I am not able to create a pattern for finding the unescaped & characters in the XML input.
Please advise.
Regards,
Diptee
Re: finding unescaped XML characters
Sorry, the pattern I used for matching the escape sequences is &[\\w]+;
Re: finding unescaped XML characters
Quote:
Originally Posted by diptee
Sorry, the pattern I used for matching the escape sequences is &[\\w]+;
http://www.javaprogrammingforums.com...explained.html
Please show me the code you have and an example xml input :)
Re: finding unescaped XML characters
The code I have used is below.
Code java:
/*
 * Created on 10/11/2008
 */
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SearchAndReplaceUnescapedAmpersand
{
    public static void main(String[] args)
    {
        String str = "&li; &";
        // This pattern finds all the escape sequences. I need to get the
        // pattern right so that it finds the unescaped & chars instead.
        Pattern notEscSeq = Pattern.compile("&[\\w]+;", Pattern.CASE_INSENSITIVE);
        Matcher matcher = notEscSeq.matcher(str);
        while (matcher.find()) {
            // System.out is used here because System.console() can return null (e.g. inside an IDE).
            System.out.printf("I found the text \"%s\" starting at index %d and ending at index %d.%n",
                    matcher.group(), matcher.start(), matcher.end());
        }
        System.out.println("Before: " + str);
        //str = matcher.replaceAll("& ");
        System.out.println("After: " + str);
    }
}
The XML input containing the unescaped & char would look like this:
Code :
<Q-ENV:Attachment>
<Q-ENV:Content-Type>application/zip</Q-ENV:Content-Type>
<Q-ENV:Message-ID>urn:x-commerceone:package:com:commerceone:INV-1020464.01 & .02-attachment.zip</Q-ENV:Message-ID>
<Q-ENV:Encoding>base64</Q-ENV:Encoding>
The XML input might also include escaped & chars, but not in the location shown above.
regards,
Diptee
Re: finding unescaped XML characters
I've edited your post to place the code in formatting tags. In the future, it helps to flank code with [highlight=java]Code goes here[/highlight] or [code]Code goes here[/code] so that smilies in it are not parsed into emoticons. You might consider using negations to exclude already escaped values. This won't work, but may get you started: "&[^;]{2,6}[^;]"
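To see roughly why that pattern falls short, here is a small sketch that runs it against one escaped and one unescaped ampersand. The class name NegatedClassDemo and the helper firstMatch are made up for this illustration; only the pattern string comes from the post above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegatedClassDemo {
    // The pattern suggested above, unchanged.
    static final Pattern SUGGESTED = Pattern.compile("&[^;]{2,6}[^;]");

    // Hypothetical helper: returns the first match in the input, or null if there is none.
    static String firstMatch(String input) {
        Matcher m = SUGGESTED.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // An already escaped ampersand still matches (as "&amp"),
        // because the character class only excludes ';' itself.
        System.out.println(firstMatch("&amp;"));

        // An unescaped ampersand close to the end of the input is missed,
        // because the pattern needs at least three characters after the '&'.
        System.out.println(firstMatch("a & b"));
    }
}
```

So the negated character class on its own both over-matches and under-matches; it describes "some characters that are not ';'" rather than "not an escape sequence".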
Re: finding unescaped XML characters
Hi, I am fairly new to Java patterns, so can you please explain what would be achieved with the pattern you suggested, &[^;]{2,6}[^;]?
All I understand is that it will find the & characters which are not followed by a ;
Please correct me if I am wrong here.
I tried to use the pattern you provided, but it didn't help :(.
Thanks for letting me know the correct way to post code :)
Re: finding unescaped XML characters
Quote:
Originally Posted by diptee
Hi, I am fairly new to Java patterns, so can you please explain what would be achieved with the pattern you suggested, &[^;]{2,6}[^;]?
All I understand is that it will find the & characters which are not followed by a ;
Please correct me if I am wrong here.
I tried to use the pattern you provided, but it didn't help :(.
Told you it wouldn't work :D, but I thought it might lead you in the right direction. Basically, I understood your question to be that you want to replace every bare '&' with '&amp;', but leave the other '&escapetext;' type of escaped values alone. Just doing a replace on every & obviously won't work, because it would also alter the values that are already escaped. The regular expression I posted was along the lines of looking for amps that are not followed by the typical escape pattern (a name and then a ';'). If it doesn't make much sense, I suggest looking through Regular Expressions for great info on regular expressions.
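For what it's worth, one way to say "an & that is not followed by an escape pattern" directly is a negative lookahead, (?!...). The sketch below is only an illustration under some assumptions: the class name EscapeBareAmpersands and the method escape are made up for this example, and it treats anything shaped like &name;, &#38; or &#x26; as already escaped without checking that the entity is actually defined.

```java
import java.util.regex.Pattern;

public class EscapeBareAmpersands {
    // Matches an '&' that does NOT start something shaped like an XML reference:
    // a named entity (&name;), a decimal character reference (&#38;),
    // or a hexadecimal character reference (&#x26;).
    // The lookahead consumes nothing, so only the '&' itself is replaced.
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:[A-Za-z_][\\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    // Hypothetical helper: returns the input with every bare '&' replaced by "&amp;".
    static String escape(String xml) {
        return BARE_AMP.matcher(xml).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        String in = "INV-1020464.01 & .02-attachment.zip &lt; &amp; &#38;";
        System.out.println("Before: " + in);
        System.out.println("After:  " + escape(in));
    }
}
```

With the sample Message-ID text, only the lone & before ".02" should be rewritten, while &lt;, &amp; and &#38; are left alone. Note that this is a shape check only: text such as "AT&T;" would be taken for an entity and left untouched, and the sketch does not treat CDATA sections or comments specially.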
Re: finding unescaped XML characters
Yeah, I started with negations in my regex but couldn't really achieve what I needed. I will go through the link you provided and get back if required.
Thanks