Simple Hashsets and regular expression problem !
Hey people,
I am trying to write a program that gathers URLs from a web page using HTML Parser (the org.htmlparser library) and then filters them.
The part that gathers the URLs is done; they are stored in a HashSet.
But I find it hard to take the links that are in the HashSet and filter them with a regular expression! :(
I just want to take the hyperlinks and filter them!
I am sure this is a simple problem, but Java is not my speciality, so I am slow when it comes to this.
Here is the program so far!
Code :
import org.htmlparser.util.*;
import java.util.Iterator;
import org.htmlparser.*;
import org.htmlparser.tags.*;
import org.htmlparser.filters.*;
import java.util.HashSet;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
class Hash
{
static Matcher m;
static String st;
public static HashSet<String> visit (String s)
{
HashSet <String> s1 = new HashSet();
try{
Parser parser1 = new Parser (s);
NodeList list1 = parser1.parse (new LinkStringFilter("http:"));
for (int i=0;i<list1.size();i++)
{
String st = ((LinkTag)(list1.elementAt(i))).extractLink();
s1.add(st);
Iterator iter = s1.iterator();
while (iter.hasNext()){
String str = (String)iter.next() ;
Pattern pattern = Pattern.compile("html");
m = pattern.matcher(str);
}
if (m.find()){
System.out.println(m);
}
}
return s1;
}
catch (Exception e)
{
return new HashSet();
}
}
}
I tried to add a regular expression check, but it prints out all of the links instead of only the links with "html" in them.
I hope someone can help me! :(
Re: Simple Hashsets and regular expression problem !
I'm not sure I understand what you are asking. Are you just trying to get the links that end in ".html"? Does your code print out every link? Your code is quite difficult to read with the non-preserved tabbing and excess spaces, and I would recommend posting an SSCCE with a hardcoded example that demonstrates the issue you are having.
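To illustrate why the "end in .html" versus "contain html" distinction matters, here is a rough sketch with hardcoded links. The class name HtmlMatchDemo, the two helper methods, and the example.com URLs are all made up for illustration; only Pattern and Matcher.find() come from the standard library.

```java
import java.util.regex.Pattern;

class HtmlMatchDemo {
    // Matches "html" anywhere in the link.
    static final Pattern CONTAINS_HTML = Pattern.compile("html");
    // Matches only links that end in ".html" (the dot is escaped, $ anchors the end).
    static final Pattern ENDS_WITH_HTML = Pattern.compile("\\.html$");

    static boolean containsHtml(String link) {
        return CONTAINS_HTML.matcher(link).find();
    }

    static boolean endsWithHtml(String link) {
        return ENDS_WITH_HTML.matcher(link).find();
    }

    public static void main(String[] args) {
        // Hypothetical links, hardcoded so the example needs no network access.
        String[] links = {
            "http://example.com/index.html",
            "http://example.com/html/logo.png",
            "http://example.com/about"
        };
        for (String link : links) {
            System.out.println(link
                    + " contains=" + containsHtml(link)
                    + " endsWith=" + endsWithHtml(link));
        }
    }
}
```

The second link contains "html" in its path but does not end in ".html", so the two patterns disagree on it. Which pattern is right depends on what you actually want to keep.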
Re: Simple Hashsets and regular expression problem !
Yes, I want to filter the links that contain "html".
Yes, it just prints out every link!
Okay, I don't want to confuse you, so I will show the program without the regular expression and an example of how it works.
Quote:
import org.htmlparser.util.*;
import java.util.Iterator;
import org.htmlparser.*;
import org.htmlparser.tags.*;
import org.htmlparser.filters.*;
import java.util.HashSet;
import java.util.regex.Pattern;
import java.util.regex.Matcher;

class Hash
{
    static Matcher m;
    static String st;

    // Collects every link on the page at URL s that contains "http:".
    public static HashSet<String> visit(String s)
    {
        HashSet<String> s1 = new HashSet<>();
        try {
            Parser parser1 = new Parser(s);
            // LinkStringFilter accepts only link tags whose link contains "http:".
            NodeList list1 = parser1.parse(new LinkStringFilter("http:"));
            for (int i = 0; i < list1.size(); i++)
            {
                String st = ((LinkTag) list1.elementAt(i)).extractLink();
                s1.add(st);
            }
            return s1;
        }
        catch (Exception e)
        {
            // On any failure, return an empty set instead of a partial result.
            return new HashSet<>();
        }
    }
}
This is the output of the program:
So all I want to do is add a regular expression step to the code, so that it can filter all the links that the program has collected.
That is what you saw in my failed attempt above, where I tried to use a regex but all it did was print all the links instead of filtering them.
I hope this makes more sense to you now :) !
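One possible approach, as a minimal sketch: keep the collecting step as it is and filter the finished set in a separate step. In the earlier attempt, the m.find() check sits outside the while loop, so only the last link visited by the iterator is ever tested, and println(m) prints the Matcher object rather than the link. The class name LinkFilter, the method name filter, and the example.com URLs below are hypothetical; the sketch assumes you want every link that contains the pattern anywhere.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

class LinkFilter {
    // Returns a new set holding only the links in which the regex is found.
    static Set<String> filter(Set<String> links, String regex) {
        Pattern pattern = Pattern.compile(regex); // compile once, outside the loop
        Set<String> matched = new HashSet<>();
        for (String link : links) {
            // Test each link inside the loop and keep the link itself.
            if (pattern.matcher(link).find()) {
                matched.add(link);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the set returned by Hash.visit(...).
        Set<String> links = new HashSet<>();
        links.add("http://example.com/index.html");
        links.add("http://example.com/about");
        links.add("http://example.com/docs/page.html");

        for (String link : filter(links, "html")) {
            System.out.println(link);
        }
    }
}
```

With the real program, the call would look something like LinkFilter.filter(Hash.visit(url), "html"), where url is the page you are parsing. Filtering after collection keeps the parsing code untouched and makes it easy to swap in a different pattern later.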