Web Crawling & Java Regular Expressions
Hi. =:)
I am currently using a search solution called LucidWorks Enterprise which has an integrated web crawler and uses Java Regular Expressions.
Description
Quote:
To limit crawls to the URL specified in the data source definition, you could enter the following in the Included URLs field:
Code :
http://www\.lucidimagination\.com/.*
To limit crawls to a path that is relative to the URL specified, you could enter the following in the Included URLs field:
To limit a crawl of Wikipedia to topics only (not other pages such as history or info), you could enter the following in the Included URLs field:
Code :
http://en\.wikipedia\.org/wiki/[^/?]+
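To check my understanding of that last pattern, here is a small Java sketch. It assumes the crawler requires the whole URL to match the expression (that is, `matches()` rather than `find()`), which the description does not confirm. The class name and the test URLs are my own.

```java
import java.util.regex.Pattern;

class WikiIncludeDemo {
    // Pattern quoted from the description above; backslashes are doubled for a Java string literal.
    static final Pattern TOPIC =
        Pattern.compile("http://en\\.wikipedia\\.org/wiki/[^/?]+");

    // Assumption: the whole URL has to match, so matches() is used instead of find().
    static boolean included(String url) {
        return TOPIC.matcher(url).matches();
    }

    public static void main(String[] args) {
        // A plain topic page: no further slash or query string after /wiki/.
        System.out.println(included("http://en.wikipedia.org/wiki/Java"));                // true
        // A query string is rejected because [^/?]+ cannot consume '?'.
        System.out.println(included("http://en.wikipedia.org/wiki/Java?action=history")); // false
        // A further path segment is rejected because [^/?]+ cannot consume '/'.
        System.out.println(included("http://en.wikipedia.org/wiki/Talk/Java"));           // false
    }
}
```

The character class `[^/?]+` is what keeps out anything with an extra path segment or a query string.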
With that in mind, I want to crawl a large web site but only index URLs that match a certain pattern.
URL Examples
Code :
http://www.example.com/insanely-twisted-shadow-planet/61-27375/reviews/
http://www.example.com/catherine/61-32367/reviews/
Also, these URLs are listed under the category:
Code :
http://www.example.com/reviews/
Within this category, the reviews are spread across additional listing pages:
Code :
http://www.example.com/reviews/?page=2
http://www.example.com/reviews/?page=3
So I set the crawler to index:
With the included path:
I get about 50 reviews indexed, so it appears the crawler is only indexing pages found on...
Code :
http://www.example.com/reviews/
...with /reviews/ in the URL. If I go ahead and add the additional path:
...I get all reviews indexed. So far so good. The problem is that pages I do not want in the index are indexed as well:
Code :
http://www.example.com/reviews/?page=2
http://www.example.com/reviews/?page=3
http://www.example.com/reviews/?page=4
I do want the crawler to visit:
Code :
http://www.example.com/reviews/?page=2
http://www.example.com/reviews/?page=3
http://www.example.com/reviews/?page=4
...and just index reviews found on each of those pages, but not index the actual index pages leading to the reviews:
Code :
http://www.example.com/reviews/?page=2
...because it makes search results messy and inflates database size.
I realise this isn't a LucidWorks web site, but my question is about Java Regular Expressions: is there an expression that I can use that will tell the crawler to visit all pages in:
...but only index a URL if it has /reviews/ at the end, while not indexing any page that also has ?page=x in the URL?
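To make the behaviour I am after concrete, here is a rough Java sketch that splits it into two rules: one for pages the crawler may visit and one for pages it should index. The class, the method names and both patterns are my own guesses, and I don't know whether LucidWorks lets you configure "visit" and "index" separately, so this only shows the matching logic.

```java
import java.util.regex.Pattern;

class ReviewUrlRules {
    // Hypothetical "visit" rule: the listing page, with or without ?page=N.
    static final Pattern LISTING =
        Pattern.compile("http://www\\.example\\.com/reviews/(\\?page=\\d+)?");

    // Hypothetical "index" rule: /<title>/<id>/reviews/ with no query string.
    static final Pattern REVIEW =
        Pattern.compile("http://www\\.example\\.com/[^/?]+/[^/?]+/reviews/");

    // Listing pages are visited so that their links are discovered; reviews are visited too.
    static boolean shouldVisit(String url) {
        return LISTING.matcher(url).matches() || REVIEW.matcher(url).matches();
    }

    // Only the individual review pages should end up in the index.
    static boolean shouldIndex(String url) {
        return REVIEW.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.example.com/catherine/61-32367/reviews/",
            "http://www.example.com/reviews/",
            "http://www.example.com/reviews/?page=2"
        };
        for (String url : urls) {
            System.out.printf("%s visit=%b index=%b%n", url, shouldVisit(url), shouldIndex(url));
        }
    }
}
```

With these patterns the review URL is both visited and indexed, while /reviews/ and /reviews/?page=2 are visited only. If the crawler has just a single Included URLs field that controls both, the second pattern alone would keep the listing pages out of the index, but presumably also stop the crawler from following them.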
If anyone has a better idea of how I can achieve this, please let me know. I would really appreciate it.
Thanks for reading this long rambling post. ;))
Sorry if this is the wrong forum for this type of question.
Re: Web Crawling & Java Regular Expressions
Nvm, I read the question wrong. Will edit this again in a moment.
Sigh, I can't get my expression to work. I am sad now :mad: