Web Crawling & Java Regular Expressions
I am currently using a search solution called LucidWorks Enterprise which has an integrated web crawler and uses Java Regular Expressions.
With that in mind, I want to crawl a large web site but only index URLs that match a certain pattern.
To limit crawls to the URL specified in the data source definition, you could enter the following in the Included URLs field:
To limit crawls to a path that is relative to the URL specified, you could enter the following in the Included URLs field:
To limit a crawl of Wikipedia to topics only (not other pages such as history or info), you could enter the following in the Included URLs field:
Also, these URLs are in the category:
From this category reviews are further broken down into additional pages:
So I set the crawler to index:
With the included path:
I get about 50 reviews indexed so it appears the crawler is only indexing pages found on...
...with /reviews/ in the URL. If I go ahead and add the additional path:
...I get all reviews indexed. So far so good. The problem is that the crawler also indexes pages I do not want indexed...
I do want the crawler to visit:
...and just index reviews found on each of those pages, but not index the actual index pages leading to the reviews:
...because it makes search results messy and inflates database size.
I realise this isn't a LucidWorks web site, but my question is really about Java regular expressions: is there an expression I can use that will tell the crawler to visit all pages in:
...but only index a URL if it ends with /review/, while not indexing any page that also has ?page=x in the URL?
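To make the question concrete, here is a rough sketch of the kind of expression I have in mind, tested in plain Java rather than in the crawler itself. The example.com URLs, the class name and the method name are all made up for illustration, and I am assuming the crawler matches the pattern against the whole URL. It also only covers the "which URLs get indexed" half of the problem; whether the crawler will still follow links on pages that fail the pattern is a separate question that depends on the crawler.

```java
import java.util.regex.Pattern;

class ReviewUrlFilter {
    // Hypothetical pattern, not taken from any LucidWorks documentation:
    //   (?!.*[?&]page=)  negative lookahead: reject URLs with a "page=" query parameter
    //   .*/review/$      the URL must end with "/review/"
    private static final Pattern INDEXABLE =
        Pattern.compile("^(?!.*[?&]page=).*/review/$");

    // Returns true only when the whole URL satisfies the pattern.
    static boolean shouldIndex(String url) {
        return INDEXABLE.matcher(url).matches();
    }

    public static void main(String[] args) {
        // Made-up URLs, purely to exercise the pattern.
        String[] urls = {
            "http://www.example.com/reviews/some-product/review/",
            "http://www.example.com/reviews/",
            "http://www.example.com/reviews/?page=2",
            "http://www.example.com/reviews/?page=2&ref=/review/"
        };
        for (String url : urls) {
            System.out.println(shouldIndex(url) + "  " + url);
        }
    }
}
```

Only the first of the four sample URLs passes. Strictly speaking, a URL that ends with /review/ will rarely carry a ?page=x query string anyway, so the lookahead is mostly a safety net.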
If anyone has a better idea of how I can achieve this, please let me know. I would really appreciate it.
Thanks for reading this long rambling post. ;))
Sorry if this is the wrong forum for this type of question.
Re: Web Crawling & Java Regular Expressions
Nvm, I read the question wrong. Will edit this again in a moment.
Sigh, I can't get my expression to work. I am sad now :mad: