

Thread: Web Crawling & Java Regular Expressions

  1. #1
    Junior Member
    Join Date
    Nov 2011
    Posts
    1
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Web Crawling & Java Regular Expressions

    Hi.

    I am currently using a search solution called LucidWorks Enterprise which has an integrated web crawler and uses Java Regular Expressions.

    Description

    To limit crawls to the URL specified in the data source definition, you could enter the following in the Included URLs field:

    http://www\.lucidimagination\.com/.*

    To limit crawls to a path that is relative to the URL specified, you could enter the following in the Included URLs field:

    .*/Relative-path/

    To limit a crawl of Wikipedia to topics only (not other pages such as history or info), you could enter the following in the Included URLs field:

    http://en\.wikipedia\.org/wiki/[^/?]+

    With that in mind, I want to crawl a large web site but only index URLs that match a certain pattern.

    URL Examples

    http://www.example.com/insanely-twisted-shadow-planet/61-27375/reviews/
    http://www.example.com/catherine/61-32367/reviews/

    Also, these URLs are in the category:

    http://www.example.com/reviews/

    This category is further broken down into additional listing pages:

    http://www.example.com/reviews/?page=2
    http://www.example.com/reviews/?page=3

    So I set the crawler to index:

    http://www.example.com/

    With the included path:

    .*/reviews/

    I get about 50 reviews indexed, so it appears the crawler is only indexing pages found on...

    http://www.example.com/reviews/

    ...with /reviews/ in the URL. If I go ahead and add the additional path:

    .*/reviews/*

    ...I get all reviews indexed. So far so good. The problem is that pages I do not want indexed get indexed as well...

    http://www.example.com/reviews/?page=2
    http://www.example.com/reviews/?page=3
    http://www.example.com/reviews/?page=4
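
    One detail of the pattern above is worth checking: in Java regex syntax a trailing /* means "zero or more slashes", not "a slash followed by anything". The sketch below assumes nothing about how LucidWorks applies the pattern internally. It only shows that .*/reviews/* rejects a ?page=2 URL when the whole string must match (matches()) and accepts it when a substring match is enough (find()). The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class ReviewPatternDemo {
    // The pattern from the post. The trailing "/*" is "zero or more slashes".
    static final Pattern LOOSE = Pattern.compile(".*/reviews/*");

    // Whole-string match: the entire URL must fit the pattern.
    static boolean fullMatch(String url) {
        return LOOSE.matcher(url).matches();
    }

    // Substring match: the pattern only has to occur somewhere in the URL.
    static boolean partialMatch(String url) {
        return LOOSE.matcher(url).find();
    }

    public static void main(String[] args) {
        String review = "http://www.example.com/catherine/61-32367/reviews/";
        String listing = "http://www.example.com/reviews/?page=2";

        System.out.println(fullMatch(review));     // true
        System.out.println(fullMatch(listing));    // false
        System.out.println(partialMatch(listing)); // true
    }
}
```

    If the listing pages are being accepted, the crawler may be using something closer to the substring behaviour, but that is a guess and not something confirmed here.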

    I do want the crawler to visit:

    http://www.example.com/reviews/?page=2
    http://www.example.com/reviews/?page=3
    http://www.example.com/reviews/?page=4

    ...and index only the reviews found on each of those pages, but not the listing pages that lead to the reviews, such as:

    http://www.example.com/reviews/?page=2

    ...because it makes search results messy and inflates database size.

    I realise this isn't a LucidWorks web site, but my question is about Java Regular Expressions: is there an expression I can use that will tell the crawler to visit all pages in:

    /reviews/

    ...but only index a URL if it has /reviews/ at the end, and never index a page that has ?page=x in the URL?
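
    A minimal sketch of one way to express this in plain Java, assuming the crawl rule and the index rule can be set separately (the post does not establish whether LucidWorks allows that) and assuming the bare category page http://www.example.com/reviews/ should stay out of the index too. The class, method, and constant names are hypothetical. The index pattern uses a negative lookahead, (?!reviews/), and the character class [^?#] so that nothing with a query string can match.

```java
import java.util.regex.Pattern;

class ReviewUrlFilter {
    // Visit rule: any path on the site ending in reviews/,
    // optionally followed by a ?page=N query string.
    static final Pattern VISIT = Pattern.compile(
        "http://www\\.example\\.com/(?:[^?#]*/)?reviews/(?:\\?page=\\d+)?");

    // Index rule: at least one path segment before /reviews/, no query
    // string, and the path must not start with reviews/ itself.
    static final Pattern INDEX = Pattern.compile(
        "http://www\\.example\\.com/(?!reviews/)[^?#]+/reviews/");

    // Both checks require the whole URL to match the pattern.
    static boolean shouldVisit(String url) {
        return VISIT.matcher(url).matches();
    }

    static boolean shouldIndex(String url) {
        return INDEX.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://www.example.com/catherine/61-32367/reviews/",
            "http://www.example.com/reviews/",
            "http://www.example.com/reviews/?page=2"
        };
        for (String url : urls) {
            System.out.println(url
                + " visit=" + shouldVisit(url)
                + " index=" + shouldIndex(url));
        }
    }
}
```

    With these patterns all three sample URLs are visited, but only the first one is indexed. If the crawler takes a single include pattern and matches the whole URL, the INDEX expression alone would also exclude the listing pages from being crawled, so the two-rule split matters.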

    If anyone has a better idea of how I can achieve this, please let me know. I would really appreciate it.

    Thanks for reading this long rambling post.

    Sorry if this is the wrong forum for this type of question.


  2. #2
    Forum VIP
    Join Date
    Oct 2010
    Posts
    275
    My Mood
    Cool
    Thanks
    32
    Thanked 54 Times in 47 Posts
    Blog Entries
    2

    Re: Web Crawling & Java Regular Expressions

    Nvm read the question wrong, will edit this again in a moment

    Sigh I can't get my expression to work, I am sad now
    Last edited by Tjstretch; November 18th, 2011 at 03:08 PM.
