URLConnection inconsistency + SSCCE
So I decided to try out URL connections and requests and whatnot, and I didn't want to use any third-party libraries for my first time, and I am quite confused why it is being very, very inconsistent. The idea was to use Wiktionary's API as a dictionary for a small application. However, available() on the URLConnection's input stream will seemingly randomly return 0 after the first few calls, and then sometimes I get two more useful connections and then it's back to 0. This might be clearer with an SSCCE to demonstrate the problem.
Code Java:
import java.net.*;
import java.io.*;
/**
 * This class is an SSCCE to demonstrate the inconsistency
 * in the URLConnection class
 */
public class URLConnectionTest
{
    public static void main(String[] args) throws Exception //No error handling to keep it short and sweet
    {
        for(int i = 0; i < 15; i++)
        {
            HttpURLConnection connection = (HttpURLConnection)new URL("http://en.wiktionary.org/wiki/amo?action=raw").openConnection();
            InputStream in = connection.getInputStream();
            System.out.println((i+1)+". Expected bytes = "+in.available());
            in.close();
            connection.disconnect();
        }
    }
}
Average case:
Code :
1. Expected bytes = 2896
2. Expected bytes = 0
3. Expected bytes = 0
4. Expected bytes = 0
5. Expected bytes = 0
6. Expected bytes = 0
7. Expected bytes = 0
8. Expected bytes = 0
9. Expected bytes = 0
10. Expected bytes = 0
11. Expected bytes = 0
12. Expected bytes = 0
13. Expected bytes = 0
14. Expected bytes = 0
15. Expected bytes = 0
Edit:
The response code is always 200. [HTTP_OK]
Edit 2:
The first time it takes around 330 ms, the rest take around 220 ms most of the time, but every once in a while the first time is faster. Also, no amount of delay/forcing the gc makes any difference.
Re: URLConnection inconsistency + SSCCE
You have to remember that the program that's sending the data at the other end of that stream is running asynchronously. You've only just opened the connection: chances are all that has been sent is headers and they might have been enough to fill a network packet with little left over for content. Read the API doc for InputStream.available. It sounds awkward because it doesn't give you much: the number of bytes you can read without blocking. You're doing web requests, so expect plenty of blocking reads. Put a BufferedReader or something on that InputStream and invoke readLine() until you get to the end of stream. You'll see something much better.
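Here is a minimal sketch of the difference, so you can see it without touching the network. The class ReadToEnd and the helper slowStream are made up for illustration: slowStream stands in for a connection's input stream and, like the base InputStream class, always reports 0 from available() even though it has data. UTF-8 is an assumption about the encoding.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadToEnd {
    /** Reads lines until readLine() returns null (end of stream) and returns how many there were. */
    static int countLines(InputStream in) throws IOException {
        // UTF-8 is assumed here; a real client should honour the response's charset.
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        int lines = 0;
        while (reader.readLine() != null) {
            lines++;
        }
        return lines;
    }

    /** Hypothetical stand-in for a network stream: it has data, but available() reports 0. */
    static InputStream slowStream(String text) {
        final InputStream data = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
        return new InputStream() {
            @Override
            public int read() throws IOException {
                return data.read();
            }
            // available() is inherited from InputStream, whose default implementation returns 0.
        };
    }

    public static void main(String[] args) throws IOException {
        InputStream in = slowStream("==Latin==\n===Verb===\namo\n");
        System.out.println("available() = " + in.available()); // 0, yet nothing is wrong
        System.out.println("lines read  = " + countLines(in)); // 3
    }
}
```

With a real connection you would pass connection.getInputStream() to countLines (or collect the lines instead of counting them) and ignore available() altogether.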
Re: URLConnection inconsistency + SSCCE
Also note you are bombarding the server with 15 requests in an extraordinarily short time span, which could be recognized as a denial of service hacker attack (in other words the responses you are getting could be a preventative measure against this sort of attack).
Re: URLConnection inconsistency + SSCCE
I just had another look at your code. When you're doing stuff with URLConnection bear in mind that Java uses persistent connections:
HTTP Persistent Connections
And that entity you requested is longer than 2896 (first value printed) bytes:
Code :
sean@bulldozer:~$ curl -s -v http://en.wiktionary.org/wiki/amo?action=raw
* About to connect() to en.wiktionary.org port 80 (#0)
* Trying 91.198.174.226... connected
* Connected to en.wiktionary.org (91.198.174.226) port 80 (#0)
> GET /wiki/amo?action=raw HTTP/1.1
> User-Agent: curl/7.21.6 (x86_64-pc-linux-gnu) libcurl/7.21.6 OpenSSL/1.0.0e zlib/1.2.3.4 libidn/1.22 librtmp/2.3
> Host: en.wiktionary.org
> Accept: */*
>
* HTTP 1.0, assume close after body
< HTTP/1.0 200 OK
< Date: Wed, 14 Mar 2012 23:41:39 GMT
< Server: Apache
< X-Content-Type-Options: nosniff
< Cache-Control: public, s-maxage=0, max-age=2678400
< Last-Modified: Thu, 08 Mar 2012 16:36:45 GMT
< Vary: Accept-Encoding
< Content-Length: 4115
So it looks as though what's happening is that wiktionary has transferred nearly 3,000 bytes to you, but there's still another 1,000-odd to be transferred before the response is completed. Your next request will not be serviced because it's 'behind' the first one. Read each request completely and you won't get this output.
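Reading a response completely can be sketched like this. DrainResponse and drain are names I've made up, and the ByteArrayInputStream is only a stand-in for connection.getInputStream() so the example runs offline; its size of 4115 is taken from the Content-Length header above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class DrainResponse {
    /** Reads the stream until read() returns -1 and returns the total number of bytes seen. */
    static long drain(InputStream in) throws IOException {
        byte[] buffer = new byte[1024];
        long total = 0;
        int n;
        // read() blocks until some bytes arrive and returns -1 only at end of stream.
        while ((n = in.read(buffer)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for connection.getInputStream(); 4115 matches the Content-Length shown above.
        try (InputStream in = new ByteArrayInputStream(new byte[4115])) {
            System.out.println("Read " + drain(in) + " bytes");
        }
    }
}
```

Against the real server you could compare the result of drain with connection.getContentLength(), which returns -1 when the server doesn't send the header.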
copeg makes a good point: if you're writing robots you should attempt to adhere to the Robots Exclusion Standard. Wiktionary.org's /robots.txt hasn't anything in it to bother you, but you should at least be naming your bot: set the User-Agent header, because the default will be something like "Java/1.6.0", and many sites Disallow and occasionally block robots whose authors can't be arsed to name them. Rate-limiting isn't in the standard, but Crawl-delay is a common declaration, so some people do care about too-rapid requests!
The Web Robots Pages
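A rough sketch of naming your bot and spacing out requests follows. The class name PoliteClient, the User-Agent string, the one-second delay and the extra words are all placeholders I've invented; choose values that suit your application. The loop only prepares each request and prints it, so it runs without a network.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class PoliteClient {
    // Hypothetical bot name and delay, for illustration only.
    static final String USER_AGENT = "ExampleDictionaryBot/0.1 (contact: you@example.com)";
    static final long DELAY_MILLIS = 1000;

    /** Prepares (but does not yet send) a request that carries a descriptive User-Agent. */
    static HttpURLConnection open(String url) throws IOException {
        // openConnection() only creates the object; nothing is sent until the response is requested.
        HttpURLConnection connection = (HttpURLConnection) URI.create(url).toURL().openConnection();
        connection.setRequestProperty("User-Agent", USER_AGENT);
        return connection;
    }

    public static void main(String[] args) throws Exception {
        String[] words = {"amo", "amas", "amat"}; // example lookups
        for (String word : words) {
            HttpURLConnection connection = open("http://en.wiktionary.org/wiki/" + word + "?action=raw");
            System.out.println(connection.getURL() + " as " + connection.getRequestProperty("User-Agent"));
            // A real client would read connection.getInputStream() to the end here.
            Thread.sleep(DELAY_MILLIS); // wait before preparing the next request
        }
    }
}
```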
Re: URLConnection inconsistency + SSCCE
Wow, thank you for all of the help! I managed to get the SSCCE to consistently return 4115 bytes and an appropriate response.
Again, thank you!
Working program if anyone is curious
Code Java:
import java.net.*;
import java.io.*;
/**
 * This class is a test of the URLConnection class
 *
 * @author Timothy Moore
 */
public class URLConnectionTest
{
    public static void main(String[] args) throws Exception //No error handling to keep it short and sweet
    {
        long start = System.currentTimeMillis();
        HttpURLConnection connection = (HttpURLConnection)new URL("http://en.wiktionary.org/wiki/amo?action=raw").openConnection();
        connection.setRequestProperty("User-Agent", "Timothy Moore");
        InputStream in = connection.getInputStream();
        System.out.println(" Response Code = "+connection.getResponseCode());
        DataInputStream dis = new DataInputStream(in);
        int counter = 0;
        try
        {
            while(true)
            {
                dis.readByte();
                counter++;
            }
        }catch(EOFException exc)
        {
            System.out.println("End of file at "+counter+" bytes");
        }
        dis.close();
        connection.disconnect();
        long time = System.currentTimeMillis()-start;
        System.out.println(" Time: "+time+"ms");
    }
}
Re: URLConnection inconsistency + SSCCE
I'm glad you got your code working. All you need to do now is to edit your comment. URLConnection is working the way it's supposed to - even though it might seem a bit odd at first. Wiktionary.org is likely to give you an over-optimistic impression of how URLConnection works. If you do a lot of automated HTTP fetching, you'll discover a lot of variability in how quickly / reliably servers respond! If you're planning to write an application based on someone's API, remember to code in some caching, to set your User-Agent header, and to test how well your code works when you pull your network cable out!
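For the caching part, something as small as this may be enough for an in-house tool. It is only a sketch: CachedLookup and its fetcher parameter are made-up names, the cache never expires anything, and the lambda in main is a stand-in for the code that actually fetches and parses a Wiktionary page.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class CachedLookup {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> fetcher;

    /** fetcher is whatever does the real work, e.g. an HTTP request for the word. */
    CachedLookup(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    /** Returns the cached entry for the word, calling the fetcher only on the first request. */
    String lookup(String word) {
        return cache.computeIfAbsent(word, fetcher);
    }

    public static void main(String[] args) {
        int[] fetches = {0};
        // Stand-in fetcher that counts how often it is called instead of using the network.
        CachedLookup dictionary = new CachedLookup(word -> {
            fetches[0]++;
            return "entry for " + word;
        });
        dictionary.lookup("amo");
        dictionary.lookup("amo"); // served from the cache
        System.out.println("fetches = " + fetches[0]); // 1
    }
}
```

For the pulled-cable test, URLConnection also has setConnectTimeout and setReadTimeout (both in milliseconds), so a dead network produces an exception rather than a hang.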
Re: URLConnection inconsistency + SSCCE
Yeah, I finished the little application that I needed the URLConnection for; it seems to take anywhere from 250 to 1000 ms to respond and be parsed. It ended up parsing a raw page from Wiktionary [Wiktionary's program-readable version of a page]. Now that the URLConnection is working, it works perfectly. It usually only makes a call every few minutes at most, and it's only shared in-house, so I don't have to worry about getting yelled at for flooding the server.
Again, thanks for all your help!