Does the Scanner class use an internal buffer?
This is actually a pretty simple question that I'm surprised I'm unable to find the answer to. Recently, I was advised to use a Scanner to read a large file in tokens, rather than using a BufferedReader and then splitting the Strings it puts out. While this worked wonders for my code, I've found the Scanner to work much more slowly and I suspect the problem is its lack of a buffer... Which it may or may not have. I've found conflicting evidence on the 'net.
So here's my question: Does Scanner use a buffer, if so how large and how can I mess with it? If not, is there any way to get the performance of a BufferedReader with the per-token reading of a Scanner? Because at this point, it seems faster to use a BufferedReader and do String.split than it is to use the Scanner for basically the same task.
For context, the code loading the file looks like this:
Code java:
private void LoadComparisonList(File comparisonFile)
{
    if(!comparisonFile.exists() || !comparisonFile.isFile() || !comparisonFile.canRead()
            || comparisonFile.length() == 0)
    {
        this.errorcode = 2;
        return;
    }
    try
    {
        Scanner scanner = new Scanner(comparisonFile);
        scanner.nextLine(); // discard the header line (column names)
        while(scanner.hasNext())
        {
            // allocate a new array per row; reusing a single array would make
            // every entry in the list point at the same (last) row
            String[] readArray = new String[2];
            readArray[0] = scanner.next();
            readArray[1] = scanner.nextLine();
            this.comparisonList.add(readArray);
        }
        scanner.close();
    }
    catch (IOException exception)
    {
        this.errorcode = 2;
    }
}
This reads and discards the file's header line, which is just column names, and then reads the first token of each row to save as a header and everything else (another five tokens) as a footer, saving each pair into an ArrayList<String[]>. It works just fine, it just takes ~27 seconds for a 90MB, ~5 000 000 line file, where virtually the same action done through a BufferedReader completes in around 4-5 seconds, if I remember correctly.
Re: Does the Scanner class use an internal buffer?
I don't know the answer, but if it isn't specified in the API or JLS, it may be undefined and JVM dependent.
Edit:
Google has given me more info:
Does a Java Scanner implicitly create a buffer even if you do not pass it one? - Stack Overflow
Re: Does the Scanner class use an internal buffer?
Your jdk folder should contain src.zip. Check that out, find the Scanner class, and you can see for yourself exactly what it does. I'd be curious to see what you find.
Re: Does the Scanner class use an internal buffer?
I can't seem to find that on my system, I'm afraid. I checked the Java folder and the Eclipse folder but I can't find a source file. I'm sure it has to exist somewhere for my system to know what's in the individual classes, I just can't find it. Even Eclipse doesn't seem able to find it, since every time it uses one of the base classes when debugging, I just get a "Source code not available" screen. It's why I had to weed out those steps when debugging. I'm not sure I'd be able to understand what the classes do even if I did find the source, to be honest. I'm not that good at Java.
Re: Does the Scanner class use an internal buffer?
It depends on how your system is set up, but for example my JDK folder is:
C:\Program Files\Java\jdk1.7.0_07
In that directory, I have a src.zip, and inside that I have a Scanner.java file (inside java/util within the zip).
But I will say that the short answer to your question, from reading the source, is that Scanner uses a CharBuffer internally.
Re: Does the Scanner class use an internal buffer?
Yeah, I checked my Java folder, but all it has is JRE folders for Java 6 and 7 (I'm still using 6, by the way, legacy stuff). No JDK. As far as Windows 7 can be trusted, I ran a search just in case I wasn't aware of where my Java folders were, but no src.zip turned up.
Re: Does the Scanner class use an internal buffer?
Strange. What JDK are you using?
Either way, you can download the source either as part of the full JDK or standalone from this page: Java SE Downloads
Re: Does the Scanner class use an internal buffer?
At the risk of revealing that I'm not entirely certain I know what a Java Development Kit is, I use Eclipse for code-writing purposes and that's about the extent of it. Beyond that, my only other Java-related software are the JRE packages, and even then I only have 6 and 7.
Either way, thank you for the link. I'll look into it.
Re: Does the Scanner class use an internal buffer?
Okay, gotcha. A JDK is simply the set of tools that compile your Java code into bytecode (namely the javac tool). Compare that to the JRE, which is the set of tools that run compiled bytecode. Eclipse hides this from you because it bundles its own Java compiler, which is why it works even with only a JRE installed. Most people also install a JDK like I have above so that they can compile from the command prompt. The source (src.zip) comes with the JDK, not the JRE, which is another reason it's a good thing to have.
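If you're not sure whether a given Java installation is a JDK or just a JRE, here is a minimal sketch that asks the runtime itself. It assumes Java 6 or later (where javax.tools.ToolProvider exists); the class name JdkCheck is made up for illustration.

```java
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

public class JdkCheck
{
    /** Returns true when the running Java installation exposes a system compiler. */
    public static boolean hasSystemCompiler()
    {
        // getSystemJavaCompiler() returns null when no compiler is provided,
        // which is what you typically see on a plain JRE
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        return compiler != null;
    }

    public static void main(String[] args)
    {
        // java.home shows which installation is actually running this code
        System.out.println("java.home = " + System.getProperty("java.home"));
        System.out.println("system compiler available: " + hasSystemCompiler());
    }
}
```

Run it from inside Eclipse and from the command prompt; the two may report different java.home values if they use different installations.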
Re: Does the Scanner class use an internal buffer?
As far as I know the Scanner class is a convenient Regex-wrapper for any incoming stream. What stream you pass to it will determine if the file is buffered in memory or not.
Code java:
Scanner buffed_reader = new Scanner(new BufferedReader(new FileReader("file.txt"))); // file gets buffered into memory, usually faster
Scanner unbuffed_reader = new Scanner(new FileReader("file.txt")); // file isn't buffered into memory
That being said, I don't know what the behavior is if you pass a file to the Scanner object directly (though it sounds like it isn't).
It's also possible that it's the Regex side which is slowing your application down. If all you need is basic read-line stuff, Scanners may be overkill. Scanners work best when you're parsing the data while you're reading it in (such as reading in a file of numbers).
Re: Does the Scanner class use an internal buffer?
From the source, the Scanner(File) constructor wraps the File in a FileInputStream and takes its channel, which is then converted into a Readable:
Code java:
public Scanner(ReadableByteChannel source) {
    this(makeReadable(Objects.requireNonNull(source, "source")),
         WHITESPACE_PATTERN);
}
And finally passed into the main constructor:
Code java:
private Scanner(Readable source, Pattern pattern) {
    assert source != null : "source should not be null";
    assert pattern != null : "pattern should not be null";
    this.source = source;
    delimPattern = pattern;
    buf = CharBuffer.allocate(BUFFER_SIZE);
    buf.limit(0);
    matcher = delimPattern.matcher(buf);
    matcher.useTransparentBounds(true);
    matcher.useAnchoringBounds(false);
    useLocale(Locale.getDefault(Locale.Category.FORMAT));
}
...which creates the CharBuffer. The CharBuffer is used in the base method for reading input:
Code java:
private void readInput() {
    if (buf.limit() == buf.capacity())
        makeSpace();
    // Prepare to receive data
    int p = buf.position();
    buf.position(buf.limit());
    buf.limit(buf.capacity());
    int n = 0;
    try {
        n = source.read(buf);
    } catch (IOException ioe) {
        lastException = ioe;
        n = -1;
    }
    if (n == -1) {
        sourceClosed = true;
        needInput = false;
    }
    if (n > 0)
        needInput = false;
    // Restore current position and limit for reading
    buf.limit(buf.position());
    buf.position(p);
}
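To make the position/limit juggling in readInput() easier to follow, here is a small standalone sketch of the same fill pattern. It is not Scanner's code: the class CharBufferFillDemo and the method fill are made up for illustration, the source is a StringReader rather than a file, and the 8-character capacity is deliberately tiny and is not Scanner's real BUFFER_SIZE.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.CharBuffer;

public class CharBufferFillDemo
{
    /**
     * Appends characters from source after the unread data in buf.
     * Returns the number of chars read, or -1 at end of input.
     */
    public static int fill(Readable source, CharBuffer buf) throws IOException
    {
        int p = buf.position();      // remember where the unread data starts
        buf.position(buf.limit());   // write after the data already buffered
        buf.limit(buf.capacity());   // allow writing up to the end of the buffer
        int n = source.read(buf);    // advances position by the number of chars read
        buf.limit(buf.position());   // the new end of valid data
        buf.position(p);             // restore the read position
        return n;
    }

    public static void main(String[] args) throws IOException
    {
        CharBuffer buf = CharBuffer.allocate(8); // hypothetical size, for illustration only
        buf.limit(0);                            // empty: nothing to read yet
        Readable source = new StringReader("alpha beta");
        int n = fill(source, buf);
        System.out.println("read " + n + " chars, buffered: [" + buf + "]");
    }
}
```

The point is that the buffer sits between the source and the regex matcher: the source is asked for a chunk at a time, and the matcher then works on whatever is currently between position and limit.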
Re: Does the Scanner class use an internal buffer?
I suppose I should have performed benchmarks before making assumptions :P
Code java:
import java.io.BufferedOutputStream;
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintStream;
import java.util.Scanner;
public class StreamTest
{
    public static void gen_file(String file_name, int lines) throws FileNotFoundException
    {
        PrintStream out = new PrintStream(new BufferedOutputStream(new FileOutputStream(file_name)));
        for (int i = 0; i < lines; ++i)
        {
            out.print("header,");
            for (int j = 0; j < 5; ++j)
            {
                out.print(i * 0.33f + j + ",");
            }
            out.println();
        }
        out.close();
    }
    public static void main(String[] args) throws IOException
    {
        String file_name = "data.txt";
        gen_file(file_name, 100000);
        Scanner scan;
        BufferedReader read;
        int times = 10;
        long scan_times[] = new long[times];
        long buff_times[] = new long[times];
        System.out.println("scanner test");
        for (int i = 0; i < times; ++i)
        {
            scan = new Scanner(new BufferedReader(new FileReader(file_name)));
            scan.useDelimiter(",");
            long start_time = System.currentTimeMillis();
            while (scan.hasNextLine())
            {
                // skip header
                scan.next();
                // read 5 doubles
                scan.nextDouble();
                scan.nextDouble();
                scan.nextDouble();
                scan.nextDouble();
                scan.nextDouble();
                // finish line
                scan.nextLine();
            }
            long end_time = System.currentTimeMillis();
            scan_times[i] = end_time - start_time;
            System.out.println(scan_times[i]);
            scan.close();
        }
        System.out.println("buffered reader test");
        for (int i = 0; i < times; ++i)
        {
            read = new BufferedReader(new FileReader(file_name));
            long start_time = System.currentTimeMillis();
            String line;
            while ((line = read.readLine()) != null)
            {
                String[] split = line.split(",");
                // convert items to doubles
                Double.parseDouble(split[1]);
                Double.parseDouble(split[2]);
                Double.parseDouble(split[3]);
                Double.parseDouble(split[4]);
                Double.parseDouble(split[5]);
            }
            long end_time = System.currentTimeMillis();
            buff_times[i] = end_time - start_time;
            System.out.println(buff_times[i]);
            read.close();
        }
    }
}
This code generates a file to run the test with, then reads it back twice: once through a Scanner and once through a BufferedReader. Each line consists of a header and 5 doubles, all separated by commas. Times are in milliseconds for 100,000 lines.
Results:
scanner test
3162
2855
2787
2800
2799
2797
2760
2755
2779
2756
buffered reader test
209
128
112
108
110
108
108
108
107
107
So it looks like it is indeed true that Scanners are significantly slower than BufferedReaders, even when the Scanner is reading from a buffered source. In this test the difference is roughly 25x.
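For the original task (first token as a header, rest of the line as a footer), a BufferedReader can do the same job without regex-based tokenizing. This is only a sketch under some assumptions: the columns are separated by whitespace, as Scanner's default delimiter implies; the class ComparisonLoader and its methods are hypothetical names; and, unlike scanner.nextLine(), it trims the leading whitespace off the footer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class ComparisonLoader
{
    /** Reads rows as {first token, rest of line}, skipping the first (column names) line. */
    public static List<String[]> load(Reader source) throws IOException
    {
        List<String[]> rows = new ArrayList<String[]>();
        BufferedReader reader = new BufferedReader(source);
        try
        {
            reader.readLine(); // discard the header line
            String line;
            while ((line = reader.readLine()) != null)
            {
                line = line.trim();
                if (line.isEmpty())
                    continue; // ignore blank lines
                int split = firstWhitespace(line);
                String head = split < 0 ? line : line.substring(0, split);
                String rest = split < 0 ? "" : line.substring(split).trim();
                rows.add(new String[] { head, rest }); // a fresh array for every row
            }
        }
        finally
        {
            reader.close();
        }
        return rows;
    }

    /** Index of the first whitespace character in s, or -1 if there is none. */
    private static int firstWhitespace(String s)
    {
        for (int i = 0; i < s.length(); ++i)
            if (Character.isWhitespace(s.charAt(i)))
                return i;
        return -1;
    }

    public static void main(String[] args) throws IOException
    {
        // in-memory sample so the sketch runs without a file
        String sample = "name a b c d e\nrow1 1 2 3 4 5\nrow2 6 7 8 9 10\n";
        for (String[] row : load(new StringReader(sample)))
            System.out.println(row[0] + " | " + row[1]);
    }
}
```

For a real file, pass new FileReader(comparisonFile) instead of the StringReader. Whether this matches the 4-5 second figure on the 90MB file would need to be measured.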
Re: Does the Scanner class use an internal buffer?
Quote:
Originally Posted by
KevinWorkman
Okay, gotcha. A JDK is simply the set of tools that compile your Java code into bytecode (namely the javac tool). Compare that to the JRE, which is the set of tools that run compiled bytecode. Eclipse hides this from you because it bundles its own Java compiler, which is why it works even with only a JRE installed. Most people also install a JDK like I have above so that they can compile from the command prompt. The source (src.zip) comes with the JDK, not the JRE, which is another reason it's a good thing to have.
Thank you. I suspected as much, but it's been a while since I've used precise terminology (I'm not actually a native English speaker), so I'm not sure I was ever clear on this. I'll see about getting the actual JDK, though I still prefer Eclipse. The ease of use of the editor's functions is too good to pass up. At this point I'm hooked on code-complete as a means of avoiding spelling errors (sloppy typist here).
Quote:
Originally Posted by
helloworld922
As far as I know the Scanner class is a convenient Regex-wrapper for any incoming stream. What stream you pass to it will determine if the file is buffered in memory or not.
Code java:
Scanner buffed_reader = new Scanner(new BufferedReader(new FileReader("file.txt"))); // file gets buffered into memory, usually faster
Scanner unbuffed_reader = new Scanner(new FileReader("file.txt")); // file isn't buffered into memory
That being said, I don't know what the behavior is if you pass a file to the Scanner object directly (though it sounds like it isn't).
Wait, I can pass a BufferedReader to a Scanner? But I thought its constructor expected a Stream child, not a Reader child? Yes, if I can pass a BufferedReader, then I'd expect the Scanner's behaviour to build on that of the underlying reader object, so I'd expect it to use said reader's buffer. I didn't know I could do that, though.
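On the constructor question: Scanner has an overload that takes a java.lang.Readable, and java.io.Reader implements Readable, so any Reader (including BufferedReader) can be passed in. A minimal sketch, with the class ScannerSourceDemo and the method tokens made up for illustration and a StringReader standing in for the file:

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerSourceDemo
{
    /** Collects all whitespace-separated tokens from any Readable source. */
    public static List<String> tokens(Readable source)
    {
        List<String> out = new ArrayList<String>();
        Scanner scanner = new Scanner(source); // the Scanner(Readable) overload
        while (scanner.hasNext())
            out.add(scanner.next());
        scanner.close(); // also closes the source if it is Closeable
        return out;
    }

    public static void main(String[] args)
    {
        // BufferedReader is a Reader, and Reader implements Readable
        Readable source = new BufferedReader(new StringReader("header 1.5 2.5"));
        System.out.println(tokens(source));
    }
}
```

Note that the Scanner still copies from the Reader into its own CharBuffer, so wrapping in a BufferedReader adds a buffer in front of Scanner's rather than replacing it.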
Quote:
Originally Posted by
helloworld922
It's also possible that it's the Regex side which is slowing your application down. If all you need is basic read-line stuff, Scanners may be overkill. Scanners work best when you're parsing the data while you're reading it in (such as reading in a file of numbers).
I'm not just reading the data, I need to split it into tokens. Each row comes in six parts - one header and five number columns. Originally, I was using a BufferedReader and then post-splitting and converting the coming Strings, which really isn't a good idea. After having my post moderated in another thread, it gave me the idea of using a StringTokenizer, instead. I haven't read up enough on it to know for certain, but this seems like an easier way to get tokens out of a file reader than a Scanner, which does indeed seem to want to parse my data. I suspect that might be faster, though I don't know if the StringTokenizer itself won't reintroduce some slowdown.
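For reference, the StringTokenizer idea could look roughly like this for one row. It is a sketch that assumes whitespace-separated columns with a header followed by numbers; TokenizerRowParser and Row are hypothetical names, and the lines themselves would still come from BufferedReader.readLine().

```java
import java.util.StringTokenizer;

public class TokenizerRowParser
{
    /** One parsed row: a header plus its numeric columns. */
    public static final class Row
    {
        public final String header;
        public final double[] values;

        Row(String header, double[] values)
        {
            this.header = header;
            this.values = values;
        }
    }

    /** Splits a line on whitespace; throws NoSuchElementException for a blank line. */
    public static Row parse(String line)
    {
        // default delimiters are space, tab, newline, carriage return and form feed
        StringTokenizer tokens = new StringTokenizer(line);
        String header = tokens.nextToken();
        // countTokens() reports how many tokens remain after the header
        double[] values = new double[tokens.countTokens()];
        for (int i = 0; i < values.length; ++i)
            values[i] = Double.parseDouble(tokens.nextToken());
        return new Row(header, values);
    }

    public static void main(String[] args)
    {
        Row row = parse("row1 1.5 2 3 4 5.25");
        System.out.println(row.header + " has " + row.values.length + " numbers");
    }
}
```

StringTokenizer does no regex matching, which is the reason to expect it to be cheaper than Scanner here, but that is something to benchmark rather than assume.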
In any event, the previous memory problem and the Scanner's slowdown forced me to admit defeat and move the software to our server so it can use more operating memory. It's either that or take ages to accomplish anything.