How to use MappedByteBuffer efficiently with extremely large files
I am currently working on database-like software that is meant to process very large amounts of data. Most of this data is kept in a number of files on disk. Note that these files exist only temporarily: they are created when the program runs and deleted when the Java VM shuts down. They also get extremely big, in the range of 2-20 GB for practical applications of the software.
My supervisor told me to use a MappedByteBuffer to map parts of these files into memory, and said that this is a lot more efficient than simply using a FileChannel. So I read the Java API documentation for MappedByteBuffer and did some experiments to figure out how it works.
Since the files are far too big to fit into memory, my basic idea was to map a specific part of each file, ranging from a point A to a point B, into memory. Every time the file is accessed by a read or write operation, I check whether the operation falls inside that window. If it doesn't, I map a different part of the file so that the starting point of the requested operation lies in the middle between A and B.
I ran some tests with this and got mixed results. Some test cases run nearly twice as fast as they do when I simply use a FileChannel. Others are extremely slow or simply crash with an I/O exception.
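To make the windowing idea concrete, here is a minimal sketch of it. The class name WindowedMappedFile, its methods and the remap counter are made up for illustration and are not the code I actually use; the sketch only handles 8-byte reads and writes. It assumes the window is much smaller than Integer.MAX_VALUE bytes, which is the upper limit for a single mapping, and it relies on a READ_WRITE mapping growing the file when the window reaches past the current end of the file.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: keeps one mapped window [windowStart, windowStart + windowSize)
// over a large file and re-centres it whenever an access falls outside.
final class WindowedMappedFile implements AutoCloseable {
    private final FileChannel channel;
    private final int windowSize;   // a single mapping cannot exceed Integer.MAX_VALUE bytes
    private MappedByteBuffer window;
    private long windowStart = -1;  // -1 means "nothing mapped yet"
    private int remapCount = 0;     // lets a test see how often the window moved

    WindowedMappedFile(Path file, int windowSize) throws IOException {
        if (windowSize < 2 * Long.BYTES) {
            throw new IllegalArgumentException("window too small for centred 8-byte accesses");
        }
        this.channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
        this.windowSize = windowSize;
    }

    // Make sure the 8 bytes starting at 'position' lie inside the current window.
    private void ensureMapped(long position) throws IOException {
        boolean inside = window != null
                && position >= windowStart
                && position + Long.BYTES <= windowStart + windowSize;
        if (inside) {
            return;
        }
        // Put the requested position in the middle of the new window (clamped at offset 0).
        long start = Math.max(0, position - windowSize / 2);
        // Assumption: mapping READ_WRITE past the end of the file extends the file.
        window = channel.map(FileChannel.MapMode.READ_WRITE, start, windowSize);
        windowStart = start;
        remapCount++;
    }

    long readLong(long position) throws IOException {
        ensureMapped(position);
        return window.getLong((int) (position - windowStart));
    }

    void writeLong(long position, long value) throws IOException {
        ensureMapped(position);
        window.putLong((int) (position - windowStart), value);
    }

    int remapCount() {
        return remapCount;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

One thing to keep in mind with this design: according to the MappedByteBuffer documentation, a mapping stays valid until the buffer itself is garbage-collected, so every remap can leave the previous window mapped for a while. With a large window and many remaps that adds up quickly.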
Before I get into specific details, I'd like to ask if there are any tutorials or similar resources that deal with this scenario. I have searched for quite a while now, but all I found were tutorials and examples that use a pre-existing file which is mapped into memory in its entirety. I also don't know anybody with experience in using a MappedByteBuffer whom I could talk to.
Since I wasn't able to find any examples of what I'm trying to do, I'm also wondering whether using a MappedByteBuffer even makes sense in my case, or if I should scrap the idea and simply use a FileChannel instead. However, the increase in performance I get in some test cases is very hard to ignore.
Any help would be greatly appreciated. If you want more information, I'll be happy to provide it.
Re: How to use MappedByteBuffer efficiently with extremely large files
What kind of access patterns do you have? Check to make sure you're not swapping buffers too often (you can probably do a few tests to determine what "too often" means). If possible, try re-arranging your data set so you have to swap buffers as little as possible.
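As a rough way to put a number on "too often", something like the following sketch could be used. It does no file I/O at all; it only counts how many times a window that is re-centred on every miss would have to move for a given list of access positions. The class RemapSimulation, its method names and all parameters are made up for illustration, and the re-centring policy is only my assumption about how the window is moved.

```java
import java.util.Random;

// Hypothetical simulation: counts window moves for a list of access positions.
final class RemapSimulation {

    // Returns how many times the window has to be (re)mapped for these accesses.
    static long countRemaps(long[] positions, long windowSize, int accessSize) {
        long windowStart = -1; // -1 means "nothing mapped yet"
        long remaps = 0;
        for (long pos : positions) {
            boolean inside = windowStart >= 0
                    && pos >= windowStart
                    && pos + accessSize <= windowStart + windowSize;
            if (!inside) {
                // Re-centre the window on the requested position (clamped at offset 0).
                windowStart = Math.max(0, pos - windowSize / 2);
                remaps++;
            }
        }
        return remaps;
    }

    // Positions 0, step, 2*step, ... as in a start-to-finish scan.
    static long[] sequential(int count, int step) {
        long[] positions = new long[count];
        for (int i = 0; i < count; i++) {
            positions[i] = (long) i * step;
        }
        return positions;
    }

    // Uniformly random positions inside a file of the given size (seeded for repeatability).
    static long[] random(int count, long fileSize, int accessSize, long seed) {
        Random rnd = new Random(seed);
        long[] positions = new long[count];
        for (int i = 0; i < count; i++) {
            positions[i] = (long) (rnd.nextDouble() * (fileSize - accessSize));
        }
        return positions;
    }
}
```

Comparing countRemaps for a sequential list and a random list over the same simulated file size gives a quick feel for how much more often the random pattern forces the window to move, before any real I/O is measured.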
Re: How to use MappedByteBuffer efficiently with extremely large files
Quote, originally posted by helloworld922:
What kind of access patterns do you have? Check to make sure you're not swapping buffers too often (you can probably do a few tests to determine what "too often" means). If possible, try re-arranging your data set so you have to swap buffers as little as possible.
It depends on the test case. In some cases, the file is accessed sequentially from start to finish; in others, random positions in the file are accessed in rapid succession. The sequential cases are the ones where I get a vast increase in performance, whereas the random-access cases take unreasonably long.
I could construct my test cases in such a way that the buffers aren't swapped "too often", but that probably wouldn't be a good idea in the long term since the behaviour of the program will be heavily dependent on user input for practical applications.
I think I'll ditch the MappedByteBuffers and stick to using a FileChannel.
Re: How to use MappedByteBuffer efficiently with extremely large files
I didn't mean you should design your test cases to give the best performance, but to best represent how users are likely to use the program. I don't know what kind of data you have, but your customers may use it in some pattern that you can exploit for performance. You could have a customer test the software and provide usage statistics, which you can then analyze to work out how best to lay out your data so that you get good performance as often as possible.
You could also provide the generic random-access commands using FileChannel and add some "high capacity" commands, designed for accessing large contiguous blocks, that use MappedByteBuffers. This way you get decent performance for random access and still get large performance boosts for contiguous access.
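A minimal sketch of that split might look like this. HybridFileStore and its methods are hypothetical names, and summing longs is just a stand-in for whatever a "high capacity" command would really do; the sketch assumes the contiguous block fits into a single mapping (at most Integer.MAX_VALUE bytes) and lies within the file.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical store: positional FileChannel I/O for random access,
// a temporary mapping for large contiguous reads.
final class HybridFileStore implements AutoCloseable {
    private final FileChannel channel;

    HybridFileStore(Path file) throws IOException {
        this.channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
    }

    // Random access: read one long with a positional read, no mapping involved.
    long readLongAt(long position) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES);
        while (buf.hasRemaining()) {
            int n = channel.read(buf, position + buf.position());
            if (n < 0) {
                throw new EOFException("read past end of file at " + position);
            }
        }
        buf.flip();
        return buf.getLong();
    }

    // Random access: write one long with a positional write.
    void writeLongAt(long position, long value) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES);
        buf.putLong(value);
        buf.flip();
        while (buf.hasRemaining()) {
            channel.write(buf, position + buf.position());
        }
    }

    // "High capacity" command: map one contiguous block and scan it front to back.
    long sumLongs(long position, int count) throws IOException {
        MappedByteBuffer block = channel.map(
                FileChannel.MapMode.READ_ONLY, position, (long) count * Long.BYTES);
        long sum = 0;
        for (int i = 0; i < count; i++) {
            sum += block.getLong(); // relative read, advances through the block
        }
        return sum;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```

The point of the split is that the mapping is only created when the caller already knows it will touch a large contiguous range, so the cost of setting up the mapping is spread over many reads instead of being paid on every random access.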