Following is a reasonably complete answer to the question "What is the limit on ISAM file sizes?" Note that this discussion applies to both standard ISAM and ISAM-A.
• | The 2GB OS file size limit: In order for any file to exceed 2GB, the file system must support it, and you must be running an "LFS" (Large File Support) version of A-Shell. All versions of A-Shell after 4.9.948 (December 2005) contain LFS, except the Windows versions which contain a "c" in the version string. So, other than those limitations, A-Shell does not impose any inherent limit on overall file size. |
• | ISAM data file size limit: Both ISAM and ISAM-A (aka ISAM-PLUS) use internal 32-bit record pointers. Although it might be possible, with care, to treat them as unsigned, by convention they are signed, which imposes a limit of 2^31 (approximately 2 billion) records. With a 512 byte record size, that would put the theoretical overall data file size at 2^31 records * 512 bytes = 1TB. |
• | ISAM index file size limit: Both ISAM and ISAM-A also use internal 32-bit index pointers, although the index block size differs between the two: ISAM-A uses 1024 byte index blocks, imposing a theoretical IDX file size limit of 2TB. ISAM 1.0 used 512 byte index blocks, while ISAM 1.1 is configurable (see the ISMBLD.LIT /B switch), supporting IDX block sizes from 512 to 16384 (1TB to 32TB theoretical limits). The arithmetic behind these limits is sketched just after this list. |
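To make the arithmetic concrete, here is a small Python sketch of those theoretical limits. The 512 byte record size and the index block sizes are simply the example values used above, not properties of any particular file.

    # Theoretical ISAM/ISAM-A size limits implied by signed 32-bit pointers.
    # Record and block sizes below are the example values from the text.

    MAX_ENTRIES = 2**31                 # signed 32-bit record/index pointers
    TB = 2**40                          # bytes per terabyte

    def data_file_limit(record_size):
        """Theoretical data file size: max record count times record size."""
        return MAX_ENTRIES * record_size

    def index_file_limit(block_size):
        """Theoretical IDX file size: max block count times index block size."""
        return MAX_ENTRIES * block_size

    print(data_file_limit(512) / TB)       # 512-byte records        -> 1.0 TB
    print(index_file_limit(1024) / TB)     # ISAM-A 1K index blocks  -> 2.0 TB
    print(index_file_limit(512) / TB)      # 512-byte index blocks   -> 1.0 TB
    print(index_file_limit(16384) / TB)    # 16K index blocks        -> 32.0 TB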
Note that the index size depends on the number of data records, the key size(s), and the amount of extra space allowed for performance and tree-balancing reasons, so in practice the effective limit on the number of data records may be considerably lower than the theoretical one.
The above numbers are all theoretical limits. In practice, you will likely run into some severe performance issues long before you get to those sizes. Some of the factors to consider here are:
• | File system performance on large files. All file systems slow down (quite substantially) when dealing with very large files, because they must employ some kind of hierarchical pointer structure to locate the physical disk block associated with a particular logical position in the file. The details vary between file system types, but I have not been able to locate a definitive reference as to which file system(s) handle such large files best. |
• | Cache efficiency falls dramatically with such large files, because the nature of ISAM leads to random jumps all over the file. Other than the first couple of levels of the index, the accesses to both the index and data files are likely to be uniformly spread, meaning that cache efficiency will not be much better than the raw ratio of available cache memory to the size of the files. |
• | Because of the cache problem, you quickly run up against the raw random access performance of the disk drive. With a typical 7200 RPM drive offering 8 ms average seek time, your total time to read a random record is going to be 8 ms (seek) + 4 ms (rotational latency) = 12 ms, or about 83 reads or writes per second (not counting transfer times, scheduling delays, etc.). So if you wanted to scan a billion record file in index order, even if the index accesses were completely cached, each data read would probably require a seek, and the scan would take over 3000 hours! (The arithmetic is sketched just after this list.) Operations requiring random index seeks, such as lookups, adds, and deletes, would be considerably slower still. |
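For reference, here is the back-of-the-envelope arithmetic behind those numbers, including the faster-drive comparison in the next paragraph. The seek and latency figures are the illustrative values from the text, not benchmarks of any particular drive.

    # Rough random-access throughput for an uncached ISAM scan on a spinning
    # disk, using the illustrative seek/latency figures from the text.

    def reads_per_second(seek_ms, latency_ms):
        """Random reads/sec if every access costs a full seek plus latency."""
        return 1000.0 / (seek_ms + latency_ms)

    def scan_hours(records, seek_ms, latency_ms):
        """Hours to read 'records' records when each one needs its own seek."""
        return records * (seek_ms + latency_ms) / 1000.0 / 3600.0

    # 7200 RPM drive: ~8 ms seek + ~4 ms rotational latency = 12 ms per access
    print(reads_per_second(8, 4))               # ~83 reads or writes per second
    print(scan_hours(1_000_000_000, 8, 4))      # ~3333 hours for a billion records

    # 15K RPM drive: ~6 ms seek + ~2 ms rotational latency = 8 ms per access
    print(scan_hours(1_000_000_000, 6, 2))      # ~2222 hours, i.e. about 1/3 less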
Obviously you would want to go with the fastest possible drives (a 15K RPM drive with a 6 ms average seek time cuts the per-access time from 12 ms to about 8 ms, i.e. by a third) and as much RAM as you can possibly get. Also, splitting the index and data across two drives would help considerably.
But the biggest help would come from restructuring your data to reduce the size of any individual file. Within reason, you'll get much better performance by having more, smaller files.
For example, even though it might not be that elegant, if you had a transaction history file with 120 million records, you would probably get better performance by splitting it into 12 files of 10 million records each (one file per month).
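As a hypothetical sketch of that monthly split (the TRNH01..TRNH12 file names are invented for illustration; a real application would plug this into its own ISAM open/read logic):

    # Hypothetical routing of transaction-history records to monthly files.
    # The TRNH01..TRNH12 file names are invented for illustration only.

    def history_file_for(trans_date: str) -> str:
        """Map a transaction date in YYYYMMDD form to its monthly file name."""
        month = trans_date[4:6]            # "20050315" -> "03"
        return f"TRNH{month}"              # March records land in TRNH03

    def files_for_range(start_month: int, end_month: int) -> list[str]:
        """Files a date-range query would have to open, e.g. a quarterly report."""
        return [f"TRNH{m:02d}" for m in range(start_month, end_month + 1)]

    print(history_file_for("20050315"))    # TRNH03
    print(files_for_range(1, 3))           # ['TRNH01', 'TRNH02', 'TRNH03']

Besides keeping each data and index file an order of magnitude smaller, this kind of split lets you rebuild, archive, or purge one month at a time.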