LLX > Neil Parker > Apple II > Speeding Up DOS 3.3

Speeding Up Loads in DOS 3.3

Introduction

Let's face it: Apple DOS 3.3 is infamous for its slowness. Anyone who has ever watched it load a hi-res image is familiar with the problem, waiting as first a little bit of the image appears, then a little bit more, with the whole image typically taking around twenty seconds to load. Clearly it doesn't need to be that way: Apple's ProDOS, for example, can usually load the same image in under five seconds.

So what's making it so slow, and what can we do about it?

An Apple DOS 3.3 disk's surface is divided into sectors of 256 bytes each. At a low level, this is the only way to access the disk: one whole 256-byte sector at a time.

But file accesses don't conform to one-sector-at-a-time rules. It's normal, and quite common, to want to access just part of one sector, or to start partway through one sector and finish partway through another. So when a file is opened, DOS sets aside a 256-byte area of memory (the data sector buffer) for accessing its sectors, and then copies bytes into or out of it, one byte at a time. That way programs can read or write anywhere in the file, without regard to sector boundaries.

But all that one-byte-at-a-time copying takes time—time during which the disk continues to spin. In fact, while the copying is taking place, chances are the file's next sector is spinning past the read/write head unnoticed, so that by the time DOS is ready for the next sector, it has to wait a whole disk revolution for it to come around again. Whenever you load a file with more than one data sector, DOS spends most of its time waiting for the disk to spin (this is called blowing revs).

Several third parties produced modified versions of DOS that are much faster, such as Diversi-DOS, DAVID-DOS, and ProntoDOS. All of these work essentially the same way, by recognizing when a program is accessing a part of a file that starts at a sector boundary and still has at least 256 bytes to go. In that case, they load or save the sector directly from the program's memory, skipping the data sector buffer and all that one-byte-at-a-time copying. This is usually sufficient to catch the file's next sector the first time it comes around instead of the second, and the file access is up to five times faster.
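To make the boundary test concrete, here is a minimal Python sketch of the arithmetic involved. It splits a read request into a leading partial sector, a run of whole sectors that could be transferred directly, and a trailing partial sector. The function name split_read and its return layout are my own inventions for illustration; none of the fast DOS versions is claimed to be organized this way.

```python
SECTOR_SIZE = 256  # bytes per DOS 3.3 sector

def split_read(position, length):
    """Split a read of `length` bytes at file offset `position` into
    (head, whole_sectors, tail): bytes that go through the buffer before
    the first sector boundary, sectors that could be read directly, and
    bytes that go through the buffer at the end."""
    # Bytes needed to reach the next sector boundary (0 if already aligned)
    head = min(length, (-position) % SECTOR_SIZE)
    whole_sectors, tail = divmod(length - head, SECTOR_SIZE)
    return head, whole_sectors, tail

print(split_read(0, 8192))    # starts on a sector boundary
print(split_read(100, 600))   # starts 100 bytes into a sector
```

Only the middle number benefits from the fast path; the head and tail still go through the data sector buffer one byte at a time.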

Below, a patch is developed for DOS 3.3 that speeds up loading, and hopefully clarifies the mechanics of how fast versions of DOS work. (Saving is not patched here—saving is more complicated than loading, and speeding it up requires additional memory, and a slightly different approach.)

Sources

In preparing what follows, I've relied heavily on the classic DOS reference, Beneath Apple DOS by Don Worth and Pieter Lechner, and on the final DOS source code, rescued from oblivion by David T. Craig and findable on the usual Apple II download sites. Several snippets of assembly language below are taken from an assembly listing made from this source code—the addresses in the listing differ from those in Beneath Apple DOS because the source code assembles a "master" DOS that loads from $1B00 to $3FFF and then relocates itself. Addresses can be matched with normal 48K DOS, and with Beneath Apple DOS, by adding $8000.
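As a quick aid for reading the listings, this small Python sketch applies the $8000 offset described above. The function name to_48k is hypothetical; the three sample addresses are taken from the listing excerpts in this article.

```python
MASTER_TO_48K_OFFSET = 0x8000  # offset between the master listing and 48K DOS

def to_48k(listing_address):
    """Convert a master-DOS listing address to its 48K DOS equivalent."""
    return (listing_address + MASTER_TO_48K_OFFSET) & 0xFFFF

for name, addr in [("RNXBLK", 0x2C96), ("LOCNXB", 0x30B6), ("DTBLN", 0x31B5)]:
    print(f"{name}: ${addr:04X} -> ${to_48k(addr):04X}")
```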

Technical Details

As with any non-trivial DOS patch, the fast load patch has to go in memory somewhere, and for the sake of compatibility it needs to go somewhere where it won't add to DOS's memory needs. The good news is that DOS 3.3 has three areas of unused memory inside itself where such patches can go (provided that they aren't too big). The bad news is that Apple released three versions of DOS 3.3, and in each version the amount of memory available in those three gaps is different. The first version has the most space available for patches like this, but it's the third release that DOS 3.3 users should prefer, since it (almost) fixes the notorious APPEND bug.

In order to have the benefits of both a working APPEND and fast loading, this patch assumes that my own APPEND patch, which is discussed in another article, has already been applied. As explained there, the APPEND patch leaves more free space in the three internal gaps than any of Apple's official DOS 3.3 releases, including more than enough space for a fast load patch. In particular, memory from $B692 through $B6FC is free, and the fast load patch will go there.

Of course there are drawbacks to this choice, in the form of software that assumes it can put its own code into DOS's internal gaps with impunity. The language-card version of Global Program Line Editor (GPLE), for example, overwrites memory from $B6B3 to $B6FC, so using it with this fast load patch is likely to be fatal. (Fortunately the 48K version of GPLE doesn't have this problem.)

Here's the routine that's responsible for DOS's slowness. This is the DOS file manager's "read a range of bytes" routine, which gets control whenever a request is made to read more than just a single byte.

2C96:              218 ;
2C96:              219 *   RNXBLK - READ NEXT BLOCK
2C96:              220 ;
2C96:        2C96  221 RNXBLK    EQU   *
2C96:20 B5 31      222           JSR   DTBLN          ; GO DECR LEN (NOT RTN IF=0)
2C99:20 A8 2C      223           JSR   GETBYT         ; GO GET BYTE
2C9C:48            224           PHA
2C9D:20 A2 31      225           JSR   MIBDA          ; GO MOVE BLOCK ADR AND INCR
2CA0:A0 00         226           LDY   #0
2CA2:68            227           PLA
2CA3:91 42         228           STA   (ZPGFCB),Y     ; SET DATA BYTE
2CA5:4C 96 2C      229           JMP   RNXBLK         ; GO FOR NEXT BYTE
2CA8:              230 ;
2CA8:              231 *   GETBYT - GET A DATA BYTE
2CA8:              232 ;
2CA8:        2CA8  233 GETBYT    EQU   *
2CA8:20 B6 30      234           JSR   LOCNXB         ; LOCATE NEXT BYTE
2CAB:B0 0B   2CB8  235           BCS   EOFIN          ; BR IF EOF
2CAD:B1 42         236           LDA   (ZPGFCB),Y     ; GET DAT BYTE
2CAF:48            237           PHA                  ; SAVE IT
2CB0:20 5B 31      238           JSR   INCRRB         ; INCR REC BYTE
2CB3:20 94 31      239           JSR   INCSCB         ; INCR SEC BYTE
2CB6:68            240           PLA                  ; GET SAVED BYTE
2CB7:60            241           RTS                  ; RETURN
2CB8:              242 ;
2CB8:4C 6F 33      243 EOFIN     JMP   ERROR5         ; GO TO EOF RTN

This just reads one byte, repeatedly, into the target memory, until all the requested bytes have been read. This is the routine that we need to patch, so that it recognizes when a whole sector can be read directly into the target memory.

So under exactly what conditions can a whole sector be read directly into the target memory without all the one-byte-at-a-time copying?

The current file position must be an exact multiple of 256.
The number of bytes remaining to be read must be at least 256.

But that's not all. DOS maintains two current file position counters, the record/byte-within-record counter (updated by the JSR INCRRB above), and the sector/byte-within-sector counter (updated by JSR INCSCB). For figuring out whether we're on a sector boundary the sector/byte-within-sector counter is the only one needed, but for maximum friendliness to programs the record/byte-within-record counter should also be maintained accurately. But given an arbitrary record length, finding the new record/byte-within-record position after reading a whole sector requires a full long division, which would use up more memory than we'd like.

So this patch adds a third condition:

The record length must be 1.

Since both the LOAD and BLOAD commands set the record length to 1, this accounts for all of the multi-sector loads that DOS does—only file manager calls made by programs that explicitly set the record length to something other than 1 will miss out, and they won't miss out fatally—they'll just get the usual slow read.
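The following Python sketch illustrates why a record length of 1 is so convenient. It is only a model of the bookkeeping, with a hypothetical function name (advance), and assumes the record counter and byte-within-record counter behave as described above.

```python
def advance(record, byte_in_record, record_length, count=256):
    """Advance a (record, byte-within-record) position by `count` bytes."""
    absolute = record * record_length + byte_in_record + count
    # In general this needs a real division, which is what the patch avoids
    return divmod(absolute, record_length)

print(advance(3, 10, 100))  # arbitrary record length: division required
print(advance(5, 0, 1))     # record length 1: record number simply grows by 256
```

With a record length of 1 the byte-within-record is always 0 and the record number is just the file position, so adding 256 amounts to incrementing the high byte of the record number.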

The JSR DTBLN above tests whether or not the number of bytes remaining to be loaded is zero; if so it exits the file manager, otherwise it subtracts one from the number of bytes and returns to its caller. This turns out to be a good call to replace with our patch, which will look like this (my apologies for the GOTO's, but the crossed conditions make nice structured indentation difficult):

     IF the record length is not 1, OR the byte-within-sector is not 0, THEN GOTO p2
p1:  IF the number of bytes remaining to be read is at least 256, THEN GOTO p3
p2:  JMP DTBLN     (This is what would happen normally without the patch) 
p3:  Read the next sector directly into target memory
     Add 256 to the target address
     Add 1 to the sector-within-file counter
     Add 256 to the record-within-file counter
     Subtract 256 from the number of bytes remaining to be read
     GOTO p1

Then all the fast loading is handled by the patch, and if the conditions are not satisfied, the rest of the read loop takes over none the wiser.
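For readers who find the GOTOs hard to follow, here is the same control flow as a structured Python sketch. It only models the counters. The dictionary keys and the function name are hypothetical, and the actual sector read is left as a comment.

```python
def fast_read(state):
    """Model of the pseudocode above. `state` holds the counters it touches.
    Returns the number of sectors that took the fast path."""
    fast = 0
    if state["record_length"] != 1 or state["byte_in_sector"] != 0:
        return fast                       # p2: straight to DTBLN
    while state["remaining"] >= 256:      # p1
        # p3: the real patch reads a sector to state["target"] here
        state["target"] += 256
        state["sector_in_file"] += 1
        state["record"] += 256
        state["remaining"] -= 256
        fast += 1
    return fast                           # p2: DTBLN handles whatever is left

state = dict(record_length=1, byte_in_sector=0, remaining=1000,
             target=0x2000, sector_in_file=0, record=0)
print(fast_read(state), state["remaining"], hex(state["target"]))
```

In this example three sectors take the fast path and the final 232 bytes are left for the normal byte-at-a-time loop.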

The fun part is the "Read the next sector directly into target memory" part. DOS has a routine that almost does what we need—the JSR LOCNXB above. But it has two drawbacks, the first of which is that it always loads the sector into the data sector buffer, not the target memory. So when fast-loading, we need to patch it before calling it, and unpatch it after calling it.

But this leads directly to its second drawback: If the previous file access was a write, it first writes the current sector to the disk. If chaos is not to ensue in that case, the sector that it writes must be the data sector buffer, not the target memory sector. Thus we must make sure the write happens before patching LOCNXB.

Taking care of the write is easy: the routine WRSECT at $AF1D tests whether the current data sector buffer needs to be written, and writes it if so. It's normally called from within LOCNXB, but if we call it ourselves beforehand then LOCNXB will see that it's already been done and doesn't need to be done again.

So now the patch looks like this:

     IF the record length is not 1, OR the byte-within-sector is not 0, THEN GOTO p2
p1:  IF the number of bytes remaining to be read is at least 256, THEN GOTO p3
p2:  JMP DTBLN     (This is what would happen normally without the patch) 
p3:  JSR WRSECT ($AF1D)
     Patch LOCNXB to read into target memory
     JSR LOCNXB ($B0B6)
     Unpatch LOCNXB
     Add 256 to the target address
     Add 1 to the sector-within-file counter
     Add 256 to the record-within-file counter
     Subtract 256 from the number of bytes remaining to be read
     GOTO p1

How about the patching and unpatching? There's a routine at $AFE4 that's called by LOCNXB to select the data sector buffer as the destination for the next sector read:

2FE4:              223 ;
2FE4:              224 ;MVSBA - MOVE SECTOR BUFFER ADR FOR I/O
2FE4:              225 ;
2FE4:        2FE4  226 MVSBA     EQU   *
2FE4:AC CB 35      227           LDY   CFCBSB         ; GET SECTOR BUFF ADR
2FE7:AD CC 35      228           LDA   CFCBSB+1
2FEA:8C F0 37      229 MSB1      STY   IBBUFP         ; SET IOB SECTOR
2FED:8D F1 37      230           STA   IBBUFP+1       ; BUFF PTR
2FF0:AE D6 35      231           LDX   DCBTRK         ; GET TRACK
2FF3:AC D7 35      232           LDY   DCBSEC         ; GET SECTOR
2FF6:60            233           RTS                  ; RTN

If we alter this to select the target memory pointer at $B5C3 and $B5C4 instead of the data sector buffer pointer, then LOCNXB will do exactly what we want. So here's what the patcher needs to look like:

UNPATCH   LDX   #$CB
PATCH     STX   $AFE5
          INX
          STX   $AFE8
          RTS

Then to install the patch we say,

          LDX   #$C3
          JSR   PATCH

And to remove the patch, just JSR UNPATCH.
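Here is a small Python model of what PATCH and UNPATCH do to the two instructions at the start of MVSBA. The bytearray is just a stand-in for the Apple II's memory, and the six bytes stored at $AFE4 are the listing's LDY/LDA pair as they would appear in a relocated 48K DOS (an assumption based on the $8000 offset mentioned earlier).

```python
memory = bytearray(0x10000)  # stand-in for 64K of Apple II memory
# LDY $B5CB / LDA $B5CC, the first two instructions of MVSBA
memory[0xAFE4:0xAFEA] = bytes.fromhex("AC CB B5 AD CC B5")

def patch(low=0xC3):
    """Point MVSBA's two loads at the pointer whose low byte is `low`."""
    memory[0xAFE5] = low                 # STX $AFE5
    memory[0xAFE8] = (low + 1) & 0xFF    # INX / STX $AFE8

def unpatch():
    """Restore the data sector buffer pointer ($B5CB/$B5CC)."""
    patch(0xCB)

patch()
print(memory[0xAFE4:0xAFEA].hex(" ").upper())   # loads now use $B5C3/$B5C4
unpatch()
print(memory[0xAFE4:0xAFEA].hex(" ").upper())   # back to $B5CB/$B5CC
```

Only the low bytes of the two operands change, which is why a single LDX and two stores are enough.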

All we need now to convert the pseudocode to assembly language is a few memory locations.

The file record length (from the OPEN call) is at $B5E8 and $B5E9.
The current byte-within-sector is at $B5E6.
The number of bytes remaining to be read is at $B5C1 and $B5C2.
The pointer to the target memory is at $B5C3 and $B5C4.
The current sector-within-file is at $B5E4 and $B5E5.
The current record number is at $B5EA and $B5EB.

So here's the full fast read routine:

FASTREAD  LDX   $B5E8          ;Record len lo
          DEX
          TXA
          ORA   $B5E9          ;Record len hi
          ORA   $B5E6          ;Byte in sector
          BNE   P2
          LDA   $B5C2          ;Range length hi
P1        BNE   P3
P2        JMP   $B1B5          ;DTBLN
P3        JSR   $AF1D          ;WRSECT
          LDX   #$C3
          JSR   PATCH
          JSR   $B0B6          ;LOCNXB
          JSR   UNPATCH
          BCS   P2             ;Back to DTBLN if LOCNXB ran out of sectors
          INC   $B5C4          ;Target address hi
          INC   $B5E4          ;Sector within file lo
          BNE   P4
          INC   $B5E5          ;Sector within file hi
P4        INC   $B5EB          ;Record number hi
          DEC   $B5C2          ;Range length hi
          BCC   P1             ;(Always taken)
;
UNPATCH   LDX   #$CB
PATCH     STX   $AFE5
          INX
          STX   $AFE8
          RTS

To activate this, replace the JSR DTBLN at $AC96 with a JSR FASTREAD.
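The first six instructions of FASTREAD fold three comparisons into a single zero test, which may not be obvious at a glance. This Python sketch models that test with 8-bit arithmetic; the function name is hypothetical, and it is only meant to show that the accumulator ends up zero exactly when the record length is 1 and the byte-within-sector is 0.

```python
def takes_slow_path(rec_len_lo, rec_len_hi, byte_in_sector):
    """Return True when BNE P2 would be taken (accumulator nonzero)."""
    x = (rec_len_lo - 1) & 0xFF            # LDX $B5E8 / DEX, wrapping at 8 bits
    a = x | rec_len_hi | byte_in_sector    # TXA / ORA $B5E9 / ORA $B5E6
    return a != 0

print(takes_slow_path(1, 0, 0))   # record length 1, on a sector boundary
print(takes_slow_path(0, 1, 0))   # record length 256
print(takes_slow_path(1, 0, 4))   # not on a sector boundary
```

Only the first call reports the fast path; a record length of 256 is caught both by the nonzero high byte and by the low byte wrapping around to $FF.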

But we're not quite done yet. If an I/O error occurs in LOCNXB, it exits the file manager instead of returning. If that happens during our routine, then LOCNXB will be left in its patched state, causing chaos for anything else that calls it. So we need to intercept the error exit from the file manager, and make sure that LOCNXB always gets unpatched.

When LOCNXB gets a disk error, the flow of control ultimately ends up here to translate the RWTS error code into a DOS error code:

30A1:AD F5 37      108 BADIO     LDA   IBSTAT         ; GET STATUS
30A4:A0 07         109           LDY   #CREVMM
30A6:C9 20         110           CMP   #IBVMME        ; WAS IT VOLUME MISMATCH
30A8:F0 08   30B2  111           BEQ   BD2            ; BR IF YES
30AA:A0 04         112           LDY   #CREPRO
30AC:C9 10         113           CMP   #IBWPER
30AE:F0 02   30B2  114           BEQ   BD2
30B0:A0 08         115           LDY   #CREIOE
30B2:98            116 BD2       TYA
30B3:4C 85 33      117           JMP   ERRORB         ; GO RTN

That last instruction is the one we need to intercept. To the above patch, add these two lines:

          JSR   UNPATCH
          JMP   $B385

And then replace the JMP ERRORB at $B0B3 with a JMP to the first of these two lines (the JSR UNPATCH).

Summary of Patch

Here's the full patch, in hex, for a 48K DOS (remember, the APPEND patch must be applied first, to free up needed memory):

B692:AE E8 B5 CA 8A 0D E9 B5 0D E6 B5 D0 05 AD           (Fast read patch)
B6A0:C2 B5 D0 03 4C B5 B1 20 1D AF A2 C3 20 CC B6 20
B6B0:B6 B0 20 CA B6 B0 ED EE C4 B5 EE E4 B5 D0 03 EE
B6C0:E5 B5 EE EB B5 CE C2 B5 90 D8 A2 CB 8E E5 AF E8
B6D0:8E E8 AF 60 20 CA B6 4C 85 B3

AC97:92 B6                                               (Hook into DOS)

B0B4:D4 B6                                               (Hook into error exit)
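If you'd rather check the hex than trust your typing, the following Python sketch parses monitor-style "address:bytes" lines and stores them into a blank 64K bytearray. The function name apply_listing and the blank memory image are assumptions for illustration. It does not touch a real disk or DOS image.

```python
PATCH_LISTING = """\
B692:AE E8 B5 CA 8A 0D E9 B5 0D E6 B5 D0 05 AD
B6A0:C2 B5 D0 03 4C B5 B1 20 1D AF A2 C3 20 CC B6 20
B6B0:B6 B0 20 CA B6 B0 ED EE C4 B5 EE E4 B5 D0 03 EE
B6C0:E5 B5 EE EB B5 CE C2 B5 90 D8 A2 CB 8E E5 AF E8
B6D0:8E E8 AF 60 20 CA B6 4C 85 B3
AC97:92 B6
B0B4:D4 B6
"""

def apply_listing(memory, listing):
    """Store each 'ADDR:bytes' line into `memory`; return total bytes written."""
    written = 0
    for line in listing.splitlines():
        if not line.strip():
            continue
        address, _, data = line.partition(":")
        payload = bytes.fromhex(data)      # whitespace between bytes is ignored
        start = int(address, 16)
        memory[start:start + len(payload)] = payload
        written += len(payload)
    return written

memory = bytearray(0x10000)   # blank stand-in, not a real DOS image
print(apply_listing(memory, PATCH_LISTING))
print(f"patch ends at ${0xB692 + 72 - 1:04X}")
```

The main patch is 72 bytes, so it ends at $B6D9, comfortably inside the free area that runs through $B6FC.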

Downloads

Here's a DOS 3.3 disk with software on it that applies the fast load patch. The APPEND patch has already been applied to the DOS on this disk.

FastLoad.sdk (Shrinkit DOS 3.3 disk, 11697 bytes)
FastLoad.zip (ZIPped disk image, 9202 bytes)

The disk image contains two programs, one which patches the currently-running DOS 3.3 in memory, and one which patches the DOS image on a disk.

Remember, there are a few programs that aren't compatible with this patch, including the language card version of GPLE. If you want to use GPLE with this patch, use the 48K version.

Original: December 27, 2022