TECHNICAL FIELD
The invention relates generally to the field of computers and, more particularly, to a method and apparatus for aligning data flowing between two or more buses in a computer to increase memory read access bandwidth.
BACKGROUND
Computers are continually progressing in several key areas, including speed of operation and peripheral device support. However, progress in some areas can often impede progress in other areas. For example, many conventional computers utilize a cache memory system for storing duplicate copies of one or more portions of a computer's main memory. The cache thereby allows a device such as a central processing unit ("CPU") to make multiple, quick accesses to a localized area of the main memory. Typically, a cache has faster access times for read and write operations performed by the CPU than the main memory does, but the cache is more expensive than the main memory. Therefore, a balance must be struck in sizing the cache relative to the main memory of the computer.
The cache is typically organized in cache lines, which are groupings of data words. For example, a cache line may consist of sixteen data words. Also, read and write operations to the cache are performed on one entire cache line at a time. Read operations are fairly straightforward, but write operations can present several difficulties. One difficulty is what to do about stale main memory. Main memory becomes stale when a write operation has been performed on the cache so that the cache no longer duplicates, or is no longer coherent with, the corresponding portion of main memory. To resolve this difficulty, several types of caches have been commonly implemented, two types being "write back" and "write through". Each of the two cache types has benefits and drawbacks well known by those of ordinary skill in the art.
In addition to the CPU, the cache, and the main memory, a computer supports a variety of peripheral devices. The peripheral devices are typically connected to each other through a peripheral bus and interface with the CPU, cache, and main memory through a bridge device. Due to the operation of the peripheral devices, implementation of a write back cache presents additional difficulties, because the peripheral devices request read operations to portions of the computer's main memory that are frequently stale. As a result, when a peripheral device has control of the peripheral bus and begins to perform a memory read operation, i.e., when the peripheral device becomes a bus mastering agent, it must first be determined whether the requested portion of the main memory needs to be updated from the cache. Therefore, a snoop operation is typically employed to determine the state, in the cache, of the requested portion of main memory.
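The relationship between a write back cache, stale main memory, and the snoop can be illustrated with a minimal Python sketch. It assumes a word-addressed main memory held in a list and a single write back cache; the class and method names (`WriteBackCache`, `cpu_write`, `snoop`, `write_back`) are hypothetical and do not describe any particular cache controller.

```python
from dataclasses import dataclass, field


@dataclass
class WriteBackCache:
    """Toy write back cache keyed by cache line index (hypothetical model)."""
    line_words: int
    lines: dict = field(default_factory=dict)  # line index -> list of data words
    dirty: set = field(default_factory=set)    # lines newer than main memory

    def cpu_write(self, word_addr: int, value: int, memory: list) -> None:
        line = word_addr // self.line_words
        if line not in self.lines:
            base = line * self.line_words
            self.lines[line] = memory[base:base + self.line_words]  # line fill
        self.lines[line][word_addr % self.line_words] = value
        self.dirty.add(line)  # write back policy: main memory is now stale

    def snoop(self, word_addr: int) -> bool:
        """True if a bus master's read of word_addr needs a write back first."""
        return (word_addr // self.line_words) in self.dirty

    def write_back(self, line: int, memory: list) -> None:
        base = line * self.line_words
        memory[base:base + self.line_words] = self.lines[line]
        self.dirty.discard(line)  # memory and cache are coherent again
```

After `cache.cpu_write(3, 42, memory)` on a two-word-line cache, `memory[3]` still holds the old value and `cache.snoop(3)` is `True`; only after `cache.write_back(1, memory)` is the bus master's read safe to serve from main memory.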
Since both the CPU and the peripheral devices access the computer's main memory and the cache, the slave device that services these accesses should support both quick operation of the CPU and adequate service to the peripheral devices. To provide such support, the slave device often utilizes one or more memory management techniques. For example, the slave device may utilize look-ahead, or "speculative", techniques for increasing the bandwidth, or rate of data transfer, from memory to the CPU or the bus mastering agent. A variety of speculative techniques are well known in the art.
Despite the improvements provided by the use of such speculative techniques, there are certain instances when a single bus mastering agent attempts to monopolize the bus. One such instance is when the bus mastering agent requests a misaligned memory read operation. A misaligned memory read operation is a memory access to a location that does not begin at the beginning of a cache line. In the example above, where one cache line is sixteen data words long, a misaligned memory read operation may attempt to read data beginning with the fourth data word of the cache line.
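The alignment test described above reduces to simple modular arithmetic. The following Python sketch is illustrative only: it assumes addresses are counted in data words (real buses typically use byte addresses), and the helper names are hypothetical.

```python
LINE_WORDS = 16  # cache line size used in the example above, in data words


def line_offset(word_addr: int, line_words: int = LINE_WORDS) -> int:
    """Position of a word within its cache line; 0 means the access is aligned."""
    return word_addr % line_words


def is_aligned(word_addr: int, line_words: int = LINE_WORDS) -> bool:
    return line_offset(word_addr, line_words) == 0


def lines_touched(start: int, count: int, line_words: int = LINE_WORDS) -> list:
    """Indices of every cache line covered by a read of `count` words from `start`."""
    first = start // line_words
    last = (start + count - 1) // line_words
    return list(range(first, last + 1))
```

Starting at the fourth data word (offset 3), a sixteen-word read spans two cache lines, whereas the same read starting at offset 0 fits in one: `lines_touched(3, 16)` returns `[0, 1]` and `lines_touched(0, 16)` returns `[0]`.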
To prevent a bus mastering agent from monopolizing the bus, the slave may utilize a monitored latency period. A monitored latency period is a limit on how long a single bus mastering agent can own or control the bus. Once that limit has been reached, the bus mastering agent is preempted, thereby terminating its ownership of the bus. While this technique prevents monopolization of the bus by a single bus mastering agent, it sometimes has the overall effect of slowing down the computer, because the bus mastering agent must again arbitrate for ownership of the bus to finish its request. Furthermore, when speculative read techniques are employed in conjunction with the monitored latency period, the benefits of the speculative read are somewhat diminished in situations where the data is cached but the bus mastering agent is preempted.
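A monitored latency period can be modeled as a clock counter that is checked before each transfer. The Python sketch below is a simplification under stated assumptions: the per-word clock costs are hypothetical inputs, and `LatencyTimer` and `transfer` are illustrative names, not part of the described hardware.

```python
class LatencyTimer:
    """Counts bus clocks consumed by the current bus mastering agent."""

    def __init__(self, limit_clocks: int) -> None:
        self.limit = limit_clocks
        self.elapsed = 0

    def tick(self, clocks: int = 1) -> None:
        self.elapsed += clocks

    def expired(self) -> bool:
        return self.elapsed >= self.limit


def transfer(words_needed: int, limit_clocks: int, clocks_per_word: list) -> tuple:
    """Move words one at a time until done or the latency period expires.

    clocks_per_word[i] is the (hypothetical) number of clocks word i costs.
    Returns (words_transferred, preempted).
    """
    timer = LatencyTimer(limit_clocks)
    done = 0
    while done < words_needed:
        if timer.expired():
            return done, True  # master loses the bus and must re-arbitrate
        timer.tick(clocks_per_word[done])
        done += 1
    return done, False
```

With a 10-clock limit, `transfer(4, 10, [6, 1, 1, 1])` returns `(4, False)`, but `transfer(4, 10, [6, 6, 1, 1])` returns `(2, True)`: the same four-word request is cut short only because the early words were slow, which is the situation a misaligned, write-back-heavy read tends to create.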
SUMMARY
In carrying out principles of the present invention, one embodiment thereof provides an aligning buffer for supporting memory read operations where speculative techniques are employed in a computer. The aligning buffer may reside between a local bus and a peripheral bus of the computer. The peripheral bus, which may be a peripheral component interconnect ("PCI") bus, connects one or more bus-mastering peripheral devices and sends addresses and data back and forth between the peripheral devices.
The aligning buffer is part of a bridge device connected between and selectively allocating control of the two buses. The bridge device allows addresses and data to flow back and forth between the two buses, and also controls when certain peripheral devices connected to the peripheral bus may access the main memory. When a peripheral device requests a read operation from a stale portion of the main memory, the bridge device causes the cache to write back specific portions of data to the main memory so that the peripheral device can receive updated data. Furthermore, the bridge device utilizes an access-dependent latency period for restricting the amount of time in which the peripheral device may have ownership of the bus. The specific portions of data coming from the cache are written back to memory as a cache line.
While the specific portions of data from the cache are being written to the main memory, the bridge device also provides the data to the requesting peripheral device. The method and apparatus described herein are particularly advantageous in instances when the peripheral device initiates a memory access with a misaligned address. In such instances, several cache lines of data are written back to the main memory while simultaneously being supplied to the peripheral device. However, by the time the last cache line of data is being written back to the main memory, the peripheral device has either terminated the cycle because its own buffers are full, or has been preempted due to latency constraints. The data from the last cache line is, however, stored in the aligning buffer. Therefore, when the peripheral device requests another memory access to complete the preempted read operation, the aligning buffer can quickly supply the data to the peripheral device and pass the next address for a prefetch operation to other blocks in the bridge device.
A technical advantage is that the overall bandwidth and speed of operation of the computer are increased.
Another technical advantage is that the peripheral device does not have to access main memory to complete a terminated read operation.
Another technical advantage is that after a first misaligned memory access that would have caused a preemption of the peripheral device without the aligning buffer, a subsequent memory access by the peripheral device will likely be an aligned memory access.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a diagram of a computer including two cache memories, a main memory, a plurality of peripheral devices, and a bridge embodying features of one embodiment of the present invention.
FIG. 2 is a block diagram of the bridge of FIG. 1.
FIG. 3 is a diagram of three cache lines from one of the caches, a portion of the bridge, and an internal buffer from one of the peripheral devices, all of FIG. 1.
FIG. 4 is a diagram of the cache line buffer of the bridge of FIG. 1.
FIGS. 5a and 5b are a flowchart representing an operation of the embodiment of FIG. 1.
FIG. 6 is a state diagram representing an operation of the cache line buffer of FIG. 4.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
Referring to FIG. 1, reference numeral 10 generally designates a computer embodying features of one embodiment of the present invention. The computer 10 includes a CPU 12 connected to a local bus 14. The CPU 12 has an internal, first level cache ("L1 cache") 16, but also utilizes an external, second level cache ("L2 cache") 18 connected to the local bus 14. Both caches 16, 18 are write back type caches. Other devices are also connected to the local bus 14, including a bridge 20. The bridge 20 performs several functions, such as controlling access to a main memory 22 of dynamic random access memories ("DRAMs") through a memory bus 23.
The bridge 20 also interfaces the local bus 14 with a peripheral bus 24. In the preferred embodiment, the peripheral bus 24 is a peripheral component interconnect ("PCI") bus, as defined by Intel Corp. of Santa Clara, Calif. The PCI bus 24 supports many different peripheral devices, including, for example, a hard drive ("H/D") 26, a local area network interface ("LAN") 28, and a secondary bridge 30 for further supporting additional buses such as an industry standard architecture ("ISA") bus 32.
Referring to FIG. 2, the bridge 20 contains several sub-components interconnected by one or more of the three buses 14, 23, 24 or by other buses not shown. As mentioned above, the bridge 20 includes a bus-specific controller 42 for controlling access to the main memory 22. The controller 42 also controls other functions, including the transfer of data between the buses 14, 23, 24. It is understood that the controller 42 may represent one or more different devices, the basic functions thereof being well understood by those of ordinary skill in the art.
In the preferred embodiment, a single cache line consists of sixteen data words, but to simplify the following description, a cache line of two data words will be used hereafter. Furthermore, corresponding buffer sizes and groups of data words in the following description are also reduced for ease of description, it being understood that different sizes of cache lines and buffers are easily extrapolated by those of ordinary skill in the art. The bridge 20 also includes buffering arrangements 46 used for temporary storage and an aligning cache line buffer 48.
Assume, for example, that the LAN 28 has become the bus mastering agent and is requesting to fill an internal four-data-word buffer with four data words from the main memory 22, such data words also residing in the L2 cache 18. A cache controller (not shown) first determines whether the data words in the accessed portion of the main memory 22 are stale. In the present example, the data words are stale, so the cache controller then initiates a write back operation from the L2 cache 18. Simultaneously with supplying the LAN 28 with the requested data words, the bridge 20 utilizes the buffering arrangements 46 and the memory controller to update the main memory 22.
In actuality, the L2 cache 18 operates by supplying one cache line at a time during the write back operation. Therefore, if the LAN 28 is requesting a read operation that is aligned to a cache line of the L2 cache 18, i.e., an aligned memory access, the read operation will typically be completed before the LAN 28 is preempted. However, there are instances when the LAN 28 will be preempted before completing the operation. For example, if the LAN 28 is requesting a read operation that is not aligned, i.e., a misaligned memory access, the read operation will typically not be completed before the LAN is preempted. In these instances, the LAN 28 must make another read request before its internal buffer is filled, as described in greater detail below.
Referring to FIG. 3, three cache lines of the L2 cache 18 are designated by the reference numerals 50a, 50b, and 50c. The first cache line 50a includes consecutive data words W1.1, W1.2, the second cache line 50b includes consecutive data words W2.1, W2.2, and the third cache line 50c includes consecutive data words W3.1, W3.2. The LAN 28 includes a four-data-word buffer 56 that is to be filled with data read from main memory 22 that also resides in the L2 cache 18. In this example, the LAN 28 requests a misaligned memory access, starting with the data word W1.2 of the first cache line 50a and ending with the data word W3.1 of the third cache line 50c. As stated above, all the data being requested by the LAN 28 is stale.
When the misaligned memory access is received by the bridge 20, the bridge first performs a write back operation on the entire first cache line 50a from the L2 cache 18 into the main memory 22, even though the first data word W1.1 is not needed. The bridge 20 also supplies the second data word W1.2 to the LAN 28, which copies it into its buffer 56. Afterwards, the bridge 20 performs a write back operation on the second cache line 50b from the L2 cache 18 into the main memory 22. Similarly, the LAN 28 copies both data words W2.1, W2.2 into its buffer 56. Since the LAN 28 requires one more data word W3.1 to fill its buffer 56, during the next write back cycle, which the bridge 20 performs for the third cache line 50c, the LAN 28 reads the data word W3.1 and terminates the memory access, effectively ending the LAN's ownership of the PCI bus 24. The write back cycle for the third cache line 50c completes regardless of the memory access ending on the PCI bus 24. The speculative reads will cause write back cycles on consecutive cache lines 50c and 50d. Latency, however, is based not on the number of transfers but on the number of clock cycles consumed by the master. As a result, the data word W3.2 will not be supplied due to latency issues, and the LAN 28 is preempted.
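The FIG. 3 walk-through can be traced with a small Python simulation. It is a sketch under simplifying assumptions: words are addressed by index (W1.1 is word 0), the master stops only when its buffer is full, and the speculative write back of later lines and the latency timing are not modeled. The function names are hypothetical.

```python
LINE_WORDS = 2   # simplified cache line used in the description
LAN_BUFFER = 4   # size of the LAN's internal buffer 56, in data words


def label(word_addr: int) -> str:
    """Name a word as in FIG. 3, e.g. word 1 -> 'W1.2' (both parts 1-based)."""
    return f"W{word_addr // LINE_WORDS + 1}.{word_addr % LINE_WORDS + 1}"


def line_name(line: int) -> str:
    """Reference numeral of a cache line: 0 -> '50a', 1 -> '50b', ..."""
    return "50" + chr(ord("a") + line)


def misaligned_read(start: int, wanted: int = LAN_BUFFER) -> tuple:
    """Write back whole cache lines, beginning with the line that holds `start`,
    and hand the requested words to the master until its buffer is full.

    Returns (lines written back, words delivered, words of the last line that
    were written back but never delivered).
    """
    if wanted < 1:
        raise ValueError("the master must request at least one word")
    written_back, delivered = [], []
    line = start // LINE_WORDS
    while len(delivered) < wanted:
        words = range(line * LINE_WORDS, (line + 1) * LINE_WORDS)
        written_back.append(line)          # whole line goes to main memory 22
        for w in words:
            if w >= start and len(delivered) < wanted:
                delivered.append(w)        # master copies it into buffer 56
        line += 1
    leftover = [w for w in words if w > delivered[-1]]
    return ([line_name(n) for n in written_back],
            [label(w) for w in delivered],
            [label(w) for w in leftover])
```

`misaligned_read(1)` (starting at W1.2) returns `(['50a', '50b', '50c'], ['W1.2', 'W2.1', 'W2.2', 'W3.1'], ['W3.2'])`: three full lines are written back, yet W3.2 is left undelivered. An aligned request, `misaligned_read(0)`, touches only two lines and leaves nothing behind.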
The cache line buffer 48 stores the last cache line 50c read from the L2 cache, which in this example includes the data words W3.1, W3.2. This is done in anticipation that the LAN 28 will re-arbitrate to become a bus mastering agent and continue reading from the address contiguous with its last memory access. When the LAN 28 successfully becomes a bus mastering agent again, it will request a misaligned memory access starting with the data word W3.2 of the third cache line 50c. Because a copy of the third cache line 50c resides in the cache line buffer 48, the data can quickly be returned to the LAN 28 before it is preempted. All subsequent memory accesses by the LAN 28 will be aligned accesses, e.g., the next memory access by the LAN 28 will be to the fourth cache line 50d, starting with data word W4.1.
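The realignment effect can be expressed as a short Python function. This is a sketch under the assumption that word addresses are indices (W3.2 is word 5 with two-word lines) and that the resumed request is served from the buffered line until that line is exhausted; `resume` is a hypothetical name.

```python
def resume(start: int, buffered_line: int, wanted: int, line_words: int = 2) -> tuple:
    """Serve a resumed read from a buffered cache line.

    Returns (word addresses served from the cache line buffer,
             next word address the master will request,
             whether that next address is cache-line aligned).
    """
    base = buffered_line * line_words
    if not base <= start < base + line_words:
        return [], start, start % line_words == 0  # miss: normal memory path
    served = list(range(start, min(base + line_words, start + wanted)))
    next_addr = start + len(served)
    return served, next_addr, next_addr % line_words == 0
```

`resume(5, 2, 4)` returns `([5], 6, True)`: W3.2 comes from the buffer and the following request, word 6 (W4.1), is aligned. The same holds for sixteen-word lines, e.g. `resume(20, 1, 100, line_words=16)` serves words 20 through 31 and leaves the next request at aligned word 32. If the master wants fewer words than remain in the line, the next address is not aligned, which is why the realignment is described as likely rather than guaranteed.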
Referring to FIG. 4, one implementation of the cache line buffer 48 includes an aligning address buffer 48a, a one-cache-line (two-data-word) aligning data buffer 48d, buffer logic 48c, and various control signals, including a latch enable signal LEN, an output enable signal OE, an advance signal ADV, and a match signal MTCH. The buffer logic 48c interfaces with the bus-specific controller 42. The address buffer 48a and the data buffer 48d interface with the buffering arrangements 46 and the PCI bus 24. During a cache write back operation, such as the one described above with reference to FIG. 3, the controller 42 informs the buffer logic 48c when the third and final cache line 50c is being written to the main memory 22. At this time, the buffer logic 48c activates the advance signal ADV to store the address of the cache line 50c in the address buffer 48a. Simultaneously, the buffer logic 48c also activates the latch enable signal LEN to store the two data words W3.1, W3.2 of the cache line in the data buffer 48d. As a result, the cache line 50c is both written to the main memory 22 and stored in the cache line buffer 48.
Still using the example above, when the LAN 28 becomes the bus mastering agent again and begins its next memory access starting with data word W3.1, it drives a corresponding address on the PCI bus 24. The corresponding address is compared to the address stored in the address buffer 48a. Since the two addresses are the same, the address buffer 48a asserts the match signal MTCH to inform the buffer logic 48c that the data buffer 48d already has the data being requested by the LAN 28. The buffer logic 48c informs the controller 42 so that it performs no additional work on the memory access. Then, the buffer logic 48c asserts the output enable signal OE to allow the data buffer 48d to drive the data, which includes data words W3.1 and W3.2. The LAN 28 may then proceed to its next memory access request.
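The capture and match behavior of the cache line buffer 48 can be approximated in software. The Python class below is a sketch, not the FIG. 4 circuit: it assumes word-index addresses and assumes the MTCH comparison is made at cache-line granularity, so any word inside the buffered line hits. Method names are hypothetical stand-ins for the ADV/LEN, MTCH, and OE signals, and `invalidate` corresponds to the write-cycle path of Table 1.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CacheLineBuffer:
    """Sketch of cache line buffer 48: address buffer 48a plus data buffer 48d."""
    line_words: int = 2
    address: Optional[int] = None        # line-aligned word address (48a)
    data: Optional[List[str]] = None     # one cache line of data words (48d)

    def capture(self, line_addr: int, words: List[str]) -> None:
        """ADV latches the address and LEN latches the data of the final line."""
        if line_addr % self.line_words or len(words) != self.line_words:
            raise ValueError("expected one whole, aligned cache line")
        self.address, self.data = line_addr, list(words)

    def match(self, word_addr: int) -> bool:
        """MTCH: the requested word lies inside the buffered cache line."""
        return (self.address is not None
                and self.address <= word_addr < self.address + self.line_words)

    def drive(self, word_addr: int) -> Optional[List[str]]:
        """OE: return the buffered words from word_addr to the end of the line,
        or None on a miss so the controller services the access normally."""
        if not self.match(word_addr):
            return None
        return self.data[word_addr - self.address:]

    def invalidate(self) -> None:
        self.address = None
        self.data = None
```

After `buf.capture(4, ["W3.1", "W3.2"])`, a request at word 4 is driven `["W3.1", "W3.2"]`, a request at word 5 is driven `["W3.2"]`, and a request at word 6 misses and returns `None`.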
Referring to FIGS. 5a and 5b, reference numeral 100 designates a flow sequence illustrating operational steps used by the bridge 20 when supporting accesses to the main memory 22 by one or more of the peripheral devices 26, 28, 30. Execution begins at step 102, where the peripheral device that has become a bus mastering agent provides an initial address location for a read operation on the main memory 22. Also, a counter (not shown) inside the controller 42 is started, the counter being used to check a monitored latency period ("MLP") discussed in greater detail below. Execution then proceeds to step 104, where a determination is made as to whether the requested address location references data that is stored inside the cache line buffer 48. If the data is stored inside the cache line buffer 48, execution proceeds to step 106, where the data stored in the cache line buffer is supplied to the bus mastering agent. Execution then proceeds to step 108, where the counter is checked to determine whether the MLP has expired and the bus mastering agent must therefore be preempted, as discussed in greater detail below.
If at step 104 it is determined that the data corresponding to the requested address location is not stored inside the cache line buffer 48, execution proceeds to step 110. At step 110, a determination is made as to whether the data stored in the main memory 22 at the requested address is stale. If the data is not stale, execution proceeds to step 112, where the data stored in the main memory 22 at the requested address is supplied to the bus mastering agent. Execution then proceeds to step 108, where the counter is checked to determine whether the MLP has expired. If the MLP has expired, execution proceeds to step 114, where the bus mastering agent is preempted and the flow sequence ends. If the MLP has not expired, execution proceeds to step 116, where the next requested address is retrieved from the bus mastering agent. Execution then proceeds to step 110.
If at step 110 it is determined that the data stored in the main memory 22 at the requested address is stale, execution proceeds to step 118, where a determination is made as to whether the data stored in the L2 cache 18 that corresponds to the requested address is stale. If the data is not stale, execution proceeds to step 120, where a burst read operation is initiated on the L2 cache 18 beginning with the first cache line that includes the data from the requested address. Execution then proceeds to step 122, discussed in greater detail below.
If at step 118 it is determined that the data stored in the L2 cache 18 is stale, execution proceeds to step 124, where a burst read operation is initiated on the L1 cache 16 beginning with the first cache line that includes the data from the requested address. Execution then proceeds to step 126, where the data retrieved from the L1 cache 16 is written back to the L2 cache 18. Execution then proceeds to step 122.
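The source-selection branch of steps 110 through 126 can be summarized as a small decision function. The Python sketch below only restates the flowchart branches as lists of actions; `plan_read` and the action strings are illustrative, not part of the described apparatus.

```python
def plan_read(memory_stale: bool, l2_stale: bool) -> list:
    """Ordered actions for one requested address that missed the cache line buffer."""
    if not memory_stale:                       # step 110: main memory is current
        return ["supply from main memory 22"]  # step 112
    if not l2_stale:                           # step 118: L2 copy is current
        return ["burst read L2 cache 18",      # step 120
                "stage line in buffering arrangements 46"]  # step 122
    return ["burst read L1 cache 16",          # step 124
            "write back to L2 cache 18",       # step 126
            "stage line in buffering arrangements 46"]      # step 122
```

Note that the L2 staleness flag is consulted only when main memory is stale, mirroring the order of the decisions at steps 110 and 118.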
At step 122, one cache line of the data retrieved from either the L1 cache 16 or the L2 cache 18 is stored inside the buffering arrangements 46. At step 128, the data stored inside the buffering arrangements 46 is written back to the main memory 22. At step 130, which occurs simultaneously with step 128, the data stored inside the buffering arrangements 46, along with the requested address, is stored in the cache line buffer 48. At step 132, the counter is checked to determine whether the MLP has expired. If the MLP has expired, execution proceeds to step 114, where the bus mastering agent is preempted and the flow sequence ends.
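Because every staged line is latched into the cache line buffer 48 as it is written to main memory, the buffer always ends up holding the most recent line, which is the last line written before the master stops. The Python sketch below illustrates that consequence under simplifying assumptions: word-index addresses, a list standing in for main memory, and the MLP check of step 132 omitted. `write_back_lines` is a hypothetical name.

```python
def write_back_lines(lines: list, memory: list, line_words: int) -> tuple:
    """Steps 122-130 for each (line_addr, words) pair, MLP check omitted.

    Each line is staged, written to main memory and latched, with its address,
    into the cache line buffer; every capture replaces the previous one.
    Returns the final contents of the cache line buffer as (line_addr, words).
    """
    cache_line_buffer = None
    for line_addr, words in lines:
        staged = list(words)                               # step 122
        memory[line_addr:line_addr + line_words] = staged  # step 128
        cache_line_buffer = (line_addr, staged)            # step 130, same time
    return cache_line_buffer
```

Writing back the three two-word lines of FIG. 3 leaves all six words in main memory and only the third line, at word address 4, in the cache line buffer.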
If at step 132 the MLP has not expired, execution proceeds to step 136, where data stored inside the buffering arrangements 46 is supplied to the bus mastering agent. At step 138, the next requested address is retrieved from the bus mastering agent. At step 140, a determination is made as to whether the retrieved address follows the predetermined burst sequence. If so, execution returns to step 122. If the retrieved address does not follow the predetermined burst sequence, execution returns to step 110.
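The loop formed by steps 132 through 140 can be sketched in Python as follows. The description does not define the predetermined burst sequence or the clock cost of a transfer, so this sketch assumes a linear, incrementing burst of 4-byte words and a fixed cost per transfer, and it treats each requested address as one transfer rather than one cache line. `burst_loop` and its parameters are hypothetical.

```python
def burst_loop(requests: list, mlp_limit: int,
               clocks_per_transfer: int = 1, word_bytes: int = 4) -> tuple:
    """Steps 132-140 for a stream of requested byte addresses.

    Assumes a linear (incrementing) burst order and a fixed clock cost per
    transfer; both are illustrative, not taken from the description.
    Returns (addresses supplied, reason the loop ended).
    """
    supplied, clocks, prev = [], 0, None
    for addr in requests:                                  # step 138
        if prev is not None and addr != prev + word_bytes:
            return supplied, "out of burst sequence: back to step 110"  # step 140
        if clocks >= mlp_limit:
            return supplied, "MLP expired: preempted at step 114"       # step 132
        supplied.append(addr)                              # step 136
        clocks += clocks_per_transfer
        prev = addr
    return supplied, "master ended the cycle"
```

With `mlp_limit=2`, the request stream `[0, 4, 8, 12]` is cut off after two transfers; with a generous limit, a jump from address 4 to 100 instead sends the flow back to step 110.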
As a result, a substantial savings in bus bandwidth for both the local bus 14 and the PCI bus 24 is generated by the preferred embodiment described herein due to the time savings provided by the cache line buffer 48. Furthermore, the preferred embodiment also realigns the LAN 28 after its first misaligned memory access by only retrieving and storing complete cache lines of data. Therefore, the next memory access from the LAN 28 will likely be an aligned memory access and will not be preempted before completion.
Referring to FIG. 6, reference numeral 200 designates a state diagram illustrating operational steps and functional states for the cache line buffer 48 of FIG. 4. To best describe the state diagram 200, Table 1 below describes all the potential state transitions from a present state ("PS") to a next state ("NS") and the conditions required for each transition.
TABLE 1
______________________________________
PS   NS   Condition
______________________________________
A    A    1. The cache line buffer 48 is in an idle state; OR
          2. For all input/output bus cycles.
A    B    3. A read cycle on the PCI bus 24 is initiated to an address in the address buffer 48a; AND
          4. A snoop to the caches 16, 18 did not require a write back operation.
B    C    5. Inform the controller 42 to prevent additional work; AND
          6. Pass a prefetch address for the next memory access.
C    A    7. Data is supplied to the PCI bus 24; AND
          8. Byte enables are asserted on the PCI bus to allow the bus master to access the data.
A    D    9. A write cycle on the PCI bus 24 is initiated to an address in the address buffer 48a.
D    A    10. The address in the address buffer 48a is invalidated.
A    E    11. A read cycle on the PCI bus 24 is initiated to an address that is not in the address buffer 48a; OR
          12. A snoop to the caches 16, 18 does require a write back operation.
E    A    13. A new address is stored in the address buffer 48a; AND
          14. New data is stored in the data buffer 48d; AND
          15. Byte enable information is stored.
______________________________________
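The transitions of Table 1 map directly onto a next-state function. The Python sketch below encodes them under two stated assumptions: bus activity is summarized by a cycle type (`None` for idle, `"io"`, `"read"`, or `"write"`), and a write cycle to an address not held in the address buffer 48a, which Table 1 does not list, leaves the buffer in state A. `State` and `next_state` are hypothetical names.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    A = "idle"
    B = "hit detected"
    C = "drive buffered data"
    D = "invalidate"
    E = "capture new line"


def next_state(ps: State, cycle: Optional[str] = None,
               addr_in_buffer: bool = False,
               snoop_needs_write_back: bool = False) -> State:
    """Next state of cache line buffer 48 per Table 1.

    cycle is None (bus idle), "io", "read" or "write" on the PCI bus 24.
    """
    if ps is State.A:
        if cycle is None or cycle == "io":
            return State.A                                  # conditions 1, 2
        if cycle == "read":
            if addr_in_buffer and not snoop_needs_write_back:
                return State.B                              # conditions 3, 4
            return State.E                                  # conditions 11, 12
        if cycle == "write":
            # Condition 9; a write elsewhere is assumed to leave the buffer idle.
            return State.D if addr_in_buffer else State.A
        raise ValueError(f"unknown cycle type: {cycle!r}")
    # B, C, D and E perform their actions (conditions 5-8, 10, 13-15) and
    # then advance unconditionally.
    return {State.B: State.C, State.C: State.A,
            State.D: State.A, State.E: State.A}[ps]
```

A buffered read with a clean snoop follows A, B, C, A; a read that misses the buffer, or whose snoop forces a write back, goes through E so that the freshly written-back line replaces the buffer contents.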
It is understood that the invention described herein can take many forms and embodiments, and the embodiments described herein are intended to illustrate rather than limit the invention. Further, the bus configurations, bus sizes, cache line sizes, peripheral devices, and other details of the above description are only meant to illustrate the invention. Further still, the techniques described herein may be utilized in a computer such as a desktop, laptop, or tower computer, as well as in a variety of other electronic data circuits. Therefore, variations may be made without departing from the spirit of the invention. Accordingly, it is appropriate that the appended claims be construed broadly and in a manner consistent with the scope of the invention.