Latch Free B-trees (ASE 16sp02 MemScale option feature)
First, a bit of history. Latches are a form of concurrency control that acts as a companion to locks in a lock-based concurrency control implementation. However, where locks are transactional - i.e. once you grab a lock, you hold it until the end of your transaction - latches are held only for the duration necessary to accomplish a discrete step. While latches are used elsewhere in ASE, they are most commonly associated with indexes on DOL tables.
There is a bit of history about that as well. Back in the old days - before 11.9.2 - ASE experienced a lot of contention and deadlocks between applications. Engineering did a study and noted that the leading cause of the contention/deadlocks wasn't the data pages themselves, but rather the indexes - and specifically the index leaf nodes. The reason was that if two users were attempting to insert different rows into the same table - even rows scattered across different data pages - an index key on a datetime or sequential key number column would often incur blocking at the index leaf page level. As a result, in 11.9.2, along with datarows and datapages locking, engineering eliminated locking on index pages and replaced it with non-transactional latches. Early documentation noted that this would reduce contention, as a process would grab a latch on an index page only long enough to make the necessary modification and then release it immediately without waiting for a commit.
There are two types of latches - read or shared and write or exclusive. Similar to a lock, when you simply need to read an index page, you grab a shared latch (SH_LATCH) on it. When you need to change an index page, you grab an exclusive latch (EX_LATCH) on it. That way, when reading an index page, the page is consistent throughout the read as no one is changing a value right when you are trying to read it. Similarly, if you need to change an index key, you can make your change without concern whether someone else wants to change the same key or not.
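The SH_LATCH/EX_LATCH semantics can be sketched as a toy reader-writer latch in Python. This is a simplification for illustration only - real ASE latches are lightweight spinlock-based engine structures, and this sketch ignores fairness and spinning entirely:

```python
import threading

class Latch:
    """Toy page latch: many concurrent SH_LATCH holders, or exactly
    one EX_LATCH holder, never both."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0        # current SH_LATCH holders
        self._writer = False     # is an EX_LATCH held?

    def acquire_sh(self):
        with self._cond:
            while self._writer:              # readers wait only on a writer
                self._cond.wait()
            self._readers += 1

    def release_sh(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_ex(self):
        with self._cond:
            while self._writer or self._readers:   # writer waits for everyone
                self._cond.wait()
            self._writer = True

    def release_ex(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

Two readers can scan the same index page simultaneously, but a writer expanding a RID list excludes everyone - which is exactly where the contention described below comes from.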
Over time, however, as transaction volumes have exploded in the nearly two decades since this was implemented, we are once again seeing index latch contention as a scalability bottleneck. There are two ways to detect this. The first is sp_sysmon, in the task context switching section. The second is monSysWaits or monProcessWaits (preferred), where high Wait values for WaitEventID=41 show up. Latch contention can generally be divided into two types:
- Write/Write contention - two writers collide
- Read/Write contention - a writer is blocked by a reader or vice versa
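The monSysWaits-style triage can be sketched in a few lines. The row values below are invented for illustration; the event IDs follow the text (41 is the latch wait event, 250 is waiting for incoming network data, which is ignorable):

```python
# Toy triage of monSysWaits-style rows: rank wait events by total wait
# time, skip the ignorable ones, and see whether the latch event lands
# in the top 5. The numbers are illustrative, not from a real server.
LATCH_WAIT_EVENT = 41        # wait to acquire latch
IGNORABLE = {250}            # waiting for incoming network data

waits = [
    # (WaitEventID, Waits, WaitTime_ms)  -- made-up sample rows
    (250, 379107, 900000),
    ( 41, 691132, 412000),
    ( 29,  27644,  31000),
    ( 55,  50674,  45000),
]

ranked = sorted((w for w in waits if w[0] not in IGNORABLE),
                key=lambda w: w[2], reverse=True)
top5 = [w[0] for w in ranked[:5]]
latch_problem = LATCH_WAIT_EVENT in top5
```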
The big question is: how do you find the source of the contention? There are several likely sources of latch contention. The first, and least likely, is allocation pages. If you see a lot of latch contention with APL tables, it could be this case, in which you might need to look into dbcc tune(greedyalloc). The underlying cause is a lot of concurrent space allocations for different tables - which may be controllable by changing the number of preallocated extents. However, this is the least likely source of latch contention.
The second is indexes on datarows-locked system tables. While a bit more likely than the former, the most common cause here is highly concurrent object creation/deletion or similar activity which, due to indexes on columns such as "id", results in contention. The most likely case is highly concurrent create #table commands in tempdb. The only real solution for this is likely to use multiple tempdbs. As with the first, the likelihood of this being the cause is fairly small.
The third and most common cause of contention is indexes on DOL locked tables. Unfortunately, unlike locks, latches are not tracked in monLocks or syslocks. As a result, there is no easy way to find the exact points of contention. However, there is a trick that can identify the candidates quite accurately.
First, we have to understand indexing 101. Indexes are always maintained in sort order. Consequently, indexes on columns with monotonically increasing (or decreasing) values will have all of the index insertions hitting the last page. Think trade_date, order_number and similar columns. The second indexing basic is that we only store an index key one time. For nonunique indexes, duplicate key values are stored once, with a list of RIDs. This fact is a bit hard to find, but it is in section 4.6.1 Factors that can Affect Storage Size of the ASE Performance and Tuning Series: Physical Database Tuning manual, in a bullet that states:
- Duplicate keys are stored only once, followed by a list of row IDs for the key.
Now this is interesting - if a new row is added with an existing key, the RID list needs to expand. If ASE used row-level latches for index access, this would result in index row forwarding and a real performance nightmare. However, with page latching, a process can expand the RID list simply by rewriting the rest of the page - shoving the subsequent index rows down.
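A toy sketch of that leaf-page behavior, with hypothetical keys and RIDs. A real ASE page is a slotted byte array; a Python list of entries stands in here, but the key property - duplicates stored once with a growing RID list, everything kept in sort order - is the same:

```python
# Toy DOL index leaf page: a sorted list of (key, rid_list) entries.
# Inserting a duplicate key grows the existing RID list; on a real
# slotted page that growth shoves every later index row down, which is
# why the whole page sits under one EX_LATCH while it happens.

def insert_key(page, key, rid):
    """Insert a key/RID pair, storing each duplicate key only once."""
    for entry in page:
        if entry[0] == key:
            entry[1].append(rid)          # expand the RID list in place
            return
    page.append((key, [rid]))
    page.sort(key=lambda e: e[0])         # indexes stay in sort order

page = []
insert_key(page, 'Waiting', (1001, 5))    # hypothetical (pageno, rownum) RIDs
insert_key(page, 'Sent',    (1001, 6))
insert_key(page, 'Waiting', (1002, 1))    # duplicate key: one entry, two RIDs
```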
So there we have our main latch contention causes:
a) concurrent updates/appends to monotonically increasing index key values
b) concurrent updates/appends to duplicate index key values
c) any attempt to read an index page during any write operation on the same page
d) any attempt to write to an index page during any read operation on the same page
Now then, let's outline the process for finding candidates:
1) Verify that index latch contention is a fairly big problem (top 5 in WaitEvents or context switching)
2) Find the most volatile indexes that have high logical reads
3) Obtain the index keys for those candidate indexes
4) Validate that either
a) a monotonic sequence is likely, or ...
b) highly duplicative keys are likely, or...
c) there are a lot of range/covered scans
Once verified, you have a choice of fixing the index, dropping the index, or converting it to a latch-free b-tree index.
Let's walk through an example a colleague sent me.
The customer complained of a slow system. Looking at sp_sysmon, you can see CPU is fairly low:
Engine Busy Utilization CPU Busy I/O Busy Idle
------------------------ -------- -------- --------
Engine 0 28.7 % 0.5 % 70.8 %
Engine 1 23.9 % 0.8 % 75.3 %
Engine 2 27.9 % 0.8 % 71.3 %
Engine 3 26.5 % 0.5 % 73.0 %
Engine 4 26.7 % 0.8 % 72.5 %
Engine 5 25.6 % 2.0 % 72.4 %
Engine 6 23.3 % 0.6 % 76.1 %
Engine 7 21.2 % 0.8 % 78.0 %
Engine 8 27.1 % 0.8 % 72.1 %
Engine 9 22.7 % 0.6 % 76.7 %
Engine 10 25.5 % 0.9 % 73.7 %
Engine 11 28.3 % 0.8 % 70.9 %
Engine 12 30.2 % 0.5 % 69.3 %
Engine 13 21.7 % 0.5 % 77.7 %
Engine 14 24.9 % 1.0 % 74.1 %
Engine 15 26.1 % 1.5 % 72.4 %
Engine 16 30.4 % 0.6 % 69.0 %
Engine 17 25.0 % 0.5 % 74.5 %
Engine 18 23.4 % 0.7 % 75.9 %
------------------------ -------- -------- --------
Summary Total 488.9 % 15.3 % 1395.8 %
Average 25.7 % 0.8 % 73.5 %
Now this is not that abnormal for a system experiencing performance problems. However, most folks focus on high CPU - but in reality, CPU is an indicator whether high OR low. High CPU can be driven by any of a number of things (LIO, sorting, compilation, spinlocks, etc.). Low CPU is generally caused by slow I/O or blocking/contention. For us, with sp_sysmon this is easy - we just check the task context switches section:
Task Context Switches Due To:
Voluntary Yields 163.1 0.1 97866 3.6 %
Cache Search Misses 46.1 0.0 27644 1.0 %
Exceeding I/O batch size 0.1 0.0 76 0.0 %
System Disk Writes 20.6 0.0 12340 0.5 %
Logical Lock Contention 0.8 0.0 509 0.0 %
Address Lock Contention 11.2 0.0 6730 0.2 %
Latch Contention 1151.9 0.4 691132 25.4 %
Log Semaphore Contention 84.5 0.0 50674 1.9 %
PLC Lock Contention 0.7 0.0 391 0.0 %
Group Commit Sleeps 3.2 0.0 1909 0.1 %
Last Log Page Writes 63.1 0.0 37854 1.4 %
Modify Conflicts 19.6 0.0 11759 0.4 %
I/O Device Contention 0.0 0.0 0 0.0 %
Network Packet Received 631.8 0.2 379107 13.9 %
Network Packet Sent 991.9 0.3 595128 21.9 %
Network services 1577.1 0.5 946244 34.8 %
Other Causes -230.4 -0.1 -138224 -5.1 %
Now this is a poster child. Note that when looking at task context switches or wait events, you can ignore network packet received (WaitEventID=250), as waiting for incoming client requests is normal. In monSysWaits/sp_sysmon, you can also ignore system-related items such as checkpoint idle time, etc. (or Network services above). If Latch Contention is then in your top 5, you likely have a significant problem which needs some attention if you want to improve performance. The trick then is to run a query similar to:
select top 20 DBName,
    object_name(ObjectID, DBID) as TableName,
    ObjectName as IndexName,
    IndexID,
    LogicalReads,
    RowsInserted + RowsDeleted as IndexMods,
    UsedCount
from master..monOpenObjectActivity
where IndexID > 1 and IndexID < 250
order by 6 desc    -- column 6 = IndexMods
Assume we get back a list like:
TableName IndexName IndexID IndexMods
-------------- ------------------------------ ----------- ------------
MessageProcess MessageProcess_Idx1 3 198828
NotficationTbl NotficationTbl_SentStatus_idx 2 96794
UserSession UserSession_CreateTime_idx 3 78623
NotficationTbl NotficationTbl_CreationTS_idx 4 32259
NotficationTbl NotficationTbl_ID_idx 3 32259
The above is a shortened example. One trick (as we will see why later) is to use the LogicalReads to help score which indexes to focus on. However, for simplicity's sake, we will just focus on the number of modifications. Some may question why we only used RowsInserted and RowsDeleted and not RowsUpdated....wellllll....that's because indexes never show RowsUpdated>0. The simple reason is that indexes are sorted, and therefore each index update is treated like a deferred update - a delete/insert pair - even for the same key (remember, we may have to rewrite the index page when expanding the RID list).
Since the first culprit is the biggest, we simply run sp_helpindex on the MessageProcess table to get:
index_name index_keys index_description
------------------- ----------------------------------------- -----------------
MessageProcess_PK SourceID, MessageID, EventNum, HandlerID clustered, unique
MessageProcess_Idx1 CreateTime nonclustered
This one is almost a gimme. "CreateTime"....hmmmmmm....I would be willing to bet that it is monotonically increasing. Similarly, if we look at the UserSession table for the third index above, we see a similar situation:
index_name index_keys index_description
-------------------------- ------------------ -----------------
UserSession_PK UserID, SessionID clustered, unique
UserSession_CreateTime_idx CreateTime nonclustered
That is two of the top 5 that *may* be candidates for latch-free b-tree. Now let's take a look at the NotficationTbl table's indexes:
index_name index_keys index_description
----------------------------- ------------------ -----------------
NotficationTbl_SentStatus_idx SentStatus nonclustered
NotficationTbl_ID_idx ID nonclustered
NotficationTbl_CreationTS_idx CreationTimestamp nonclustered
The last two are similar to the first two - likely monotonic sequences. The first one is a bit fun - SentStatus. This is likely very low cardinality - e.g. Waiting, InProgress, Sent, Archived....or similar. An index is somewhat desirable, as we likely have a polling process that looks for all the 'Waiting' notifications and grabs the next one - hopefully using readpast locking. Ignoring that, you can probably see the problem. If we have 10 users concurrently adding messages to the queue, those 10 are all updating the same index row - Waiting - just appending to the RID list. Similarly, if we have 10 users sending messages, all 10 update first the InProgress index row RID list and then the Sent index row RID list. Aiyeeeyiyiyi. This is where partitioning the queue - into high/medium/low, east/central/west, A-H/I-S/T-Z, or some similar scheme - would help.
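A sketch of that partitioning idea: prepend a small hash-bucket component to the index key, so that concurrent inserts of the same status spread across several index entries instead of piling onto one shared RID list. The bucket count and key shape here are hypothetical:

```python
# Hypothetical mitigation for a low-cardinality status index: index on
# (bucket, SentStatus) instead of SentStatus alone, with the bucket
# derived from something already in the row (e.g. a session id mod N).
BUCKETS = 4

def status_key(status, session_id):
    """Composite index key spreading same-status inserts over BUCKETS entries."""
    return (session_id % BUCKETS, status)

# Ten concurrent 'Waiting' inserts now touch 4 distinct index entries:
keys = {status_key('Waiting', sid) for sid in range(10)}
```

The poller pays for this - it must now probe each bucket for 'Waiting' rows - which is the usual trade: read fan-out in exchange for write concurrency.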
The big question is whether latch-free b-tree would help here or not.
The answer is in how the Latch-Free B-Tree (LFB) is implemented. In ASE 16sp02, the LFB is implemented using a per-database mapping table into which all dirty index pages are hashed. The mapping table contains the address of the most recent modification to the index page, which a reader folds into the original page to derive a page image that can be read. After 16 modifications to the same index page, a delta merge process merges the changes back into the original page. Because of this versioning, the SH_LATCH for reads is eliminated - the mapping table always points at the latest completed change to the page. When a writer makes a modification, it does so by copying the latest version of that row, creating a new version, and then using a Compare & Swap (CAS) atomic instruction to overwrite the address in the mapping table with the new address.
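A toy model of the mechanism as described - delta chains hung off a mapping table, reads that fold the chain latch-free, CAS-published writes, and a merge after 16 deltas. This is a single-process simplification: a Python lock stands in for the hardware CAS instruction, and the merge step glosses over the synchronization a real implementation needs:

```python
import threading

MERGE_THRESHOLD = 16    # per the text: delta merge after 16 modifications

class MappingTable:
    """Toy LFB mapping table: page_id -> chain of delta records.
    Readers fold the chain over the base page image without any latch;
    writers publish a new chain head with a compare-and-swap."""

    def __init__(self):
        self._map = {}                    # page_id -> head delta node
        self._guard = threading.Lock()    # stands in for a hardware CAS

    def _cas(self, page_id, expected, new):
        with self._guard:                 # emulate an atomic CAS
            if self._map.get(page_id) is expected:
                self._map[page_id] = new
                return True
            return False

    def write(self, page_id, base_page, change):
        while True:                       # classic CAS retry loop
            head = self._map.get(page_id)
            node = {'change': change, 'next': head,
                    'len': 1 + (head['len'] if head else 0)}
            if self._cas(page_id, head, node):
                break
        if node['len'] >= MERGE_THRESHOLD:
            # Delta merge: fold all changes into the base page and reset
            # the chain. (A real implementation needs far more care here.)
            base_page.update(self.read(page_id, {}))
            self._cas(page_id, node, None)

    def read(self, page_id, base_page):
        """Latch-free read: base image plus deltas, oldest first."""
        chain, node = [], self._map.get(page_id)
        while node:
            chain.append(node['change'])
            node = node['next']
        image = dict(base_page)
        for change in reversed(chain):
            image.update(change)
        return image
```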
There is a bit of a problem, though. While the SH_LATCH is eliminated, the EX_LATCH is not. This has as much to do with Write Ahead Logging and the PLC queue buffer pin/unpin problem as anything else. However, it is also reasonable to expect that absolutely true concurrency would be impossible. For example, with a sequential key as a unique index, true concurrency could result in two tasks updating different values (e.g. 3 & 4) to the same value (e.g. 5), which would violate index uniqueness. Additionally, with two inserts of two different rows that share the same status, so long as we are using the RID list, we have the same issue. Remember, while MVCC says writers don't block readers and readers don't block writers...writers DO block writers.
Sooooo....remember our sources of latch contention:
a) concurrent updates/appends to monotonically increasing index key values
b) concurrent updates/appends to duplicate index key values
c) any attempt to read an index page during any write operation on the same page
d) any attempt to write to an index page during any read operation on the same page
The only ones we can do anything about are the latter two, (c) and (d). This is more of a benefit than you might think at first. Keep in mind the implication: a volatile index that is highly read will benefit - and typically this is one of the most useful indexes, such as a primary key or date-based index, where a typical date range scan could impede writes. Similarly, all those polls for 'Waiting' statuses could impede write operations. If you think about it, the entire reason for the indexes in the first place was reads - frequent enough reads to justify the index....so it is just as likely that those reads are causing a majority of the latch contention. In fact, in our testing with indexes with high latch contention, we have been able to improve performance by ~2x by using latch-free b-trees. Of course, for indexes being used for range scans or covered scans, the gain could be even better. Range scans or covered scans can be spotted via the ratio of LogicalReads to UsedCount. If it is <10, the index is pretty much a point-query index and not likely to see much benefit. However, the more LogicalReads outweighs UsedCount (say 100x) and the more volatile the index, the greater the effect of turning on latch-free b-tree.
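That closing triage rule can be captured in a small helper. The ratio thresholds mirror the text (<10 means point queries, ~100x suggests range/covered scans); the volatility cutoff is an invented placeholder, not an ASE-documented value:

```python
# Hypothetical triage of the LogicalReads:UsedCount ratio for an index.
def lfb_benefit(logical_reads, used_count, index_mods, mod_threshold=50000):
    """Rough verdict on a candidate index for latch-free b-tree
    conversion. mod_threshold is an arbitrary 'volatile enough' cutoff."""
    ratio = logical_reads / max(used_count, 1)
    if ratio < 10:
        return 'point-query index: little LFB benefit'
    if ratio >= 100 and index_mods >= mod_threshold:
        return 'volatile range-scan index: strong LFB candidate'
    return 'possible LFB candidate'
```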