Sunday, September 27, 2015

Cache side channel attacks

Cache side channel attacks

Errata 5.11.2015:
As I expected a few errors and less than accurate formulations got into this post. There is two things I’d specifically wish to address here.

My comments on using a non-inclusive cache hierarchy to defeat cache attacks are misleading at best. Obviously since the L1 and L2 caches are on a per core basis. It’s the L3 shared nature combined with the non-inclusive cache hierarchy that allows us to flush these caches from another core in the first place. However these caches are relatively small so that with only a “small” amount of memory activity by the victim or other code running on the same core (this includes the attacker if she can get herself executing on that core) rare events can still be effectively spied upon.


On using performance counters to detect cache side channel attacks I focused only on low frequency attacks in this text. Mr. Herath and my slides from Black hat does not. The idea is that high frequency cache attacks is relatively easily identified as such from performance counters (either through the simple amount of cache misses or say a ratio of cache misses to cache hits). Alternatively machine learning methods could be applied to such performance counter data. The low frequency data is much more difficult to keep apart from noise generate by normal usage of memory in a busy system. Say the attacker is looking for keyboard events. For this he is polling every .5 second and have say 15 good addresses he polls. That gives us 15 cache misses in .5 second periods (possibly even spread out over the .5 seconds) which isn’t much if a video encoder is causing thousands of misses also going up and down periodically as the video encoder switches between motion search (large number of cache misses and lots of hits) and entropy encoding (few hits and misses).  This is where the approach I describe here can filter the noise and force the attacker into territory where we can with confidence detect him.

Prelude

No tech stuff in this part here. Skip to introduction to read the actual blog post. If you care about my ramblings do continue.
It’s been 8 months awesome months since I started my comeback into infosec with this blog. It’s time to do status. A few times I’ve felt misunderstood. The most common misunderstanding is that I have a different concept of research than many in the infosec community. This blog isn’t research. It is play. I do it for fun and games and that’s it. There is no intend on my side to do research, because that requires me to be systematic about it and that again is outside of my time budget. I hope from time to time to present a new idea here, but that’s the loftiest I wish to get with this blog. I do hope that the ambitious infosec newbee might be inspired to look into some of the things I write about and I also hope that the infosec professionals occasionally enjoys a lunch break with my musings.
The most read post so far was the nostalgia post. Followed by anything row hammer, followed by anything with machine learning in the title. Ironically the post I like the best on machine learning doesn’t have those words in the title and isn’t being read much. The post nobody cares about is about H.a.r.e.s. which I think is a shame though I admit is not the most relevant, but H.a.r.e.s. sounds like such a cool idea. The LangSec post was the biggest surprise to me: It’s quite well read.
As a personal note – I find the best infosec blogs is by far http://lackingrhoticity.blogspot.com/ by Mark Seaborn and http://www.malwaretech.com/ by MalwareTech. Mark’s blog is really inspiring on a technical level and MalwareTech’s joy of finding things out is contagious to me.
For now I’ll go offline for a couple of weeks and hope to spend a lot of time with a print out of an article about using performance counters to reverse engineer caches.

Introduction

This blog post benefitted from conversations with some really smart people . I’d like to extend my thanks to them.  Nevertheless this blog is probably full of errors and they are mine alone. I’ve not released source codes because the topic is mostly offensive and the source codes would be more useful for offensive purposes than defensive. If you for some reason think my sources could be useful, get in contact (I’ll be offline the next 3 weeks, but I will get back to you). FInally I'd like to apologize that this post got rather rushed at the end.Infact I'm writting this with a plane to catch. Particular damage was done to the different timers section where I really would've liked to add more information.


Cache side channel attacks have been around since 1996 or so perhaps even longer. Recently they had a renaissance in the literature because of the (re-?) emergence of a 3rd level cache . (The 3rd level cache is identical to the last level cache on systems that have 3 cache levels - this is worth to note when reading the literature). The 3rd level cache on modern intel processors is shared between cores and this effectively allows a program at any privilege level to spy on programs running on other cores simultaneously without being concerned about sandboxes, privilege level and to some degree even through hypervisors. The sheer size of the new 3rd level cache also allows for spying being more accurate than with previous caches. The usage of cache side channel attacks on the L3 cache has been show to be able to successfully allow you to track mouse cursors from a java script on a webpage, grab keyboard input from other users on the same system and grab cryptographic keys on co-located virtual computers in the cloud. It is thus not entirely an academic pursuit.
There are currently three different kinds of cache side channel attacks. I shall write about them all to some extend here so I’ll introduce them in short before I go into a slightly more detailed description of how the cache works.
1.       Evict + time
The attacker measures the time it takes to execute a piece of victim code. Then attacker flushes part of the cache, executes and times the victim code again. The difference in timing tells something about whether the victim uses that part of the cache.

2.       Prime + probe
The attacker now accesses memory to fill part of the cache with his own memory and waits for the victim code to execute. (Prime) Then the attacker measures the time it takes to access the memory that he would carefully placed in cache before. If it’s slow it is because the victim needed the cache and this gives us knowledge about what victim did. (Probe)
3.       Flush + reload
The flush and reload attack utilizes that processes often share memory. By flushing a shared address, then wait for the victim and finally measuring the time it takes to access the address an attacker can tell if the victim placed the address in question in the cache by accessing it.

A short introduction into level 3 cache

So let’s spend a bit of time considering how the level 3 cache is build. I’ll use details from the Sandy Bridge as an example, but it’s mostly similar on different intel architecture. The details however can be important in implementing an attack and defense. I use the sandy bridge for 3 reasons: I only have access to a sandy bridge (details and sizes are based on my wife’s computer), it’s the best researched in the literature and it’s likely to be the simplest.
At some point in the past CPU’s became much faster than memory causing CPU’s to be forced to stall when waiting for memory. Because memory accesses typically concentrate on a relatively small amount of memory at any given time adding a cache of super fast (and expensive – in terms of CPU real estate and money) memory directly in the CPU gave a large performance gain for only a small price. Thus the first level cache was added. The later as CPU’s turned faster and software more memory hungry a slightly less expensive a second level cache was added – real estate prices in the CPU drops with the distance to where magic happens.  And finally a 3rd level was added, which is a lot slower that L1 and L2, but still much faster than memory and comparatively huge.
L1 and L2 cache operations are slightly more advanced than L3 due to the fact that they are per core, so a messaging system exists to keep L1 and L2 coherent between cores. I’ll not discuss that here since my focus is the L3 cache, but suffice to say they are otherwise similarly build. There is however one design feature by intel which is important: The cache hierarchy is inclusive. This means that if memory is loaded in L1 it’s also loaded in L2 and L3. If it’s loaded in L2 it’s also loaded in L3. This is intel specific – AMD K7 for instance also has a cache hierachy but it’s not inclusive. For an attacker an inclusive cache makes life much easier. If he can flush memory from L3, he will automatically flush it in all caches.
When a CPU vendor is building a cache they need to track what part of the memory is in the cache and what’s not. This gives rise to a trade off between how much memory you spend for managing the cache, how flexible the cache is and how fast you can do your look up. The level 3 cache on Sandy bridge balances these trade offs in the following way. Sandy Bridge’s L3 cache and many other CPU's is a so called N-Way set associative cache. This means first the memory is divided into a small number of blocks and each blocks can be mapped to a particular set in the cache.  Each set can store any memory location within the block in N different places in the set, N being a small integer. For Sandy bridge L3 has either 12 or 16 depending on the exact model. It is implemented this way as a compromise between the two extremes: The first is called direct mapping meaning that each memory location has only 1 location where it can be stored in cache making for quick look up, but fierce competition of different parts of memory to be stored in this magic location - it is the N=1 scenario. The second extreme is called fully associative cache and means that any memory location can be cached in anywhere in the cache, the N=cache size scenario (only 1 set). While this makes for a very flexible cache, it also requires searching the entire cache for a match. A good way to think about it is that sets are indexed and ways are searched when looking for a cache hit. Also to further reduce the problem the cache isn't working on single bytes, but on 64 byte chunks called cache lines. Finally Sandy Bridge and many of subsequent CPU's have many cores and each core has a part of the cache located next to it, called a slice. The slices are connected to a ring buffer (intel calls it QuickPath Interconnect) so that any core can access the slice of another core, but only at a penalty. This is a compromise for fast lookup on cache belonging to the core, and access to the entire cache for all cores. Graphically this looks something like this:



Summary example for  my wifes Sandy Bridge: Any given byte in memory can be cached as part of a 64 byte cache line. This cache line is cached in exactly one slice which belongs to a core. In this slice it can be an occupant of exactly one cache set out of 2048. Within the cache set it can occupy any of 12 ways. 2 (slices/ cores) * 2048 (sets) * 12 (ways) * 64  (bytes per cache line) = 3 Mb cache total.
Let's take a look at it differently. A request for data is send from the core 1 down towards the memory. We start where one request has arrived at the L3 cache:
1. First part of the address is used to index the correct slice. If we are lucky it's likely to be on the current core's slice otherwise we pay a small penalty for using the ring buffer to access the slice of core 2.
2. Other bits of the address is now used to index the set to which the address belongs.
3. Finally the N-ways is searched for our address. If it's not found the cache returns a miss  and the request will be forwarded to memory.
4. If  the requested  memory is found in the cache and the cache returns the relevant bytes from the cache line using the lowest 6 bits of the address to index into the cache line.
For the details of how the bits of the physical address of the request maps to the cache see Seaborn(2015).

Eviction
The cache is filled either through memory reads or through the prefetcher. When you read memory very often you’ll shortly thereafter either read or write the same memory or memory in the same cache line. Thus reading memory usually causes the memory you just read to be stored in the appropriate cache set. The prefetcher looks for patterns in your memory access and tries to move cache lines from memory to the cache before you need them. Either way a set is going to be full at some point and to keep the system up to date memory needs to be evicted from the cache. The method by which is determined what cache lines are no longer needed in the cache is called the eviction policy and the act of dropping a cache line is called to evict – or sometimes flush. The first algorithm to come to mind is to drop the least recently used cache line to make space. This is not surprisingly called a LRU eviction policy. The problem with such a policy is that you need to keep a track of the order that cache lines where accessed within the cache and to evict the least recently used. This again requires you to use n*log2(n) bits (where n is the number of ways) and search through the bits to figure out which one to evict. This can  be done in different ways, but it’s always computationally intensive. With bits and time being expensive in the CPU, Sandy Bridge opted for a so called pLRU or “Pseudo least recently used” eviction policy. Instead of accurately keeping track a simple binary tree is represent with a suitable number of bits made to represent the cache set so that each leave is a way in the set. On access the bits are flipped in nodes of the tree that where used and for eviction the opposite path is followed to figure out which way is to be evicted. pLRU is not only faster algorithmically since only the nodes needs update, but also require less bits. The downside is that it only approximates LRU. To read more about eviction policies you may like to read this Jimenez (201x). More modern CPU’s than the Sandy Bridge has more complex (adaptive) eviction policies. We can now figure out that if we wish to remove other programs memory from the cache we can  do so by systematically load our own memory from each cache set. Theoretically for Sandy Bridge we need only N addresses (12 or 16) which belong to a cache set of interest and load each of those twice and the cache will be filled with those addresses. Real life experiments suggest that a little more than twice provides better results. The interesting thing is that it’s quite a doable proposition. In fact it’s so doable that even without knowing the actually addresses we can figure them out by using timing and thus do it from java script without pointers. Oren et al (2015) does this. A detailed discussion of optimally filling cache sets is giving in Gruss &Maurice(2015)

Cache side channel attacks

Now that we have a partial overview of how the cache works we can turn our attention on how the 3 before mentioned cache attacks works the cache to gain information.

Evict + time

In the beginning of this blog I wrote: “The attacker measures the time it takes to execute a piece of victim code. Then attacker evicts part of the cache, executes and times the victim code again. The difference in timing tells something about whether the victim uses that part of the cache.” We can now fill in the blanks. If we know which cache sets the victim is likely to use we setup the cache as to give us an idea about what memory the victim used. If we run the same victim once we know it’s likely to have cached part of the memory it used. Now we can time how long it takes to run with a high resolution timer to give us a baseline. Now we load our own memory systematically so that a number (in which we are interested in terms of the victim) of cache sets becomes full with our data –now we can assume that anything the victim placed in the cache has been evicted. Then we let the victim run again and compare how long it took to our new baseline. We can repeat this process for different cache sets until we’ve figured out which cache set he used. If branches are sufficiently large (larger than the cache line size) we can say which parts of the code the victim uses. If memory structures are bigger or displaced across cache lines we can say which memory structures he used. This information can be used to draw inference on information private to the victim. Typical examples are attacking AES that is using tables for encryption with the key being private to the victim or KASRL where actual addresses are hidden from view. We should notice here that other tasks using the cache will add noise to our measurements so repeating the attack again and again will improve accuracy as on multi core system there is always noise on anything to do with memory. Also the prefetcher is an enemy to be considered as it might load memory in anticipation of being used by the victim without the victim actually using it and thus giving us false information. In practice these things makes attacks less reliable, but by no mean infeasible. The real down side to this attack is that we need to be able to cause execution of victim code which might be difficult in the first place. The positive side of this attack is that we can apply it in java script or native code since we need not know any addresses from the victim to carry out the attack – the actual cache sets used is all the information that is required.

Prime + probe
The Prime + probe attack looses the restriction of evict + time that the attacker needs to be able to start execution. To make this happen we again make sure that interesting cache sets are filled with memory we control. Once this is the case we’ll wait a predefined amount of time to see if a victim uses memory in the region covered by the cache set. If he does he’ll cause one of our addresses to be evicted. This is the prime phase of the attack. The probe phase is now to access each address we originally filled the cache set with. If this is slow, the victim has used memory with in this cache set as well. Probing fast enough we’ll get a series of observations over time of what cache sets are being used on the computer. This allows us to draw lots of inference on what the victim is doing at any time. Typically to map cache set activity the attacker runs lots of attacks on a system under his control to generate a baseline data set. He then uses a machine learning method to map the time series he got from the victim to figure out what the profile means the victim is actually doing. Prime+probe is in my opinion the most powerful of the cache side channel attacks. Again we don’t need actual addresses so we can make it work in java script. We don’t need to actually cause execution of the victim code so just need to loiter around – which is one less prerequisite for the attacker. Thus this attack vector is suitable for almost any situation where we get access to running our own code on the same computer as our victim. Hypervisors, privilege levels, sandboxes and other barriers are irrelevant because we all meet up in the cache. The down side to this attack is that the data we get back tends to be very noisy. We only get which cache sets where use (and partly how many cache lines where loaded), but we have no initial information about which core, thread or even which code cause the access. So to successfully we need to repeat the process a lot and we can spy only on events that are relatively common. Oren et al(2015) covers this kind of attack in more details.

Flush + reload
Flush+reload over comes evict+time requirement of causing execution as well as prime +probe noisy data. It does this by assuming that victim and attacker are using the same memory. This proposition at first seems weird since even in unsafe operating system such as windows 95 each process is given it’s own memory space. But it’s not as weird as it may seem. This is because of a memory management feature by most operating system called page dedublication. If two identical memory pages are loaded in separate processes, operating systems can save physical memory (and more more efficient use of the cache) by mapping the same page in each process. Most operating system does this – even across different users and privilege levels. Typical example would be the code of kernel32.dll in windows. It’s shared by processes but most of the DLL (especially read-only-portions) such as code is only mapped in physical memory once. Should a process decided to write to such memory most operating systems incorporate a feature called copy-on-write, essentially the OS sees to it that an exception is thrown on write access and then makes a new copy of the page for the offending process only before retrying the write operation. Page dedublication is rampant in all modern operating system and even some hypervisors are capable of doing it. The attacker now uses this information by flushing a specific address shared by attacker and victim. Because he needs to flush a specific address this attack seem impractical in java. In native code it’s quite easy to flush an address using the clflush instruction, but there is also no reason why he shouldn’t be able to do it using the normal eviction policy of the cache. Once the address is flushed the attacker waits an interval and then times the access time to the address. If the victim used the address it’s likely to be in cache and thus the access will be fast. Since we know exactly what address we are talking about we have very little noise. Flush+reload can be used as a tailored attack or by comparing what happens on a machine under the attackers control with what happens on a victim computer using machine learning methods. The latter called template cache attacks. The best article on flush+reload attacks is probably: Gruss, Spreitzer, Mangard(2015)

Current counter measures to cache side channel attacks


My general opinion on security problems such as CSC’s is that we should fix the root cause and in this case the root cause is the micro architecture of the cache. The core problem however is the cache being shared across privilege levels and that might be a significant hurdle to climb for CPU designers. Also there is already plenty of existing hardware in use so even if intel where to fix the root problem we’d be stuck with these issues for a considerable amount of time unless we look at other options.
1.       Less accurate timer
Most cache side channel attacks use the timing of a memory access to determine if it’s cached or not cached. This usually works through the RDTSC/RDTSP instructions to read the current time stamp counter of the processor or an operating system/library wrapper thereof. For javascript this can be implemented directly in the wrapper and has so far this has been suggested and already implemented in response to Oren et al.(2015) Obviously if the time is sufficiently coarse we cannot tell a cache hit from a cache miss apart. For user mode attacks RDTSC(P) can be disabled in user mode through a bit in CR4, which will cause it to make an access violation. This again can be used to emulated the instruction with any coarseness we wish, leaving compatibility with most existing software. Now this approach is never going to be perfect. The reason is, that if we start a timer at a random point in time and then carry out a flush or a probe the probability of seeing an increment reading when reading the timer again is increasing in the time the flush or probe took. Thus if we can sample enough events we’ll be able to draw inference what is happening – that is spy. Say keyboard presses are not such an event type as they’ll be more or less unique events – we can’t just ask the user to type his password 10000 times. Where as blitting a mouse cursor touches multiple cache lines at once and in neighboring positions as it’s moved we could probably get a good idea about where it is even with a middle coarse timer. One approach to avoid this for attackers would be using performance counters to count LLC cache misses instead of relying on timing. levels allows for it. Also on virtual machines performance counters are often not implemented. OS and virtualization implementers however should be aware that performance counters provide a powerful tool for reading into side channels even when you only get your own performance data because of sharing of resources in the system. I’ll return to timing later.
2.       Non-inclusive cache hierarchy
Obviously this does nothing against code using the CLFLUSH instruction, but people trying to evict the cache through filling a cache set now need to bother about flushing all levels in the hierarchy separately. Obviously this makes it more difficult but I don’t think it’s a deal breaker – since we can relatively easy evict a cache set in a 3MB cache, it seems like evicting one in a 32 kbyte cache should be a feasible task. I’ve not read about cache attacks on AMD k7, but I’m fairly pessimistic that despite the non-inclusive cache hierarchy that CSC’s are possible on this processor too.
3.       Fully associative caches though they may not hold they same performance, they do make it very difficult to sufficiently fast evicting the cache without the use of CLFlush instructions as we’d essentially have an extreme amounts of ways in our eviction set. Again this would only be a solution on new hardware and to really close off the gap the Clflush instruction should be made priviledged. There are other reasons – row hammer – to follow such a path.

4.       Disable cache-line sharing
Flush + reload attacks can be dealt with effectively if there is no shared memory. This ranges from an operating system/ virtual machine level disabling page deduplication to actively causing copy-on-write on particularly interesting pages. On Windows 7 a simple ReadProcessMemory() followed by WriteProcessMemory() is enough to trigger a copy-on-write and thus disable Flush+reload – you don’t actually need to change memory! There is other reasons to reconsider page dedublication: The copy-on-write takes times and opens up for a side channel attack by timing write operating to pages. I believe Daniel Gruss wrote a paper on the subject, but I could be mistaken.
5.       Accessing and flushing memory “randomly” by a victim or operating system obviously adds lots of noise to the process. I love the irony of using cache side channel attacks to mitigate/detect cache side channel attacks – but despite this irony there is obviously a heavy performance penalty associated with causing cache misses to protect against attacks and in scenarios where the attacker is making many repeated attempts this is likely to be too big.

6.       The intel CPU tries to load stuff into the cache before it’s used prefetching. It’s especially useful for code and reads which are predictable like memcpy(). Obviously the prefetcher continually causes cache hits even when nobody already accessed the memory thus making more difficult to get good data on memory activity for spying purposes. However I find it unlikely that we’ll see a prefetcher that’ll be sufficiently good to do land a very strong blow on the possibilities of an attacker.

7.       Also it’s been suggested to modify software so that each user has a version with a unique memory layout and access path. Imagine a powerful mutation engine at work. This has been suggested as a remedy against code reuse attacks in the past and is likely to make CSC quite a bit more difficult. It is also quite feasible as such a mutation engine was used in a successful virus by Z0mbie back in 2005. The downside is of cause that mutation engines of any kind posses a Q&A problem to software developers. Certainly this method has not yet gained traction in fighting against code reuse attacks so the probability of it gaining tracking for fighting CSC’s seem slim.

8.       Using performance counters is also an option for detection. I suggested in mine and Nishad Herath’s black hat slides to instrumentalize RDTSC(P) or other high resolution timers to first make them less accurate and second to  start counting cache misses using the intel performance counters. This forces a spy into territory where he has to look at more cache misses to get reliable information. This combined with the presence of a high resolution timer makes a very good heuristic for cache side channel attacks. Especially since one potential response would be to make the timer even less accurate – thus only mildly impacting the system as a whole in the event of a false positive. In my experiments false positives have been rare though low-frequency spying sometimes ends with false negatives.  I imagine that continued spying as well as tuning with the coarseness of the timer could make it a viable solution. However let’s take a look at timing from an attacker perspective. Even if the attacker manages without a high resolution timer, it might be well worth looking into performance counters to detect cache side channel attacks – because any high-resolution-timer is likely to leave uncommon artifacts in the performance counters.

Timer issues


From the above it looks like the attacking the timer is the most promising defensive technique as it deals with all three methods of cache side channel attacks. It’s feasible,  relatively easy to implement, at least if we don’t consider Patch Guard and though it might not give 100% piece of mind it at least goes a long way in the right direction. However there might be other ways to get a high resolution timer on the x86. I’ll suggest two such methods here, others may exist:
1.       Delta timer. Using a low-res timer, then wait for the moment when it increments and then do your measurement and try with register only instructions to fill the gap so that a cache miss will exactly cause the timer to increment and a cache hit will not. I call this method of timing a delta timer. Probably somebody in the literature has assigned another name and if so please contact me and tell me. On old single core CPU’s register only instructions had deterministic timing. This is not the case on multi core CPU’s and this makes it a bit less reliable. The figure below shows a graph of 30 observations from my Sandy bridge trying to when trying to time out 8000 nano seconds. This suggest that if we need to use a delta timer to get 95% correct we need at least 10 cache misses to be sure there was cache misses at all. This brings it into a range where the performance counter detection method would work well. However If we accept an error rate of 1/6 the delta timer might do the job with while being difficult to detect using performance counters.  Now remember that this is not a real scenario where we have a lot more noise already. But it’s unfortunately not infeasible that delta timer would work in many real world scenarios – especially for flush+reload since flush+reload is a lot less susceptible to noise anyway than the other methods. But who says the coarseness of the timer should be constant – with a random value (random drift?)the delta timer would be difficult to use – that again is a different scenario than what has been proposed for java scripts.
2.       Thread timer. I’ve dubbed this kind of time “Thread timer” without knowing if there is any literature on that subject. The idea is to have a thread running and by a signal through a global variable start counting. When the global variable is reset it stops counting. The count then serves as a high resolution timer. There is problems in implementing such a timer. Mainly accesses to the global variable is not automatically perfectly synchronized leaving a non deterministic (and significant) time gap from the timer is set to start to it actually starts – and for stopping the same problem exists. Now intel does provide instructions to deal with this – mfence and the lock prefix – however I’ve not investigated those as I figure they’d be difficult to generate in java (I find the java attack vector the most threatening). For native code they are likely to bring around improvement, but not determinism. The first thing I noticed while experimenting was that I got absolutely random results until I started assigning the thread to a specific CPU (core or hyper thread). The behavior is very different on hyper threads and real cores. The unit for the thread timer is the number of times ecx was incremented during 1 uncached read and for 10 Cache misses. The curve for hits look similar. You may notice that it’s shorter than the delta timer. This is because I’ve deleted a handful of observation with the value 0 – that is where the thread timer didn’t start at all! When timing for only one cache miss I get only a few observations with counts. This hints at that we need at least a handful of cache misses to be able to tell misses from hits with a thread timer. Obviously this is not the last words on thread timers, but it hints at that they are fairly reliable on hyper threads and a little less so real cores and that thread timers are unsuitable for spying on events that gives only a few cache misses as feed back. The thread timer code I used is listed below. In short my suggested use of  CR4 to reduce resolution does pose a problem to an attack looking for rare events with only a few cache misses – say key presses.
while (g_Start == 0);
             g_Time = 0;
             __asm
             {
             WaitLooo:
                    //mfence
                    cmp g_Start, 0
                    jz WaitLooo
                    xor ecx, ecx
             again:
                   
                    inc ecx
                    cmp[g_Start], 0
                    jnz again
                    mov[g_Time], ecx
             }

       g_Start = 0;


Maybe I drink too much

In private conversations I’ve suggested in the past to use cache side channel attacks to detect rootkits including SMM root kits. My suggestion goes like this:If we have a root kit type that tries to read usermode private keys in memory. The possibility exists that we can trigger this behavior and use it for an evict + time attack. Imagine an SMM with a hook on SomeOpenPrivateKeyFunction() which reads the input irrespectively of validity and that our input exceeds a cache line but that SomeOpenPrivateKeyFunction ()won’t read past a syntax error in the key. The scenario sounds farfetched but maybe not too farfetched. Then setup a private key the size of two cache lines with a syntax error in the first cache line. Flush the second cache line and then call  SomeOpenPrivateKeyFunction(). Now time accessing the second cache line of the fake private key. If SMM touched it chances are it used the cache and we’ll be able to see the time difference. Obviously there are plenty of things a root kit could do to mitigate, but at least we started a game of hide and seek.
Another interesting (mis-)use for cache side channel attacks would be detection of undesired runtime environments. Emulators and VM’s could be detected. Emulators could be detected because it’s quite difficult to emulate a cache correctly and most just plain skip it. VM’s could be detected by bootkits through the fact that the cache is likely to be full under a VM, but on a real system it should be empty on boot. I’m sure other cache based methods would be available.

 

Conclusion

Cache side channel attacks certainly have a high coolness factor which is why I bothered to write about it. On the surface of it spying on another VM from a sandboxed java script webpage – being able to exfiltrate cryptokey,key strokes, mouse movement etc. sounds like the perfect storm. Add to it that(in my opinion and limited knowledge) nobody found a good strategy for mitigation yet.
Certainly cache side channels poses a real danger to systems. But cache side channel attacks won’t be responsible for a buffer-overflow-type wave of new breeches. The value of figuring out private keys or reducing the entropy of KASLR is just so much lower than a traditional breech. In the face of the mitigation options already available cache side channel attacks becomes even further reduced. Attackers can only figure out private keys if they are used often, attackers cannot really spy on key strokes with reduced resolution of timer. Java script spying is made more difficult and with it’s level of noise it spyes only on repeated events and only until somebody closes the webpage. However the message should be: On sensitive systems these mitigations should be implemented – especially the lower resolution timer and dedublication measures. And that java is already implementing a lower timer resolution is a good thing, even in the presense of other timers and even though I to some extend would’ve done details in different ways.
And finally just because there currently are no good mitigations on the table certainly doesn’t mean there isn’t one and even less than perfect mitigations raises the cost/benefit ratio for an attacker. I for one would love to spend more time with performance counters. I just have to wonder why kind of quirks thread timers and delta timers have in the microops world of performance counters.

Literature
Oren et al (2015): “Spy in the Sandbox, Practical Cache Attacks in Javascript”; Yossef Oren, P.Kemerlis, Simha Sethumadhavan and Angelos D. Keromytis. arXiv: 1502.07373v2
Gruss, Spreitzer & Mangard (2015): “Cache Template Attacks: Automating Attacks on Inclusive Last-Level Caches”
JimĂ©nez(201x) : “Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches”; http://taco.cse.tamu.edu/pdfs/15-jimenez.pdf
Seaborn (2015): “L3 cache mapping on Sandy Bridge CPUs“; Mark Seabrorn. http://lackingrhoticity.blogspot.de/2015/04/l3-cache-mapping-on-sandy-bridge-cpus.html

Gruss &Maurice(2015): “Rowhammer.js”; Daniel Gruss, Clementine Maurice. http://arxiv.org/pdf/1507.06955v1.pdf

2 comments:

  1. Great article. I do not quite follow why the Delta Timer requires the performance counters. Once we are able to differentiate between a cache miss and a cache hit using the register only instructions, aren't we done?

    ReplyDelete
    Replies
    1. Low frequency attacks can easily lead to false positives or false negatives detection with performance counters on machines that are not idle. My solution to this problem was to cause RDTSC(P) instructions to throw access violations (through CR4) and start/stop performance counting on these. This essentially blends out all the noise that other processes creates. This scheme can be broken if the attacker uses another high-resolution timer such as a delta or thread timer. This is what I wished to convey. Hope it clarifies things. Else write me an email or contact me on twitter and I'll see if I can explain it better.

      Delete