
Eric Wright

Pistachio: Understanding the Value of Data Locality in Cloud

Even in the cloud, the old real estate adage holds true. Three things matter: “Location, location, location.”

Moving your applications to the cloud is often touted as a panacea. There are a lot of architectural decisions to make when thinking about cloud-friendly application designs, though. One of the big issues that organizations find out about the hard way is data locality in relation to the compute layer. As we move more and more content to the cloud using the scale-out capabilities of the front-end, what about the data goodness that is driving all those applications?

Baby got Back-End

Containers are everywhere. PaaS is everywhere. N-Tier application architecture is widely used. Microservices architecture is becoming all the rage. What is common among all of these is that the back-end data that is driving these applications needs to be stored and scaled in conjunction with the front-end application.

The folks at Yahoo Engineering have figured this out, and the result is an interesting open source project they are hosting called Pistachio. The idea behind Pistachio is to ensure that the data is always located close to the compute layer for faster processing. With Big Data and NoSQL being rapidly adopted, many organizations are deploying with blinders on to the negative effects of de-coupling.

De-coupling the data layer is important for scalability and resiliency. That being said, it shouldn’t be done at the expense of performance and usability. These qualities don’t have to be mutually exclusive if the right architecture is put in place for your application infrastructure. This is why Pistachio is interesting.

Consciously Uncoupling Data and Compute

It became a bit of an internet meme when we read that Gwyneth Paltrow and Chris Martin had decided to separate in a way that they described as conscious uncoupling. The same phrase could describe the way the Pistachio architecture works. Although the data layer and the compute layer have been separated, it is done in a way that keeps data and compute close together when and where it is needed the most.


Using a well-coordinated collection of sharding, replicas, and caching, the data is quickly cached, acknowledged, and stored. All the while, it is placed with the intention of maintaining data locality to the compute workload for fast access. This is all pulled together to provide the speed of local access along with the resiliency of a distributed Pistachio deployment. Adoption may not be moving fast just yet, but I would bet that some folks will be bringing Pistachio into their environment for testing.
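To make the idea a little more concrete, here is a minimal Python sketch of how sharding, replicas, and caching can work together to keep compute next to its data. This is not Pistachio's API or implementation. Every name in it (`NODES`, `partition_for`, `owners`, `LocalStore`, `put`, `run_local`) is hypothetical, and the round-robin replica placement is simply an assumption chosen to keep the example short.

```python
import hashlib

# All values below are hypothetical and exist only for illustration.
NUM_PARTITIONS = 8                       # number of shards the keyspace is split into
NODES = ["node-a", "node-b", "node-c"]   # pretend cluster members
REPLICAS = 2                             # copies kept of each partition


def partition_for(key):
    """Map a key to a partition using a stable hash (not Python's salted hash())."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


def owners(partition):
    """Return the nodes holding a partition: primary first, then replicas."""
    return [NODES[(partition + i) % len(NODES)] for i in range(REPLICAS)]


class LocalStore:
    """Per-node store: an in-memory cache in front of a stand-in durable dict."""

    def __init__(self, node):
        self.node = node
        self.cache = {}
        self.durable = {}

    def write(self, key, value):
        # Cache first for fast reads, then persist.
        self.cache[key] = value
        self.durable[key] = value

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.durable.get(key)
        if value is not None:
            self.cache[key] = value  # warm the cache on a miss
        return value


stores = {node: LocalStore(node) for node in NODES}


def put(key, value):
    """Write the value to every node that owns the key's partition."""
    targets = owners(partition_for(key))
    for node in targets:
        stores[node].write(key, value)
    return targets


def run_local(key, fn):
    """Run fn on the primary owner of the key, so the read never leaves that node."""
    primary = owners(partition_for(key))[0]
    return primary, fn(stores[primary].read(key))
```

Calling `put("user:42", "profile")` stores the value on two of the three pretend nodes, and `run_local("user:42", str.upper)` then runs the function against the primary's local copy. The point of the sketch is the routing decision: the work goes to where the partition already lives instead of pulling the data across the network.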

As we keep digging into Big Data discussions here at about:virtualization, interesting stories like this pop up and remind us that this is a hot area of the technology market. Big Data may still be misunderstood, but those who do have the use case are making strong inroads into innovating the platforms for wider consumption.
