Data Engineering

Data storage patterns, versioning and partitions

When you have large volumes of data, storing it logically helps users discover information and makes that information easier to understand. In this post, we talk about some of the techniques we use to do so in our application. We use the terminology of AWS S3 buckets to describe how information is stored, but the same techniques can be applied to other cloud providers, non-cloud providers, and bare-metal servers. Most setups will include a high-bandwidth, low-latency network...


12 Factor Spark Applications

Spark is a distributed data processing engine that is widely used in batch and stream processing platforms. Building such platforms comes with a fair share of challenges beyond those of continuous delivery in the microservices world, such as data drift, bad data, and data security. The Twelve-Factor App is an outstanding set of principles for building web apps. Similarly, we at Sahaj have realized over time that a set of concepts and patterns can be applied to building data processing...
