Data Engineering


Smart Workflows: The Key to Boosting Adoption and Fuelling Growth

We are living in an exciting time, right in the middle of a tech revolution. Ever-changing technology is helping businesses solve innovative challenges, and at the same time user expectations are evolving faster than ever. To stay ahead, businesses must anticipate and adapt to these changes. Understanding and meeting these expectations is critical because how users perceive and interact with technology significantly shapes their trust and satisfaction. The key to this transformation? A shift from static experiences to...


Data Quality in the Training Stage of ML Pipelines

Data quality refers to the degree to which a dataset meets the following key requirements for reliable machine learning:

- Accuracy: the correctness and precision of data values in representing real-world facts. Example: temperature readings from sensors match actual temperatures within acceptable margins; a January temperature in Kashmir cannot read 30 degrees Celsius.
- Completeness: the presence of all necessary data points and features. Example: no missing values in crucial fields.
- Consistency: the uniformity of data across the dataset. Example: consistent date...
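The three requirements above can be expressed as simple programmatic checks. Here is a minimal sketch in plain Python; the field names, plausibility thresholds, and the `validate_record` helper are hypothetical illustrations, not from the article:

```python
from datetime import datetime

# Hypothetical plausibility range for a January temperature in Kashmir (Celsius)
TEMP_MIN, TEMP_MAX = -20.0, 15.0
REQUIRED_FIELDS = ("sensor_id", "temperature_c", "recorded_at")
DATE_FORMAT = "%Y-%m-%d"  # a single agreed-upon format enforces consistency

def validate_record(record: dict) -> list:
    """Return a list of data-quality violations for one record."""
    errors = []
    # Completeness: every crucial field must be present and non-null
    for field in REQUIRED_FIELDS:
        if record.get(field) is None:
            errors.append(f"missing field: {field}")
    # Accuracy: the value must be plausible for the real-world quantity
    temp = record.get("temperature_c")
    if temp is not None and not (TEMP_MIN <= temp <= TEMP_MAX):
        errors.append(f"implausible temperature: {temp}")
    # Consistency: dates must follow the single agreed format
    recorded = record.get("recorded_at")
    if recorded is not None:
        try:
            datetime.strptime(recorded, DATE_FORMAT)
        except ValueError:
            errors.append(f"inconsistent date format: {recorded}")
    return errors

# A 30 °C January reading in the wrong date format fails two checks
print(validate_record({"sensor_id": "srinagar-01", "temperature_c": 30.0,
                       "recorded_at": "15/01/2024"}))
# → ['implausible temperature: 30.0', 'inconsistent date format: 15/01/2024']
```

In a real pipeline these checks would run over the whole dataset before training, with failures either blocking the run or quarantining the offending rows.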


Ensuring Data Quality in ML Pipelines using DVC

Ensuring data quality is crucial for building reliable ML models. In this guide, we integrate data quality checks into an ML pipeline using DVC (Data Version Control) to enable versioning, tracking, and automated validation of datasets before model training. Machine learning pipelines automate the end-to-end ML process, handling everything from data ingestion to model deployment, and help break down complex workflows into manageable stages.

Important stages of the pipeline:

- Data quality check
- Data preprocessing
- Feature engineering
- Model training and evaluation
- Model serving and monitoring

This article focuses on...
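The stages listed above map naturally onto a DVC pipeline definition. A minimal `dvc.yaml` sketch for the first few stages; the script names, file paths, and stage wiring are hypothetical placeholders, not taken from the article:

```yaml
stages:
  data_quality_check:
    cmd: python src/check_quality.py data/raw.csv
    deps:
      - src/check_quality.py
      - data/raw.csv
    outs:
      - data/validated.csv
  preprocess:
    cmd: python src/preprocess.py data/validated.csv
    deps:
      - src/preprocess.py
      - data/validated.csv
    outs:
      - data/processed.csv
  train:
    cmd: python src/train.py data/processed.csv
    deps:
      - src/train.py
      - data/processed.csv
    outs:
      - models/model.pkl
```

With this layout, `dvc repro` re-runs only the stages whose dependencies changed, so a failed quality check stops the pipeline before any training happens.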


Reproducibility in ML: The Role of Data Versioning

Once you’ve run machine learning models in production, reproducibility becomes one of the non-negotiable aspects of delivery, since the baseline performance of the model has already been defined. The ability to replicate results, debug issues, and improve models requires that all experiments are easily traceable. Unlike traditional software development, where versioning the source code is sufficient to reproduce builds, ML workflows can degrade significantly with changes to the data, its distribution, or any preprocessing steps. In this article, we’ll...
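The gap between versioning code and versioning data can be sketched with a typical DVC workflow; the file paths below are hypothetical placeholders, and the commit hash is left as a placeholder:

```
# Track the dataset with DVC instead of Git
dvc add data/train.csv        # writes data/train.csv.dvc, a small pointer file
git add data/train.csv.dvc .gitignore
git commit -m "Version training data"

# Later, restore the exact data that produced a given result
git checkout <commit>
dvc checkout                  # retrieves the matching data/train.csv
```

Git tracks only the lightweight `.dvc` pointer file, so each commit pins both the code and the exact dataset it was run against.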


Fine-Tuning Shuffle Partitions in Apache Spark for Maximum Efficiency

Apache Spark’s shuffle partitions are critical in data processing, especially during operations like joins and aggregations. Properly configuring these partitions is essential for optimizing performance. By default, Spark sets the shuffle partition count to 200. While this may work for small datasets (less than 20 GB), it is usually inadequate for larger data sizes. Besides, who would work with just 20 GB of data on Spark? To optimize performance, it’s crucial to determine the appropriate number of shuffle partitions. Here are some...
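One common rule of thumb is to size shuffle partitions at roughly 128 MB each. A minimal sketch of that heuristic, assuming the ~128 MB target; the `recommended_shuffle_partitions` helper is hypothetical, while `spark.sql.shuffle.partitions` is the real Spark configuration key:

```python
def recommended_shuffle_partitions(shuffle_data_gb: float,
                                   target_partition_mb: int = 128) -> int:
    """Estimate a shuffle partition count from total shuffle data size.

    Uses the rule of thumb of ~128 MB per shuffle partition; the right
    target depends on your cluster's cores, memory, and workload.
    """
    total_mb = shuffle_data_gb * 1024
    # Round up so no partition exceeds the target; never go below 1
    return max(1, -(-int(total_mb) // target_partition_mb))

# 500 GB of shuffle data at ~128 MB per partition
print(recommended_shuffle_partitions(500))  # → 4000

# Applied to a SparkSession (requires pyspark; shown for context):
# spark.conf.set("spark.sql.shuffle.partitions",
#                recommended_shuffle_partitions(500))
```

The default of 200 would give ~2.5 GB partitions on that 500 GB shuffle, which is exactly the kind of mismatch the article's tuning advice addresses.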


To ETL or Not to ELT: Choosing the Right Approach for Data Processing and Management

Deciding between Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT) is one of the most important decisions a data engineer has to make. Transforming data before loading (ETL) or after loading (ELT) can shape:

- how efficiently we handle data
- how flexible we are in adapting to changing requirements
- and how well we scale

As a data engineer, I've had many opportunities to work with various data pipelines and tools. A commonly asked question is 'Should we use ETL or ELT?'. Both methods have...


Data Drives Innovation and Growth at Australian Fintech

A tailored strategy combining software engineering, data engineering, and ML operations enabled data-led business decisioning for a fintech enterprise. Our client, a leading Australian non-banking fintech providing digital instalments and lending services to customers across Australia, New Zealand, and Singapore, was facing a unique challenge. Having acquired and integrated several small businesses over the years, the company was grappling with a mix of disparate and legacy systems and processes, making it extremely arduous to build critical components around Machine Learning...


Driving Data-led Decision-making through Data Engineering for Superior Business Impact

Springer Nature is a global academic publishing company that advances discovery by publishing trusted research. Following the 2015 merger and subsequent growth through product acquisitions, the company’s workflows became inundated with multiple systems and different data models driving article submissions from authors - a key business process. Data analysts across different teams would use several manual processes while navigating a complex ecosystem of multiple data stores to build an aggregate view of the business, identify trends and support...


Revolutionizing Marine Conservation with a Scalable Data Platform

The Great Barrier Reef, the world’s largest marine ecosystem, is in peril. As climate change and myriad other threats expose the reef to bleaching, it is paramount to gather data about the Reef - a task that cannot be accomplished by any individual or entity alone. In partnership with Citizens of the Great Barrier Reef, we embarked on an ambitious journey to harness technology for conservation. Our mission: to create a scalable, feature-rich data platform that would drive one of...


Transforming Out-of-Home Advertising with State-of-the-art Products

Talon, a specialist Out-of-Home (OOH) agency, embarked on a digital transformation journey in partnership with Sahaj, aiming to re-imagine OOH by utilizing technology to not just automate, simplify and optimise outcomes but also incorporate cutting-edge technologies to make the medium attributable. Three purpose-built platforms emerged from this collaborative journey: Plato revolutionizes OOH inventory management by aggregating paper and digital inventory across multiple markets into a single intuitive campaign planning and delivery system. It allows users to explore, plan, check availability,...
