Big Data – Publication

Challenge

The client was struggling to store its growing volume of generated data, which can accumulate to gigabytes and even terabytes in a relatively short period.

Client Overview

The client is a California-based publishing house with 1,800 publishers that generate terabytes of digital log data. They use the collected data to produce on-demand reports and graphs that inform new business strategies and planning for the future.

Solution

Agile Soft Systems provided a Big Data-powered solution that helped the client solve its scalability issue and make its system highly available.

Benefits

The system helps the client handle large amounts of data and keep scaling storage in step with business growth.

Project Details

Challenges They Were Facing!

Business Challenges

The client collects data in the form of logs for different transactions. The acquisition, storage, access, and analysis of this digital information provide knowledge and strategic insights to the senior management of each publisher. Each time users visit a website or make a transaction, the company can log everything they do: every link they click, each form they submit, what time they log in and out, any errors they encounter, and just about anything else one can imagine. As the business data grew, the client was having difficulty processing it in their existing system.

Technical Challenges

In the existing system, the client collected the logs, parsed them, and stored them in an RDBMS. Various BI reports were extracted from the RDBMS data. As the business grew, the storage and access aspects (prerequisites for later BI analytics) began to pose challenges around how to store this growing data. Data management becomes a greater obstacle as more and more data needs to be collected and stored. As the data sets grew, it became possible to run out of space on a single system.

What We Delivered!

Experts at AgileSoft Systems studied the existing data flow and the rate at which data was being generated. We then designed a system that can handle a larger amount of data, keep scaling with growth, and provide the input/output operations per second (IOPS) necessary to deliver data to analytic tools. We chose the Hadoop ecosystem, which not only solved the scalability issue but also made the system highly available.

  • All access logs are pipelined through a Flume source and sink to the S3 file system in real time, which eliminates the need for local storage.
  • Multiple MapReduce jobs were written to apply business logic and transform the data into a much more compact format that is useful for reports.
  • Hive tables are mapped to the file system, which allows HiveQL queries to be run on demand to generate the reports.
  • All MR jobs, Hive tables, and Oozie workflows are hosted in the cloud, which removes the issue of hardware elasticity.
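
To illustrate the kind of transformation a MapReduce job performs on access logs, here is a minimal sketch in plain Python. It is not the client's actual code: the log line format, the sample data, and the function names (map_line, shuffle, reduce_counts, run_job) are all assumptions made for illustration. It mimics the map, shuffle, and reduce phases in a single process to count actions per publisher, which is the sort of reduced output a report could be built on.

```python
from collections import defaultdict

# Hypothetical log line format (an assumption, not the client's real format):
#   <ISO timestamp> <publisher_id> <action> <url>
SAMPLE_LOGS = [
    "2020-01-01T10:00:00 pub-001 click /home",
    "2020-01-01T10:00:05 pub-001 submit /signup",
    "2020-01-01T10:01:00 pub-002 click /home",
    "malformed line",
    "2020-01-01T10:02:00 pub-001 click /pricing",
]


def map_line(line):
    """Map step: emit ((publisher_id, action), 1) for each well-formed line."""
    parts = line.split()
    if len(parts) != 4:
        return  # skip lines that do not match the assumed format
    _timestamp, publisher_id, action, _url = parts
    yield (publisher_id, action), 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in-process over an iterable of log lines."""
    mapped = (pair for line in lines for pair in map_line(line))
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    for (publisher_id, action), count in run_job(SAMPLE_LOGS).items():
        print(f"{publisher_id}\t{action}\t{count}")
```

In a real Hadoop deployment the map and reduce functions would run as distributed tasks over files in the file system, and the framework would handle the shuffle; the sketch only shows how raw log lines shrink into a small per-publisher summary.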

Outcome

  • Faster on-demand reports are available to management
  • Scalability improved, with high availability
  • Hardware and maintenance costs were reduced by 40%
  • Cloud hosting made the system collaborative and secure