Big Data – Publication
On-demand business intelligence reports for a US-based publication company.
A California-based online publication company with 1,800 publishers generates terabytes of digital data (logs). The acquisition, storage, access, and analysis of this information gives the senior management of each publisher knowledge and strategic insight. Each time a user visits a website or makes a transaction, the company can log everything that user does: every link clicked, each form submitted, login and logout times, any errors encountered, and much else. The requirement was to generate on-demand reports and graphs from this data, helping management develop new business strategies and plan for the future.
The existing system collected the logs, parsed them, and stored the results in an RDBMS, from which various BI reports were extracted. As the business grew, storage and access (prerequisites for later BI analytics) became a challenge, because the data could accumulate to gigabytes and even terabytes in a relatively short period of time. Data management becomes a greater obstacle as more data must be collected and stored, and as the data sets grew, a single system could run out of space.
We studied the existing data flow and the rate at which data was being generated, then designed a system that can handle very large volumes of data, keep scaling with growth, and provide the input/output operations per second (IOPS) needed to deliver data to analytic tools. We chose the Hadoop ecosystem, which not only solved the scalability issue but also made the system highly available.
- All access logs are pipelined to a Flume source and written through a sink to the S3 file system in real time, so no local storage is needed
- Multiple MapReduce jobs apply the business logic and transform the data into a much more compact format that is useful for reports
- Hive tables are mapped to the file system, which allows HiveQL queries to be run on demand to generate the reports
- All MapReduce jobs, Hive tables, and Oozie workflows are hosted in the cloud, which removes the issue of hardware elasticity
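
The reduction step performed by the MapReduce jobs can be illustrated with a minimal sketch. It is not the project's actual job code: it assumes access logs in the Apache Common Log Format, assumes (hypothetically) that the first path segment of each request identifies the publisher, and uses page views per publisher per hour as a stand-in for the real business logic. The names `map_line`, `reduce_counts`, and `summarize` are invented for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: Apache Common Log Format, e.g.
# 10.0.0.1 - - [10/Oct/2023:13:55:36 -0700] "GET /pub42/a/1 HTTP/1.1" 200 2326
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def map_line(line):
    """Map step: emit ((publisher, hour), 1) for each parseable log line."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return []  # malformed lines are skipped rather than failing the job
    ts = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
    # Hypothetical convention: the first path segment names the publisher.
    publisher = match.group("path").lstrip("/").split("/", 1)[0] or "unknown"
    hour = ts.strftime("%Y-%m-%d %H:00")
    return [((publisher, hour), 1)]


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def summarize(lines):
    """Run the map and reduce steps in-process over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_line(line)]
    return reduce_counts(pairs)


if __name__ == "__main__":
    sample = [
        '10.0.0.1 - - [10/Oct/2023:13:55:36 -0700] "GET /pub42/a/1 HTTP/1.1" 200 2326',
        '10.0.0.2 - - [10/Oct/2023:13:58:01 -0700] "GET /pub42/a/2 HTTP/1.1" 200 512',
        '10.0.0.3 - - [10/Oct/2023:14:02:10 -0700] "POST /pub7/form HTTP/1.1" 500 0',
        "not a log line",
    ]
    for (publisher, hour), views in sorted(summarize(sample).items()):
        print(publisher, hour, views)
```

The output has one row per publisher and hour instead of one row per request, which is the kind of compact table that a Hive query can then scan quickly. In a real cluster the map and reduce functions would run as separate distributed tasks rather than in a single process.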
Value to Customer
- Faster on-demand reports are available to management
- Scalability increased with high availability
- Hardware and maintenance costs reduced by 40 percent
- Cloud hosting made the system collaborative and secure