Getting the right data to the right people at the right time is the name of the game in today’s demanding marketplace.
Every company has to find a way to harness big data and use it to drive growth. And if your organization isn’t talking big data, you are at a competitive disadvantage.
This article offers a top-level view of big data’s evolution and key components. It can help you understand why big data matters and which technologies are central to working with it.
With this foundation, you can proceed to the next step — addressing what to do with your data and how.
With every passing moment, the pace of data creation compounds.
In the time it takes you to read these few paragraphs, an enormous amount of new data will be created: emails, searches, messages, photos, and more. Every few minutes, this cycle repeats and grows.
In 2019, 90% of the world’s digital data had been created in the prior two years alone.
By 2025, the global datasphere will grow to 175 zettabytes (up from 45 zettabytes in 2019). And nearly 30% of the world’s data will need real-time processing.
Over the last decade, an entire ecosystem of technologies has emerged to meet the business demand for processing an unprecedented amount of consumer data.
Big data arises when there is more incoming data than current data management systems can process.
The arrival of smartphones and tablets was the tipping point that led to big data. With the internet as the catalyst, data creation exploded as music, documents, books, movies, conversations, images, text messages, announcements, and alerts became readily accessible.
Digital channels (websites, applications, social media) exist to entertain, inform, and add convenience to our lives. But their role goes beyond the consumer audience — accumulating invaluable data to inform business strategies.
Digital technology that logs, aggregates, and integrates with open data sources enables organizations to get the most out of their data and methodically improve their bottom lines. Big data itself is commonly categorized into structured, semi-structured, and unstructured formats.
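As a rough illustration, the same customer activity might appear in all three forms; the sample values below are invented for the example.

```python
# A rough sketch of the three broad data formats; the sample values are invented.

# Structured: fixed schema, e.g. a row destined for a relational table.
structured_row = ("C-1001", "2019-06-01", 42.50)  # customer_id, order_date, total

# Semi-structured: self-describing but flexible, e.g. JSON from a web app.
semi_structured = {
    "customer_id": "C-1001",
    "events": [{"type": "page_view", "page": "/pricing"},
               {"type": "purchase", "total": 42.50}],
}

# Unstructured: free-form content with no schema, e.g. a support email body.
unstructured = "Hi, I just placed an order but never got a confirmation email..."
```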
Until recently, businesses relied on basic technologies from a handful of vendors. In the 1980s, Windows and the Mac OS debuted with built-in data management technology, and early relational database engines became commercially viable.
Then Linux came onto the scene in 1991, releasing a free operating system kernel. This paved the way for big data management.
Big data technologies are the software systems specifically designed to analyze, process, and extract information from large, complex data sets. A range of programs and systems can do this.
In the early 2000s, Google proposed the Google File System, a technology for indexing and managing rapidly mounting data. A key tenet of the idea was using many low-cost machines to accomplish big tasks more efficiently and inexpensively than the hardware of a central server.
Before the Information Age, data was transactional and structured. Today’s data is varied and needs a file system that can ingest and sort massive influxes of unstructured data. Open-source and commercial software tools automate the actions needed to make these new varieties of data, and their attendant metadata, readily available for analysis.
Inspired by the promise of distributing the processing load for the increasing volumes of data, Doug Cutting and Mike Cafarella created Hadoop in 2005.
The Apache Software Foundation took the value of data to the next level with the release of Hadoop 1.0 in December 2011. Today, this open-source software technology is packaged with services and support from new vendors to manage companies’ most valuable asset: data.
The Hadoop architecture relies on distributing workloads across numerous low-cost commodity servers. Each of these “pizza boxes” (so called because they are an inch high and less than 20 inches wide and deep) has a CPU, memory, and disk storage. They are simple servers, yet when running as nodes in a Hadoop cluster they can process immense amounts of varied, unstructured data.
A more powerful machine called the “name node” manages the distribution of incoming data across the nodes. By default, each block of data is replicated to three nodes, and a file might not exist in its entirety on any single node.
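The placement idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not actual HDFS code: the node names, block size, and round-robin policy are assumptions, and real HDFS placement also weighs rack topology, free space, and load.

```python
# A simplified sketch of how a name node might spread a file's blocks across
# data nodes with three-way replication. Illustrative only; real HDFS also
# considers rack awareness, free space, and node load when placing replicas.
from itertools import cycle

DATA_NODES = ["node-01", "node-02", "node-03", "node-04", "node-05"]  # hypothetical cluster
REPLICATION = 3            # HDFS's default replication factor
BLOCK_SIZE = 128 * 2**20   # 128 MiB, a common default block size

def place_blocks(file_size_bytes):
    """Return a mapping of block index -> the nodes holding a copy of that block."""
    node_ring = cycle(DATA_NODES)
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)  # ceiling division
    return {block: [next(node_ring) for _ in range(REPLICATION)]
            for block in range(num_blocks)}

# A 1 GiB file becomes 8 blocks, each stored on three different nodes.
print(place_blocks(1 * 2**30))
```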
The majority of enterprises today use open source software (OSS). From operating systems to utilities to data management software, OSS has become the standard fare for corporate software development groups.
A leading OSS organization, the Apache Software Foundation is a non-profit group of thousands of volunteers who contribute their time and skills to building useful software tools.
As its steward, Apache continuously works to enhance the Hadoop codebase, including its distributed file system, the Hadoop Distributed File System (HDFS), and the distributed processing framework known as MapReduce.
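The core MapReduce pattern can be imitated in a few lines of plain Python. This is a conceptual, single-machine sketch of the idea, not Hadoop’s actual API: a map step emits key/value pairs, a shuffle groups them by key, and a reduce step aggregates each group.

```python
# A conceptual, single-machine imitation of the MapReduce pattern.
from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) for every word in the document.
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data is not a fad", "big data needs big storage"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(shuffle(pairs)))  # {'big': 3, 'data': 2, ...}
```

In a real cluster, the map and reduce phases run in parallel on the nodes that already hold the data, which is what makes the pattern scale.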
Within the past few years, Apache released nearly 50 related software systems and components for the Hadoop ecosystem. Several of these systems have counterparts in the commercial software industry.
Vendors have packaged Apache’s Hadoop with user interfaces and extensions, while offering enterprise-class support for a service fee. In this segment of the OSS industry, Cloudera, Hortonworks, and Pivotal are leading firms serving big data environments.
Software systems are now so tightly integrated with the core Hadoop environment that no commercial vendor has attempted to replicate all of that functionality on its own. The range of OSS systems, tools, products, and extensions to Hadoop includes capabilities to import, query, secure, schedule, manage, and analyze data from various sources.
Ancillary data often sits in corporate NAS and SAN systems, in cloud storage, or behind on-demand programmatic requests that return JSON, XML, or other structures. The same applies to public datasets: freely available data covering economic activity by industry classification, weather, demographics, location data, and thousands more topics. Data at this scale demands storage.
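Pulling such ancillary data into the lake is usually a matter of a scripted request. The sketch below assumes the third-party requests library; the endpoint, parameters, and station ID are hypothetical placeholders, not a real service.

```python
# A minimal sketch of pulling ancillary data from an on-demand API that
# returns JSON. The endpoint and parameters below are hypothetical placeholders.
import requests

def fetch_weather_history(station_id, start, end):
    response = requests.get(
        "https://api.example.com/v1/weather/daily",  # hypothetical endpoint
        params={"station": station_id, "start": start, "end": end},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()  # parsed JSON, ready to land in the data lake

# records = fetch_weather_history("ORL-001", "2019-01-01", "2019-12-31")
```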
Distributed file systems greatly reduce storage costs while providing redundancy and high availability. Each node has its own local storage, and the drives do not need to be fast; solid-state drives (SSDs) are not required.
Inexpensive, high-capacity commodity drives are enough. Upon ingestion, each file is written to three drives by default. Hadoop’s management tools and the name node monitor each node’s activity and health so that poorly performing nodes can be bypassed or taken out of the distributed file system index for maintenance.
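The health-monitoring idea is easy to picture as a heartbeat check. The toy sketch below is illustrative only; Hadoop’s real mechanism uses heartbeats and block reports with configurable timeouts, and the threshold and node data here are invented.

```python
# A toy sketch of health monitoring: track a heartbeat timestamp per data node
# and stop routing work to any node that has gone quiet. The numbers are invented.
import time

HEARTBEAT_TIMEOUT_S = 30  # hypothetical threshold

last_heartbeat = {                     # node -> time of last heartbeat (seconds)
    "node-01": time.time(),
    "node-02": time.time() - 120,      # this node has stopped reporting
    "node-03": time.time(),
}

def healthy_nodes(now=None):
    """Return nodes that have reported a heartbeat recently enough to use."""
    now = now or time.time()
    return [node for node, seen in last_heartbeat.items()
            if now - seen <= HEARTBEAT_TIMEOUT_S]

print(healthy_nodes())  # node-02 is bypassed until it reports again
```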
The term “data lake” describes a vast store of many different types of data, drawn from very different sources and held in a dozen or more file formats.
Some files are compressed or zipped. Some carry machine-generated metadata, such as that embedded in photos taken with any phone or digital camera: the date, camera settings, and often the location are available for analysis.
For example, a query to the lake for text messages that included an image taken between 9 p.m. and 2 a.m. on Friday or Saturday nights in Orlando on an iPhone would probably show fireworks at Disney World in at least 25% of the images.
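That kind of query boils down to a filter over extracted photo metadata. The sketch below is a minimal illustration; the record layout and field names are invented, and for simplicity each photo is attributed to the calendar day on which it was taken.

```python
# A sketch of the metadata filter that query implies, run over records
# already extracted into the lake. Field names are invented for illustration.
from datetime import datetime

def matches(photo):
    """Friday/Saturday night iPhone photos taken in Orlando, 9 p.m. to 2 a.m."""
    taken = datetime.fromisoformat(photo["taken_at"])
    night_window = taken.hour >= 21 or taken.hour < 2
    weekend_night = taken.weekday() in (4, 5)  # Friday or Saturday
    return (night_window and weekend_night
            and photo["city"] == "Orlando"
            and photo["device"].startswith("iPhone"))

photos = [
    {"taken_at": "2019-07-05T21:45:00", "city": "Orlando", "device": "iPhone X"},
    {"taken_at": "2019-07-03T14:10:00", "city": "Tampa",   "device": "Pixel 3"},
]
print([p for p in photos if matches(p)])  # only the first record matches
```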
Enterprise administration of applications, with their storage requirements, security granularity, compliance demands, and dependencies, required Hadoop distributions (such as those from Cloudera and Hortonworks) to mature these capabilities as they evolved into managed services for the enterprise.
Hadoop sits within a much broader software ecosystem, and popular analysis tools are valuable in developing big data solutions. Administration through Cisco and HP tools is common.
Commercial software companies have also begun connecting their products to Hadoop, with offerings from IBM, Microsoft, Informatica, SAP, Tableau, Experian, and other established vendors.
Analytics is the endgame of building a big data environment. The rise of big data has created a new role, the data scientist: an analyst, technologist, and statistician all in one.
A data scientist might perform exploratory queries with Spark or Impala, or work in a programming language such as R or Python. R, a free language, is rapidly growing in popularity; it is approachable for anyone comfortable with macro languages such as Excel’s, and its libraries implement a wide range of statistical and graphical techniques.
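An exploratory query in Spark’s Python API might look like the sketch below. It assumes PySpark is installed; the dataset path and column names are placeholders for whatever actually lives in your lake.

```python
# A minimal exploratory query in PySpark, the Python API for Spark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("exploration").getOrCreate()

events = spark.read.json("hdfs:///datalake/events/")   # hypothetical path

# Count events per channel per day and eyeball the biggest buckets.
(events
    .groupBy("channel", F.to_date("event_time").alias("day"))
    .count()
    .orderBy(F.desc("count"))
    .show(20))
```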
Cloud computing is very different from owning server-class hardware and software. It involves cloud storage, multi-tenant shared hosts, and managed virtual servers that are not housed on a company’s premises. In cloud environments, an organization neither owns the equipment nor employs the network and security technologists who manage the systems.
Cloud computing provides a hosted experience, where services are fully remote and accessed with a browser.
The investment to build a 10- or 20-node HDFS cluster in the cloud is relatively small compared to the cost of implementing a large-scale server cluster with conventional technologies. The initial build-out of redundant data centers by Amazon, Microsoft, Google, IBM, Rackspace, and others has passed, and systems are now available at prices below the cost of a single technician. Cloud computing fees change rapidly, with pricing metered against various usage patterns.
Big data is not a fad or soon-to-fade trending hashtag. Since you began reading this article, more than 100 million photos have been created, with a sizeable portion having a first-degree relationship to your industry. And the pace of data creation continues to increase.
Distributed computing can help organizations gain a 360-degree view of their customers through big data collection and analysis, and companies that embrace big data technologies and solutions will pull ahead of their competitors.
Big data technologies are becoming an industry standard in finance, commerce, insurance, healthcare, and distribution. Adopting them is key to optimization and continued growth.
Companies that embrace data solutions can continue to improve management and operational processes and create a competitive advantage to withstand an ever-evolving marketplace.