Post number 9 in a series of 12 from one of our provider partners, NTT.
Big Data Challenges
With the growth of video, web, social media, and data analytics, the volume of data is exploding. Organizations large and small need to figure out how to make sense of all of it. There are plenty of interesting new ideas in the tech space for interpreting data; the big challenge right now is how and where to analyze it.
This problem is not a new one. Several years ago I worked with one of the National Science Labs on Beowulf HPC clusters. They realized that while using one extremely powerful server to do all the computing worked, they still hit bottlenecks: that large server could only perform a certain number of calculations in a serial fashion. In other words, it would finish one unit of work and then move on to the next. Not only did it have performance limitations, but this large, powerful computer was also very expensive. The scientists figured out that if they ran hundreds or thousands of low-end servers in parallel, they could increase the speed at which their computations were completed. The basics of the architecture were simple: a certain number of master nodes assigned the work, and additional worker nodes performed the computations and sent the results back to the master nodes to be collected and analyzed. At the time, this approach had very little use outside of a research department.
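For readers who want to see the pattern rather than the prose, here is a minimal single-machine sketch of the master/worker idea described above. It is illustrative only; the function names and work units are made up for this example, and it is not code from any actual Beowulf cluster.

```python
# Minimal master/worker sketch (illustrative only): a "master" splits the
# work into units, a pool of "workers" computes each unit in parallel, and
# the master collects the results -- the same pattern Beowulf-style
# clusters spread across hundreds of physical nodes.
from multiprocessing import Pool

def compute_unit(unit):
    """Stand-in for one unit of scientific computation."""
    return sum(x * x for x in unit)

def master(work_units, num_workers=8):
    # The master assigns units to workers and gathers the partial results.
    with Pool(processes=num_workers) as pool:
        partial_results = pool.map(compute_unit, work_units)
    # Analysis step: here, simply combine the partial results.
    return sum(partial_results)

if __name__ == "__main__":
    # 1,000 units of work, each a range of 10,000 numbers.
    units = [range(i * 10_000, (i + 1) * 10_000) for i in range(1_000)]
    print(master(units))
```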
This has changed significantly with the mountains of data now produced every day. More companies and individuals are analyzing what is being said and done, and they are running into big data challenges. The number of workers with access to computers, as well as the myriad other devices that create and store information, is growing daily. In my job, I am looking to create tangible value for my company; three billion people are trying to do the same. Companies of all sizes and in nearly every vertical are increasingly tasked with decoding the information generated by these many machines. New data sources are creating new data types, including web data, clickstreams, location data, point-of-sale data, social data, building sensors, vehicle and aircraft data, satellite images, medical images, log files, network data, weather data, and more. Video and streaming applications are also increasing the resolution of the images being captured, and as image quality goes up, so does file size.
Some “big data” applications stream hundreds of millions of application instructions, logs, or data points into storage each day using open-source Hadoop and its ecosystem of related tools. Users can then run analytical frameworks such as MapReduce on that data to perform a wide range of analytics and better understand their customers’ needs (a simplified sketch of the programming model follows the list below). These analytics include:
- Evaluation of new features
- Machine learning
- Exploratory analysis
- Event correlation
- Trend analysis
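To give a feel for the programming model behind these analytics, here is a simplified, single-machine sketch of a MapReduce-style job that counts events by type in a set of log lines. It only illustrates the map, shuffle, and reduce pattern that Hadoop distributes across a cluster; the log format and function names are assumptions made up for this example.

```python
# Minimal MapReduce-style sketch (single machine, illustrative only):
# count events by type from log lines -- the same map -> shuffle -> reduce
# pattern that Hadoop runs at cluster scale.
from collections import defaultdict

def map_phase(line):
    # Emit (key, 1) pairs; here the key is an assumed "event type" field.
    event_type = line.split()[1]          # e.g. "2015-06-01 LOGIN user42"
    yield (event_type, 1)

def reduce_phase(key, values):
    # Combine all counts emitted for one key.
    return key, sum(values)

def run(lines):
    grouped = defaultdict(list)           # the "shuffle" step
    for line in lines:
        for key, value in map_phase(line):
            grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

if __name__ == "__main__":
    sample = [
        "2015-06-01 LOGIN user42",
        "2015-06-01 CLICK user42",
        "2015-06-01 LOGIN user7",
    ]
    print(run(sample))                    # {'LOGIN': 2, 'CLICK': 1}
```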
Based on work originally started by research organizations and then perfected by companies like Yahoo and Google, solutions such as Hadoop and MongoDB were created to address the problem of analyzing large datasets. If you look into those solutions, you will find that a large infrastructure footprint is needed to support the computing. For example, Hadoop has an upper limit of 4,000 compute nodes per cluster. The servers are generally low cost, but once you account for all the other infrastructure needed to run the cluster, the cost scales quickly. If you used inexpensive 1U servers and completely filled each rack, you would need 96 racks to reach that node count. In addition to a $2-5K cost per server, you have to provide rack space, power, cooling, network and electrical connections, sometimes storage connectivity, and software licenses. If each server had two power supplies and needed two network connections and two storage connections, you would have to support 8,000 power supplies, 8,000 network ports (167 line cards in a director-class Ethernet switch), and 8,000 storage connections.
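For anyone who wants to check the arithmetic, the short sketch below reproduces the figures above. The 42-servers-per-rack and 48-ports-per-line-card values are assumptions consistent with the numbers quoted in the paragraph, not specifications from any particular vendor.

```python
# Back-of-the-envelope cluster sizing, using the assumptions behind the
# figures quoted above (42 x 1U servers per rack, 48-port line cards).
import math

nodes = 4_000                      # upper limit cited for a Hadoop cluster
servers_per_rack = 42              # 1U servers in a standard 42U rack (assumption)
ports_per_line_card = 48           # director-class Ethernet line card (assumption)

racks = math.ceil(nodes / servers_per_rack)                   # 96 racks
power_supplies = nodes * 2                                    # 8,000 PSUs
network_ports = nodes * 2                                     # 8,000 network ports
storage_ports = nodes * 2                                     # 8,000 storage connections
line_cards = math.ceil(network_ports / ports_per_line_card)   # 167 line cards

server_cost_low, server_cost_high = 2_000, 5_000
print(racks, power_supplies, network_ports, storage_ports, line_cards)
print(f"Server hardware alone: ${nodes * server_cost_low:,} - ${nodes * server_cost_high:,}")
```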
Open-source software is common in these clusters, but other licenses may still be needed, and on top of all that there are software and maintenance costs. You may start with a small cluster, but as the amount of data grows and the time you have to analyze it shrinks, your cluster will need to grow.
In addition to the factors mentioned above, the need to query the data may not be consistent, so investing in infrastructure that may only have value for a limited time does not make sense.
Public and private clouds help bridge the gap by providing expandable resources to fill those holes by:
- Taking away most of the cost of building and maintaining the infrastructure
- Quickly scaling analytics resources
- Providing a pay-per-job model, as opposed to buying everything upfront (a rough cost illustration follows this list)
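As a rough illustration of the pay-per-job point, the sketch below compares owning a cluster year-round against renting capacity only for the hours jobs actually run. Every rate and utilization figure is a hypothetical placeholder chosen for illustration, not pricing from NTT or any other provider.

```python
# Hypothetical cost comparison: owned cluster vs. pay-per-job rental.
# All prices and utilization figures are illustrative placeholders only.
NODES = 100
OWNED_COST_PER_NODE_PER_YEAR = 3_000   # hardware amortization + power/cooling/space (assumed)
RENTAL_COST_PER_NODE_HOUR = 0.50       # assumed hourly rate
HOURS_OF_ANALYSIS_PER_MONTH = 40       # bursty workload: cluster idle the rest of the time

owned_annual = NODES * OWNED_COST_PER_NODE_PER_YEAR
rented_annual = NODES * RENTAL_COST_PER_NODE_HOUR * HOURS_OF_ANALYSIS_PER_MONTH * 12

print(f"Owned:  ${owned_annual:,.0f} per year")
print(f"Rented: ${rented_annual:,.0f} per year")
# With these placeholder assumptions, renting costs a fraction of ownership
# because you pay only while jobs are running.
```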
Advantages of “Big Data” in a Hybrid Cloud
Using a hybrid cloud from a public cloud provider gives you many benefits:
- Both virtualized and bare-metal servers may be used, depending on your application’s needs.
- It eliminates the cost and time of maintaining the infrastructure. The difference between building and managing your own environment and having someone else do it comes down to how you want to spend your resources: cloud providers give you more time to solve the difficult problems that “big data” creates. Building your own infrastructure also means you incur the costs of the room, the power and cooling of that room, and the ongoing maintenance of those assets.
- SLAs are also very important: if you build your own cloud and can’t live up to your SLAs, no penalty compensates your business for the failure. With most third parties, there is an agreement with penalties (including monetary consequences) if they can’t live up to what they promise.
So why are other customers looking at hybrid solutions from NTT Communications to power their “big data” applications?
- Flexibility: Customers can choose the size and hardware of their analytics clusters and adapt to changing requirements using virtual or physical resources. Data scientists can spin up ad hoc clusters for urgent analysis, engineers can run their own dedicated clusters to test new applications, and the data analytics team can dynamically expand its reporting cluster during periods of high load.
- Security: Customers can take advantage of dedicated servers and security devices for their infrastructure.
- Ease-of-use: NTT Communications simplifies the process of deploying and managing the infrastructure while providing seamless integration with storage.
- Cost: Because customers are only charged for the resources they use, costs are reduced significantly by standing up ad hoc clusters for short-lived work. Customers can also avoid dedicated resources and eliminate the additional support needed to maintain their cluster and keep the lights on.
Next Post: Moving Enterprises to a Public or Hybrid Cloud Part 10 – Reducing Software Costs
Contact StrataCore to learn more about NTT America Cloud services (206) 686-3211 or stratacore.com
About the author: Ladd Wimmer
Ladd Wimmer is a valuable member of the NTT Communications team. He has over 15 years of experience implementing, architecting, and supporting enterprise server, storage, and virtualization solutions in a variety of IT computing and service provider environments. He worked as a Systems Engineer/Solution Architect in the Data Center Solutions practice of Dimension Data, most recently serving as technical lead for the rollout of Cisco’s UCS and VCE vBlock platforms across the Dimension Data customer base. Ladd has also run two IBM partner lab environments and worked for an early SaaS provider that created lab environments for Sales, QA testing, and Training.