You can do many things with your big data if you can bring it to the cloud. You can run business intelligence or data analytics with it and obtain valuable insights. You can make it available to consumers and customers anytime/anywhere and facilitate better collaboration and product distribution. Or you can simply store it for safekeeping.
But before you can leverage the power of the cloud, there's one big hurdle you need to clear first - getting all that big data there.
The problem that just keeps getting bigger
Like all big things, big data takes a great deal of work to move from one point to another. Let's first try to get a handle on how large these data sets can be.
According to McKinsey Global Institute's 2011 report entitled "Big Data: The Next Frontier for Innovation, Competition, and Productivity", almost all sectors in the United States have, on average, hundreds of terabytes of data stored per company. Many companies have already exceeded the 1-petabyte mark.
The report also reveals how the volume of data is growing at a tremendous pace as companies gather even more data per transaction and per interaction with customers and consumers. In areas like health care, security, retail, utilities, manufacturing, and transportation, data is being collected not only through traditional interfaces like computer terminals but also through RFID tags and all sorts of sensors.
Some of the individual files produced during data collection have also gotten much bigger than before. In health care, for instance, clinical data can now come in the form of images (e.g., from X-rays, CT scans, and ultrasounds) and videos. Imaging data collected from one patient alone can easily consume several gigabytes of storage space.
If you think that's big, consider the volume of data gathered by system monitors from a Boeing 737. A single cross-country flight of just one 737 can already generate 240 terabytes of data.
Even we, the general public, are willingly contributing to the explosive growth of big data as more of us create and consume multimedia, transact online, interact with one another through social media, and use mobile devices.
The sheer size alone of all the data that has to be moved to the cloud can already be a game changer. But really, the size of big data is just half of the story.
Just how long can it take to transfer big data to the cloud?
Now that we have an idea of the data sizes we're dealing with, it's time to talk about the capacities of the transport mechanisms we have on hand. Since the usual way of transporting data to the cloud is through an Internet connection, it's important to know how large typical bandwidths are these days.
Small and medium-sized businesses in the US typically have Internet connections with upload speeds of up to 10 Mbps (megabits per second). At that speed, a 100 GB upload needs about a day to complete. Most residential users, on the other hand, have upload speeds of only around 0.6 Mbps. At that rate, the same 100 GB upload would theoretically take more than two weeks.
But what about companies that handle terabytes of data? Here are the upload times for a one (1) terabyte load over some of the more common Internet connection speeds (taken from a blog post by Werner Vogels, Amazon.com's CTO):
| Bandwidth | Upload time for 1 TB |
|-----------|----------------------|
| 10 Mbps | 13 days |
| 100 Mbps | 1-2 days |
| 1 Gbps | less than a day |
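You can sanity-check figures like these yourself. The sketch below computes ideal upload times from file size and bandwidth; the table's numbers run a bit higher because real links never sustain their full advertised speed.

```python
# Rough upload-time estimates from file size and bandwidth.
# These are best-case figures: a perfectly utilized link with no
# protocol overhead, so real transfers will take longer.

def upload_days(size_gb: float, bandwidth_mbps: float) -> float:
    """Return the ideal upload time in days for size_gb over bandwidth_mbps."""
    size_megabits = size_gb * 8 * 1000        # 1 GB = 8,000 megabits (decimal units)
    seconds = size_megabits / bandwidth_mbps
    return seconds / 86400                    # 86,400 seconds per day

# 100 GB over a 10 Mbps small-business link: about a day
print(f"{upload_days(100, 10):.1f} days")

# 1 TB (1,000 GB) over the bandwidths in the table above
for mbps in (10, 100, 1000):
    print(f"{mbps:>5} Mbps -> {upload_days(1000, mbps):.1f} days")
```

The 0.6 Mbps residential case falls out of the same formula: `upload_days(100, 0.6)` comes to roughly 15 days, which is why a consumer connection is a non-starter for this kind of job.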
For companies that deal with hundreds of terabytes, like those serving online movies, uploading files at these speeds is simply not feasible. Clearly, when you put together the size of big data and the width of the pipe (i.e., your Internet connection) you're going to push it through, what you get is an insanely slow process.
That is why even Amazon offers a "manual" transport service for customers looking for a faster way to move large volumes of data to the cloud. This service, known as AWS Import/Export, involves shipping portable storage devices whose contents are then loaded into Amazon S3.
Increasing bandwidth certainly looks like a logical solution. Unfortunately, file sizes and bandwidths aren't the only things that factor into a big data transfer.
All those upload speeds are only good in theory, though. In the real world, you can't estimate upload time from bandwidth and file size alone, because a couple of other factors can slow the process even further. One of them is your location relative to the specific part of the cloud you'll be uploading files to: the farther the distance, the longer uploads will take.
Where our problem lies
The root of the problem lies in the very nature of the network protocol we normally use to transfer files: TCP (Transmission Control Protocol). TCP is very sensitive to network conditions like latency and packet loss. Unfortunately, when you transfer big files over a Wide Area Network (WAN) - which is exactly what sits between your offline data and its destination in the cloud - latency and packet loss can degrade your transfer in a big way.
I won't be discussing the technical details of this problem here but if you want to know more about it, how serious it is, and how we are able to solve it, I encourage you to download the whitepaper entitled "How to Boost File Transfer Speeds 100x Without Increasing Your Bandwidth".
For now, let me just say that even if you increase your bandwidth, latency and packet loss can bring down your effective throughput (actual transfer speed) substantially. Again, depending on where you are relative to your destination in the cloud, your effective throughput can be just 50% - or even a mere 1% - of your advertised bandwidth. Not very cost-effective, is it?
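The whitepaper covers the details, but a widely used rule of thumb for this effect is the Mathis et al. approximation, which bounds steady-state TCP throughput by roughly MSS / (RTT × √loss). The sketch below uses illustrative parameter values (a standard 1460-byte segment and 1% packet loss) to show why distance hurts so much even on a fat pipe:

```python
import math

def tcp_throughput_mbps(mss_bytes: float, rtt_ms: float, loss_rate: float) -> float:
    """Approximate upper bound on steady-state TCP throughput
    (Mathis et al. rule of thumb, constant ~1.22)."""
    rtt_s = rtt_ms / 1000
    bytes_per_sec = (mss_bytes / rtt_s) * (1.22 / math.sqrt(loss_rate))
    return bytes_per_sec * 8 / 1e6  # convert bytes/sec to megabits/sec

# Same 1% packet loss, increasing round-trip time (i.e., distance):
for rtt in (10, 50, 200):  # ms: metro, cross-country, intercontinental
    print(f"RTT {rtt:>3} ms -> at most {tcp_throughput_mbps(1460, rtt, 0.01):.1f} Mbps")
```

Notice that at intercontinental latencies a single TCP stream tops out below 1 Mbps under these conditions, no matter how big the pipe is - which is exactly the "1% of advertised bandwidth" scenario described above.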
The fastest way to send large files to the cloud
A better way to transfer big files to the cloud is to take advantage of a hybrid transfer protocol known as AFTP (Accelerated File Transfer Protocol). This protocol is a TCP/UDP hybrid that can boost file transfer speeds by as much as 100 times, practically canceling out the effects of latency and packet loss.
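AFTP's wire protocol isn't public, so the toy below is not AFTP itself - it only illustrates the general idea behind such TCP/UDP hybrids: bulk data flows over UDP, which keeps sending at full rate regardless of latency, while the application (rather than TCP) tracks which chunks arrived and re-sends the rest.

```python
import socket

# Conceptual sketch only: this toy ignores congestion control, security,
# and pacing. It shows the hybrid idea - chunked data over UDP, with
# loss recovery handled at the application layer instead of by TCP.

CHUNK = 1024
data = bytes(range(256)) * 40  # 10,240 bytes of sample payload

recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv.bind(("127.0.0.1", 0))
recv.settimeout(0.5)
addr = recv.getsockname()

send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

chunks = {i: data[i * CHUNK:(i + 1) * CHUNK]
          for i in range((len(data) + CHUNK - 1) // CHUNK)}
received = {}

pending = set(chunks)
while pending:
    for seq in sorted(pending):
        # 4-byte sequence number header, then the chunk payload
        send.sendto(seq.to_bytes(4, "big") + chunks[seq], addr)
    try:
        while len(received) < len(chunks):
            pkt, _ = recv.recvfrom(CHUNK + 4)
            received[int.from_bytes(pkt[:4], "big")] = pkt[4:]
    except socket.timeout:
        pass
    # Application-level "ack" pass: re-send only chunks that never arrived
    pending = set(chunks) - set(received)

reassembled = b"".join(received[i] for i in sorted(received))
print(reassembled == data)
```

Because the sender doesn't wait a full round trip for each acknowledgment, a scheme like this stays fast on high-latency links where a single TCP stream would crawl.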
Because AFTP is supported by JSCAPE MFT Server, you can deploy an EC2 instance of JSCAPE MFT Server on Amazon and then use it to provide an AFTP file transfer service. For zero financial risk, you can give the free evaluation version a test run by clicking the download button at the end of this blog post. Once your server is set up, you can upload files via an AFTP-enabled file transfer client like AnyClient (it's also free) or a locally installed instance of JSCAPE MFT Server.
As soon as you've moved all your data to the cloud, you can make it available to other cloud-based applications.
Poor network conditions can prevent you from harnessing the potential of big data cloud computing. One way to address this problem is by avoiding an Internet file transfer altogether and simply shipping portable storage devices containing your data to your cloud service providers.
Or you can use AFTP.