Doug Cutting: Big Data Is No Bubble
Even the hype can’t spoil the real future of Big Data, says Hadoop creator Doug Cutting
Apache Hadoop is at the heart of the Big Data movement, and Doug Cutting is its co-creator. He is also now architect at Cloudera, the company whose Hadoop distribution is leading the Big Data world, thanks to deals with Microsoft, Oracle, IBM and others.
Hadoop grew out of Nutch, the open source search engine that Cutting and Michael Cafarella were building, and was developed further at Yahoo to support distributed processing. It has since gone much wider than that, and now no one in the industry is unaware of Big Data – and Hadoop.
TechWeekEurope was pleased to get some time with Cutting at IP Expo in London this week. We first had to work through some definitions and consider where Big Data is heading – and we also found time to examine the other key role Cutting has: he is chair of the Apache Software Foundation, the leading open source body, which manages Hadoop along with other leading open source projects such as OpenOffice and the Apache HTTP Server.
Could he live up to this reputation?
First of all, Cutting explained why Hadoop has the role it does. Hadoop, with its Hadoop Distributed File System (HDFS), can handle multiple types of data in large quantities by distributing storage and processing across clusters of low-cost servers – the basis of real Big Data.
Cloudera’s CDH distribution includes Hadoop, along with the HBase key value store (roughly equivalent to a database – but without the constraints of a fixed relational schema and SQL), language tools including Hive, and an implementation of the MapReduce model developed at Google.
“I think of Big Data as this style of computing where you take advantage of commodity hardware and commodity software and the availability of bulk storage in an integrated platform – and you move the computation to the data,” he says.
“You start with hardware, a bunch of nodes that are all on the same high speed network, and you build a software platform on top of this which makes it appear to be one big computer. It’s a different model of computing.”
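To make "moving the computation to the data" concrete, here is a minimal sketch of the MapReduce idea in plain Python. It is an illustration only, not Hadoop's API: the `partitions` list, `map_partition` and `merge_counts` are hypothetical names, and the "nodes" are simply lists in one process.

```python
from collections import Counter
from functools import reduce

# Hypothetical data: each inner list stands for the records held on one node.
partitions = [
    ["big data is no bubble", "hadoop is open source"],
    ["big clusters of cheap servers"],
    ["data stays where it is stored"],
]


def map_partition(lines):
    """Map step, run 'on the node': count words in the local records only."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


# Only the small per-node summaries are combined; the raw records never move.
partial_counts = [map_partition(p) for p in partitions]
word_counts = reduce(merge_counts, partial_counts, Counter())

print(word_counts["big"], word_counts["data"], word_counts["is"])
```

In a real cluster the map step would run on the machines that already store each block of data, and only the partial counts would cross the network.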
Given how specific his idea of Big Data is, is he concerned about the current level of hype – and therefore confusion – around Big Data?
“Do I mind the term Big Data?” he says. “No, we have to call it something.”
“Am I worried whether this is a bubble?” he goes on. “I’ve been involved in boom and bust hype cycles for many years, and most people in a boom say ‘Oh, this one is different!’ But I really believe that the trends that underlie this really are long term trends. Moore’s law tells us that there will be cheaper and faster hardware year by year. And there is a trend to greater automation in industry, which will be the primary driver of economic growth.”
In any sector of business, it is software and automation which will provide the growth, he says, and Big Data will be the driving force within that.
“The roadmap that Google provides us with really gives me hope that this is not a one-trick pony,” he says. “We have a software foundation that we can build an incredible array of applications on. I see no reason that it can’t become the mainstay of enterprise computing.”
The arrival of Big Data is equivalent to the arrival of the PC or the database, he says. Something new is being done, and it will be transformational, changing how existing technology is used.
What about the danger that the hype means Big Data will outgrow its strength? So many Big Data projects are starting that there can’t be enough experts to do them all, surely?
“Skills are a restraint on growth,” he agrees, but it’s also an opportunity for Cloudera, which is training about 1500 people a month, either within their own organisations, or in visits to cities. “That business is growing – we are hiring more instructors.”
The software can also make it a lot easier, he says. “You can’t automate everything, but there is a long way we can go.” And he believes Cloudera’s tools do just that.
The Big Data approach also takes the strain off hardware administrators, says Cutting, as it automates the process of restoring data when a drive fails. “Once a month you replace all the drives that have failed,” he says. “In the conventional world, even with RAID [Redundant Array of Inexpensive Disks] you still have to get in pretty quick when a drive fails. In the Hadoop world, people just run with dead drives.”
Working with Apache Hadoop, commercialisation is easy, thanks to the Apache licence, says Cutting, and Cloudera follows a well-worn open source model: “Cloudera sells subscriptions, which include two things – support and the management suite.”
As with other open source software, you can use CDH for free, but if you want support you have to pay Cloudera – and install the suite. “We don’t sell it a la carte because the management suite makes it easier for us to support the software. If you are using our tool to configure things, it limits the amount of rope you have to hang yourself with.”
He won’t say how much business Cloudera is doing, but revenue is growing annually, and the company has around 300 customers – apparently including half the Fortune 50, though none of them will talk about the software. Cutting says this is for competitive reasons. One public customer they do have is games company King.com.
All this is far better than the old model of charging licence fees: “Write it once, and charge for it forever? That’s like the music industry, and the model isn’t working well there.”
Lots of people use CDH without paying, of course, but “there is no mileage in resenting that”, he says. Like all open source projects, it’s about mindshare – and CDH is the most established distribution, he says, though he gives a tip of the hat to the Bigtop community distribution, which comes straight from Apache and is “sort of analogous to Fedora compared with Red Hat”. Bigtop will be doing some bleeding-edge work, he says – but CDH comes with a contractual obligation for long-term support.
Getting the big boys in line
One big endorsement is the fact that Oracle, despite the obvious threat to the way it does business, has endorsed Big Data and Hadoop. Cutting’s keynote at IP Expo is introduced by Mike Connaughton, Oracle’s Big Data manager for EMEA, clearly happy to endorse the idea. And later that day, Oracle COO Mark Hurd praises both Hadoop and Big Data.
Cutting is almost embarrassed by the success. “A couple of years ago, we were sort of worried. What is going to happen with Oracle? What is going to happen with Microsoft? What is going to happen with IBM?” Cutting says the Hadoop community expected them to recognise the importance of Big Data, and then come in with their own proprietary version of it.
“We expected we would have to argue this out in the marketplace,” he says, “and it would be tough. But down the line, they have all endorsed Hadoop as their platform of choice. Some of them have become partners with Cloudera, some have gone it alone, and some are in between. But they’ve all used the same open source platform – which surprised me.”
He’s gracious about it: “It is a real endorsement of the platform on one hand, but also of the level of maturity that Oracle has reached. They are not going to try and deny it, and they are not going to try and completely control it – at least not at this point. They realise this is something they have to be involved with.
“To me it looks like they did a really honest appraisal and said, ‘We need to be there, and this is the most effective way to be there’. I think the open source model makes that less frightening in that should they ever be dissatisfied with our relationship they would not have to abandon their customers who are using Hadoop. They could build their own proprietary tools on top – they are not prohibited.”
With Microsoft, IBM, Dell and others on board, this level of consensus around one platform is “unprecedented” says Cutting. “They haven’t done this in the past. Their cloud efforts are all incompatible. There isn’t a single cloud API, nor one for virtualisation.”
But what about the challenge to Oracle’s RDBMS-based, licence-based world? “Think about the choice,” says Cutting. “If they just deny it, and if the rest of the world realises that RDBMS is not everything, they would wither. The choice was clear. They needed to get involved in this, and this is an effective way to do it.”
In any case, relational databases aren’t going away. Big Data may displace some RDBMS work, but it “complements RDBMS” by doing things relational databases never could.
What about rivals? There are other NoSQL approaches, he says – though he doesn’t like the term “NoSQL”.
Cutting reckons other tools, like MongoDB and Couchbase, tend to be “point offerings” compared with CDH, because the Cloudera package includes the whole thing, with the HBase store, along with MapReduce, Pig and Hive – and more tools will be added.
“If your data is in HDFS all these tools are available without moving it,” he says, “and if you have a petabyte of data, it is expensive to move it. If all you need is a key value store, then you might go and buy a NoSQL store or pick up an open source one. But if you also want to do some analytics, you have more processes you want to have on the data, and organisations that want to share the data, you need to be on a more general purpose platform with a suite of tools.”
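A rough back-of-the-envelope calculation shows why moving a petabyte is costly. The sketch below assumes a decimal petabyte and a single 10 Gbit/s link running at full speed with no overhead; both figures are illustrative assumptions, not numbers from Cutting.

```python
PETABYTE_BYTES = 10**15             # decimal petabyte (assumption)
LINK_BITS_PER_SECOND = 10 * 10**9   # assumed 10 Gbit/s link, fully utilised
SECONDS_PER_DAY = 86_400


def transfer_days(num_bytes, bits_per_second):
    """Return the ideal time in days to send num_bytes over the link."""
    seconds = num_bytes * 8 / bits_per_second
    return seconds / SECONDS_PER_DAY


days = transfer_days(PETABYTE_BYTES, LINK_BITS_PER_SECOND)
print(f"{days:.1f} days")
```

Even under these generous assumptions the copy takes more than a week, which is the practical argument for running the tools where the data already sits.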
Hadoop is more like a whole operating system, he says. “You are not stuck with an OS that only has one feature – that’s not a terribly useful OS.” And HBase, the store, is a runaway success. “We had the first HBaseCon last year, and it was huge. We had 600 people – for the first conference around a point technology.”
So will MongoDB and Couchbase and the rest be the Progress and Ingres of the Big Data world, left behind by the big success story of Hadoop, as those others were left in the dust (eventually) by Oracle? “I hate to be critical but that doesn’t seem to be an unreasonable analysis to me. I have a hard time seeing them succeed and grow.”
However, like the other 80s database giants, rival NoSQL players could be around for some time: “People invested in a particular technology may have to stick with it for a decade.”
While we have Cutting in front of us, we have to ask about his role chairing the not-for-profit Apache Software Foundation. Is that like being a scoutmaster?
“That makes it sound better than it is,” he says. “It’s more like parenting.
“I am chairman of the board of directors. It is an all-voluntary organisation, and we have a few part-time staff. There is no top-down control. What there is at the top-down level, is a bit of discipline. We have standards we expect projects to adhere to, so no contributors are privileged.”
That means everyone has to take a share of listening to the crazies, he says, and it also means he has to review around 40 projects a month (Apache has more than 100 projects, and they get quarterly reviews). He also says, somewhat ruefully, “I have to make the agenda, and try to make the meeting run on schedule.”
In three years, he’s not had any spats he wants to talk to the media about (including the mudslinging around Oracle and the Java Community Process a couple of years back).
He does have a weather eye on activity around patents, though. Anyone who contributes software to an Apache project must do so with a licence to any patents, he says. “Don’t try to use the Apache licence to submarine your patent out to users.”
But Apache’s manifestly beneficial role may be its best protection. It’s clearly doing good and useful work, and no one would lightly tangle with the body, for fear of looking very bad in comparison.
And that’s more or less how we feel about Cutting.