Hadoop, Spark, and Storm: Powering the World of Data

With Big Data comes big responsibility: the need to process, store, and analyse this data. The most popular Big Data tools in use today are Hadoop, Spark, and Storm. Beyond individual developers, well-known enterprises use one or more of these tools for their Big Data needs. Here we list five organizations using each tool and how they have implemented it:

Powered by Hadoop

1. Alibaba: The biggest ecommerce platform in Asia uses a 15-node cluster dedicated to processing various sorts of business data dumped out of databases and joining them together. This data is then fed into iSearch, their vertical search engine. Each node has 8 cores, 16GB RAM and 1.4TB storage.
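Joining database dumps is a classic MapReduce job. The sketch below simulates a reduce-side join in plain Python, with no Hadoop API involved, to show the idea: the map step tags each record with its source, the shuffle groups records by join key, and the reduce step pairs them up. The record layout (`product_id`, `title`, `order_id`, `quantity`) is hypothetical and is not taken from Alibaba's actual data.

```python
from collections import defaultdict

# Hypothetical records standing in for two database dumps.
products = [
    {"product_id": 1, "title": "Phone case"},
    {"product_id": 2, "title": "USB cable"},
]
orders = [
    {"order_id": 100, "product_id": 1, "quantity": 3},
    {"order_id": 101, "product_id": 2, "quantity": 1},
    {"order_id": 102, "product_id": 1, "quantity": 2},
]


def map_phase(products, orders):
    """Emit (join_key, tagged_record) pairs, tagging each record with its source."""
    for p in products:
        yield p["product_id"], ("product", p)
    for o in orders:
        yield o["product_id"], ("order", o)


def shuffle(pairs):
    """Group values by key, as the framework's shuffle step would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, pair every order with the matching product record."""
    for key in sorted(groups):
        values = groups[key]
        product_rows = [r for tag, r in values if tag == "product"]
        order_rows = [r for tag, r in values if tag == "order"]
        for p in product_rows:
            for o in order_rows:
                yield {
                    "order_id": o["order_id"],
                    "title": p["title"],
                    "quantity": o["quantity"],
                }


joined = list(reduce_phase(shuffle(map_phase(products, orders))))
for row in joined:
    print(row)
```

On a real cluster the map and reduce functions would run in parallel across nodes and the shuffle would move data over the network; here everything runs in one process for illustration.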

2. eBay Inc.: The multibillion-dollar company has operations in 30 countries and a worldwide presence. eBay uses Hadoop for search optimization and research, and reports heavy use of Java MapReduce, Apache Pig, Apache Hive, and Apache HBase. It runs a 532-node cluster, with 8 cores per node and 5.3 PB of total storage.

3. Facebook: The fastest company in the S&P 500 index to reach a market cap of $250 billion uses Hadoop to store copies of internal log and dimension data sources and uses them as a source for reporting/analytics and machine learning. They have 2 major clusters:
• A 1100-machine cluster with 8800 cores and about 12 PB raw storage.
• A 300-machine cluster with 2400 cores and about 3 PB raw storage.
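As a quick sanity check on these figures, the per-machine resources can be worked out from the totals. This is a minimal sketch that assumes decimal units (1 PB = 1000 TB) and treats the "about" storage figures as exact; the dictionary keys are illustrative names only.

```python
# Cluster totals as quoted in the list above.
clusters = {
    "large": {"machines": 1100, "cores": 8800, "raw_storage_pb": 12},
    "small": {"machines": 300, "cores": 2400, "raw_storage_pb": 3},
}
TB_PER_PB = 1000  # assumption: decimal units

for name, c in clusters.items():
    c["cores_per_machine"] = c["cores"] / c["machines"]
    c["tb_per_machine"] = c["raw_storage_pb"] * TB_PER_PB / c["machines"]
    print(
        f"{name}: {c['cores_per_machine']:.0f} cores and "
        f"{c['tb_per_machine']:.1f} TB raw storage per machine"
    )
```

Both clusters come out at 8 cores per machine, with roughly 10 to 11 TB of raw storage each.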

4. Google: The internet giant has partnered with IBM to establish the University Initiative to Address Internet-Scale Computing Challenges. One of the resources created for this is a cluster of processors running an open source implementation of Google’s published computing infrastructure (MapReduce and GFS from Apache’s Hadoop project).

5. Twitter: The micro-blogging platform uses Apache Hadoop to store and process tweets, log files, and many other types of data generated across Twitter. They employ committers on Apache Pig, Apache Avro, Apache Hive, and Apache Cassandra.


Powered by Spark


1. NASA JPL: The NASA Jet Propulsion Laboratory uses Spark for their Deep Space Network, the largest and most sensitive scientific telecommunications system in the world. They are also working on SciSpark, a scalable scientific processing platform that uses Spark capabilities to access data stored in disparate file systems.

2. Yahoo: Hadoop and Spark clusters are used for deep learning to gain intelligence from massive amounts of online data. They have designed CaffeOnSpark, a Spark deep learning package that enables distributed deep learning on a cluster of GPU and CPU servers.

3. Tripadvisor: The popular travel website uses Spark for massively parallel NLP. Spark powers their tagging algorithm because it is an excellent data-parallel engine that lets them spread the data among all the nodes in the cluster.

4. Uber: Every day, this multinational online taxi dispatch company gathers terabytes of event data from its mobile users. By using Kafka, Spark Streaming, and HDFS to build a continuous ETL pipeline, Uber converts raw unstructured event data into structured data as it is collected, and then uses it for further, more complex analytics.
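The core of such a pipeline, turning raw events into structured rows in small batches, can be sketched in plain Python. This is not Uber's code and it uses no Kafka, Spark, or HDFS APIs: the list `raw_events` stands in for a Kafka topic, `micro_batches` mimics micro-batch streaming, and `sink` stands in for structured files on HDFS. The event fields (`user`, `type`, `ts`) are hypothetical.

```python
import json
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield lists of up to batch_size items, mimicking micro-batch streaming."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def to_structured(raw):
    """Parse one raw event; return a flat typed dict, or None if malformed."""
    try:
        event = json.loads(raw)
        return {
            "user_id": int(event["user"]),
            "event_type": str(event["type"]),
            "timestamp": int(event["ts"]),
        }
    except (ValueError, KeyError, TypeError):
        return None


# Hypothetical raw events, including a corrupted line and an incomplete record.
raw_events = [
    '{"user": "7", "type": "app_open", "ts": 1700000000}',
    "corrupted line",
    '{"user": "9", "type": "trip_request", "ts": 1700000005}',
    '{"type": "missing_user", "ts": 1700000009}',
]

sink = []  # stands in for structured output written to HDFS
for batch in micro_batches(raw_events, batch_size=2):
    rows = [row for row in map(to_structured, batch) if row is not None]
    sink.append(rows)

print(sink)
```

Malformed events are dropped here for brevity; a production pipeline would more likely route them to a separate location for inspection.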

5. Pinterest: Pinterest leverages Spark Streaming to gain immediate insight into how users all over the world are engaging with Pins in real time. As a result, Pinterest can make more relevant recommendations and surface related Pins as people navigate the site.



Powered by Storm


1. Groupon: The global e-commerce platform with a presence in 28 countries uses Storm to build real-time data integration systems. Storm helps them analyze, clean, normalize, and resolve large amounts of non-unique data points with low latency and high throughput.

2. Baidu: China’s biggest search engine uses Storm to process search logs and supply real-time stats for accounting pv, ar-time, etc. This project helps their operations team determine and monitor service status, and could be extended to more uses in the future.

3. Spotify: The music streaming service has Storm powering a wide range of real-time features, including music recommendation, monitoring, analytics, and ads targeting. Together with Kafka, memcached, Cassandra, and netty-zmtp based messaging, Storm enables Spotify to build low-latency fault-tolerant distributed systems with ease.

4. Flipboard: The world’s first and most popular social digital magazine is using Storm across a wide range of their services, from content search to real-time analytics to generating custom magazine feeds. They integrate Storm across their infrastructure with systems like ElasticSearch, HBase, Hadoop and HDFS to create a highly scalable data platform.

5. WebMD: The online medical information and advisory platform uses Storm to power their Medscape Medpulse mobile application. A Storm topology captures and processes tweets with the Twitter streaming API, enhances them with metadata and images, performs real-time NLP, and executes several business rules at WebMD. Storm also monitors their selection of blogs and powers their search indexing process.
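A Storm topology chains a source (a spout) to a series of processing steps (bolts). The sketch below imitates that shape with plain Python generators rather than the Storm API, and it is not WebMD's implementation: the hashtag extraction is a trivial stand-in for real NLP, and the function names, tweet fields, and rule are all hypothetical.

```python
def tweet_spout(tweets):
    """Spout: emit one tuple per incoming tweet."""
    for tweet in tweets:
        yield tweet


def enrich_bolt(stream):
    """Bolt: attach simple metadata (word count, hashtags) to each tweet."""
    for tweet in stream:
        words = tweet["text"].split()
        yield {
            **tweet,
            "word_count": len(words),
            "hashtags": [w.lstrip("#").lower() for w in words if w.startswith("#")],
        }


def rules_bolt(stream, required_tags):
    """Bolt: business rule that keeps tweets with at least one required hashtag."""
    for tweet in stream:
        if required_tags & set(tweet["hashtags"]):
            yield tweet


# Hypothetical sample input standing in for the Twitter streaming API.
tweets = [
    {"id": 1, "text": "New study on sleep #Cardiology"},
    {"id": 2, "text": "Lunch was great"},
    {"id": 3, "text": "Trial results #oncology #research"},
]

# Wire the stages together: spout -> enrich -> rules.
topology = rules_bolt(enrich_bolt(tweet_spout(tweets)), {"cardiology", "oncology"})
results = list(topology)
for tweet in results:
    print(tweet["id"], tweet["hashtags"])
```

In Storm itself each stage would run as parallel tasks across the cluster, with the framework handling tuple routing and retries; the generator chain only illustrates the data flow.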

