Apache Spark

Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop’s two-stage, disk-based MapReduce paradigm, Spark’s in-memory primitives can run certain applications up to 100 times faster. Because user programs can load data into a cluster’s memory and query it repeatedly, Spark is well suited to machine learning algorithms, which typically make many passes over the same data.
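To make the "load once, query repeatedly" idea concrete, here is a small plain-Python toy. It is not Spark code and uses no Spark API: the class names `DiskDataset` and `CachedDataset` and the helper functions are hypothetical and made up for this illustration. It only counts how often the file is read when every query goes back to disk versus when the data is kept in memory after the first read.

```python
import os
import tempfile


class DiskDataset:
    """Toy dataset that re-reads its file on every query (disk-based style)."""

    def __init__(self, path):
        self.path = path
        self.reads = 0  # number of times the file was actually opened

    def rows(self):
        self.reads += 1
        with open(self.path) as f:
            return [int(line) for line in f]


class CachedDataset(DiskDataset):
    """Toy dataset that reads its file once and then serves rows from memory."""

    def __init__(self, path):
        super().__init__(path)
        self._cache = None

    def rows(self):
        if self._cache is None:
            self._cache = super().rows()  # the only disk read
        return self._cache


def run_queries(dataset):
    """Run three separate queries over the same data."""
    total = sum(dataset.rows())
    largest = max(dataset.rows())
    evens = sum(1 for x in dataset.rows() if x % 2 == 0)
    return total, largest, evens


def demo():
    # Write the numbers 1..10 to a temporary file, one per line.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write("\n".join(str(i) for i in range(1, 11)))
        path = f.name
    try:
        disk = DiskDataset(path)
        cached = CachedDataset(path)
        results = (run_queries(disk), run_queries(cached))
        return results, disk.reads, cached.reads
    finally:
        os.remove(path)


if __name__ == "__main__":
    results, disk_reads, cached_reads = demo()
    print(results, disk_reads, cached_reads)
```

Both datasets return the same answers, but the disk-based one opens the file once per query while the cached one opens it only once. Iterative workloads that revisit the same data many times benefit from the second pattern.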


Requirements: Hadoop, YARN, Java

Install Hadoop

Install YARN

Install Java



From Pandas to Apache Spark’s DataFrame

Big Data Stack: HDFS, Kibana, ElasticSearch, Neo4J, Apache Spark

