Introduction to Apache Spark
Apache Spark is an open-source, general-purpose framework for distributed cluster computing. It is designed to deliver computational speed for everything from machine learning and stream processing to complex SQL queries, and it makes it easy to process and distribute work on large datasets across multiple computers.
Moreover, Spark uses in-memory cluster computing to speed up applications by reducing the need to write to disk. It provides APIs for multiple programming languages, such as Python, R, and Scala, which eliminate much of the lower-level work that would otherwise be needed to manage big data.
Data collection is booming, and the growing volume of data being produced has led to new methodologies for analyzing it. Speed is essential for individuals and industries alike when working through the large amounts of information arriving from all fronts: as the amount of data increases, the technology used to make sense of it has to keep up. Apache Spark is one of the newer open-source technologies that offers this capability. In this tutorial, you will learn how to install Apache Spark on Ubuntu.
Pre-requisites
- This tutorial is performed on a Self-Managed Ubuntu 18.04 server as the root user.
Install Dependencies
You should ensure that your system's package information is up to date. To refresh the package index, execute the below command:
root@ubuntu1804:~# apt update -y
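Note that apt update only refreshes the package index. If you also want to upgrade the installed packages to their latest versions, follow it with:
root@ubuntu1804:~# apt upgrade -y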
Since Java is needed to run Apache Spark, make sure that a JDK is installed. To install the default JDK, execute the below command:
root@ubuntu1804:~# apt install default-jdk -y
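You can confirm that Java is now available by checking the reported version (on Ubuntu 18.04, the default-jdk package provides OpenJDK 11):
root@ubuntu1804:~# java -version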
Download Apache Spark
Next, download Apache Spark to the server. When this article was written, version 3.0.1 was the newest release. Download Apache Spark using the below command:
root@ubuntu1804:~# wget https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
--2020-09-09 22:18:41--  https://downloads.apache.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
Resolving downloads.apache.org (downloads.apache.org)... 88.99.95.219, 2a01:4f8:10a:201a::2
Connecting to downloads.apache.org (downloads.apache.org)|88.99.95.219|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 219929956 (210M) [application/x-gzip]
Saving to: 'spark-3.0.1-bin-hadoop2.7.tgz'

spark-3.0.1-bin-hadoop2.7.tgz 100%[=========================================================================>] 209.74M  24.1MB/s   in 9.4s

2020-09-09 22:18:51 (22.3 MB/s) - 'spark-3.0.1-bin-hadoop2.7.tgz' saved [219929956/219929956]
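Optionally, you can verify the integrity of the archive by computing its SHA-512 checksum and comparing the result against the checksum published for this release on the Apache download page:
root@ubuntu1804:~# sha512sum spark-3.0.1-bin-hadoop2.7.tgz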
After you have finished downloading, extract the Apache Spark tar file with the below command:
root@ubuntu1804:~# tar -xvzf spark-*
Finally, move the extracted directory to /opt as shown below:
root@ubuntu1804:~# mv spark-3.0.1-bin-hadoop2.7/ /opt/spark
root@ubuntu1804:~#
Configure the Environment
Before you start the Spark master server, you need to configure a few environment variables. First, set the environment variables in the .profile file with the below commands:
root@ubuntu1804:~# echo "export SPARK_HOME=/opt/spark" >> ~/.profile
root@ubuntu1804:~# echo "export PATH=$PATH:/opt/spark/bin:/opt/spark/sbin" >> ~/.profile
root@ubuntu1804:~# echo "export PYSPARK_PYTHON=/usr/bin/python3" >> ~/.profile
To make the new environment variables available to the current shell and to Apache Spark, source the file with the below command:
root@ubuntu1804:~# source ~/.profile
root@ubuntu1804:~#
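You can confirm that the variables are set in the current shell; the first command below should print /opt/spark and the second /opt/spark/bin/spark-shell:
root@ubuntu1804:~# echo $SPARK_HOME
root@ubuntu1804:~# which spark-shell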
Start Apache Spark
With the environment configured, you can start the Spark master server. The previous step added the necessary directories to the system PATH variable, so you can run the below command from any directory:
root@ubuntu1804:~# start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-ubuntu1804.awesome.com.out
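If you want to check the master from the command line, the spark:// URL it is listening on is also written to the log file named in the output above, so a grep over that file (using the log path from your own output) should show it:
root@ubuntu1804:~# grep "spark://" /opt/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-ubuntu1804.awesome.com.out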
Here, the Apache Spark user interface is running on a remote server. To view the web interface, use SSH tunneling to forward a port from your local machine to the server. Log out of the server and then execute the below command, replacing the hostname with your server's hostname or IP:
ssh -L 8080:localhost:8080 root@ubuntu1804.awesome.com
You can now view the web interface from a browser on your local machine by visiting http://localhost:8080/. Once the web interface has loaded, copy the Spark master URL shown there, as you will need it in the next step.
Start Spark Worker Process
In this tutorial, Apache Spark is installed on a single machine, so the worker process will also be started on this server. To start the worker, return to the terminal, run the below command, and paste in the Spark URL from the web interface:
root@ubuntu1804:~# start-slave.sh spark://ubuntu1804.awesome.com:7077
starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ubuntu1804.awesome.com.out
root@ubuntu1804:~#
Once the worker is running, you will see it listed in the web interface.
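You can also confirm from the command line that both daemons are running. The jps tool, which ships with the JDK, lists the running Java processes and should include a Master and a Worker entry:
root@ubuntu1804:~# jps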
Verify Spark Shell
The web interface is convenient, but you should also make sure that Spark's command-line environment works as expected. Open the Spark shell by executing the below command in the terminal:
root@ubuntu1804:~# spark-shell
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.0.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
20/09/09 22:48:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://ubuntu1804.awesome.com:4040
Spark context available as 'sc' (master = local[*], app id = local-1599706095232).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.8)
Type in expressions to have them evaluated.
Type :help for more information.

scala> println("Welcome to Spark!")
Welcome to Spark!

scala>
Spark provides a shell for Scala as well as Python. Press CTRL + D to exit the current Spark shell. To test PySpark, run the below command:
root@ubuntu1804:~# pyspark
Python 3.6.9 (default, Jul 17 2020, 12:50:27)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.0.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
20/09/09 22:52:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Python version 3.6.9 (default, Jul 17 2020 12:50:27)
SparkSession available as 'spark'.
>>> print('Hello pyspark!')
Hello pyspark!
Note: You will see the above warnings if one of the newer versions of the Java JDK (Java 9 or later) is installed; Java 8 does not produce them. As per https://github.com/apache/spark/pull/24825 and https://issues.apache.org/jira/browse/HADOOP-10848, this issue has been resolved.
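To confirm that jobs can be submitted to the standalone cluster, and not just run inside the local shells, you can submit one of the example programs bundled with the Spark distribution to the master. This is a sketch using the hostname from this tutorial; substitute your own master URL:
root@ubuntu1804:~# spark-submit --master spark://ubuntu1804.awesome.com:7077 /opt/spark/examples/src/main/python/pi.py 10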
Shut Down Apache Spark
If for any reason you need to shut down the master and worker Spark processes, run the below commands:
root@ubuntu1804:~# stop-slave.sh
stopping org.apache.spark.deploy.worker.Worker
root@ubuntu1804:~# stop-master.sh
stopping org.apache.spark.deploy.master.Master
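Alternatively, the sbin directory also provides stop-all.sh, which stops the master and the workers in one step; note that it drives workers through the standalone launch scripts, so on a multi-node setup it expects SSH access to the hosts listed in its configuration. Running jps afterwards should no longer list Master or Worker processes:
root@ubuntu1804:~# stop-all.sh
root@ubuntu1804:~# jps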
Conclusion
Apache Spark offers an intuitive interface for working with big datasets. In this tutorial, you have learned how to get a basic setup going on a single system, but Apache Spark thrives on distributed systems. Hopefully this information will help you get up and running with your next big data project!