In this tutorial, we will see how to install PySpark with Java 8 on Ubuntu 18.04.
We will install Java 8 and Spark, and configure all the required environment variables.
My machine runs Ubuntu 18.04, and I am using Java 8 along with Anaconda3. If you follow the steps, you should be able to install PySpark without any problem.
Make Sure That You Have Java Installed
If you don’t, run the following command in the terminal:
sudo apt install openjdk-8-jdk
After installation, if you run java -version in the terminal you will get:
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (build 1.8.0_212-8u212-b03-0ubuntu1.18.04.1-b03)
OpenJDK 64-Bit Server VM (build 25.212-b03, mixed mode)
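If you prefer to verify the version from a script rather than by eye, a small sketch like the following can parse the java -version output. The check_java8 helper is our own illustration, not part of any tool; note that java -version prints to stderr, hence the 2>&1.

```shell
# check_java8 VERSION: succeed if VERSION is a Java 8 version string (1.8.x)
check_java8() {
  case "$1" in
    1.8.*) return 0 ;;
    *)     return 1 ;;
  esac
}

# `java -version` writes to stderr; pull the quoted version out with awk
ver=$(java -version 2>&1 | awk -F '"' '/version/ {print $2; exit}')
if check_java8 "$ver"; then
  echo "Java 8 detected: $ver"
else
  echo "Java 8 not found (got: $ver)"
fi
```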
Download Spark from https://spark.apache.org/downloads.html
Remember the directory where you downloaded it. I downloaded it to my default Downloads folder, which is where I will install Spark.
Set the $JAVA_HOME Environment Variable
For this, run the following in the terminal:
sudo vim /etc/environment
It will open the file in vim. Then, on a new line after the PATH variable, add:
JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
(That is the default install location of the openjdk-8-jdk package; adjust the path if your JDK lives elsewhere.)
Press Esc, type :wq! and hit enter. This will save the edit to the file. Then, in the terminal run:
source /etc/environment
Don’t forget to run the last line in the terminal, as that will create the environment variable and load it into the currently running shell. Now, if you run:
echo $JAVA_HOME
The output should be:
/usr/lib/jvm/java-8-openjdk-amd64
Just like it was added. Now, some versions of Ubuntu do not source the /etc/environment file every time we open the terminal, so it’s better to add it to the .bashrc file as well, since .bashrc is loaded every time a terminal is opened. So run the following command in the terminal:
vim ~/.bashrc
When the file opens, add at the end:
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
We will add the Spark variables below it later. Exit for now and load the .bashrc file in the terminal again by running:
source ~/.bashrc
Or you can exit this terminal and open another. By now, if you run echo $JAVA_HOME you should get the expected output.
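To double-check that $JAVA_HOME actually points at a usable JDK, and not just at any directory, here is a quick sketch; the jdk_ok helper is purely illustrative.

```shell
# jdk_ok DIR: succeed if DIR looks like a JDK home, i.e. has an executable bin/java
jdk_ok() {
  [ -n "$1" ] && [ -x "$1/bin/java" ]
}

if jdk_ok "$JAVA_HOME"; then
  echo "JAVA_HOME looks good: $JAVA_HOME"
else
  echo "JAVA_HOME is unset or does not point at a JDK"
fi
```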
The easy way: just install PySpark with pip. This method is best for WSL (Windows Subsystem for Linux) Ubuntu:
pip install pyspark
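After pip finishes, a quick way to confirm the package is importable is shown below; the module_ok helper is our own convenience wrapper, not part of pip or PySpark.

```shell
# module_ok NAME: succeed if python3 can import the module NAME
module_ok() {
  python3 -c "import $1" 2>/dev/null
}

if module_ok pyspark; then
  echo "pyspark is importable"
else
  echo "pyspark is not importable yet; check the pip install output"
fi
```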
The manual way: go to the directory where the Spark archive was downloaded and extract it:
cd ~/Downloads
sudo tar -zxvf spark-2.4.3-bin-hadoop2.7.tgz
Note: if your Spark file is a different version, correct the name accordingly.
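Before wiring up the environment variables, it can help to confirm the extraction produced a complete Spark directory. A small sketch, where spark_dir_ok is a hypothetical helper and the path assumes the 2.4.3 archive from above:

```shell
# spark_dir_ok DIR: succeed if DIR looks like an extracted Spark distribution
spark_dir_ok() {
  [ -d "$1" ] && [ -x "$1/bin/spark-shell" ]
}

# Adjust the path to match the version you downloaded
if spark_dir_ok "$HOME/Downloads/spark-2.4.3-bin-hadoop2.7"; then
  echo "Spark extracted OK"
else
  echo "Spark directory incomplete; re-run the tar command"
fi
```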
Configure Environment Variables for Spark
This step is only needed if you installed Spark the manual way; skip it if you used pip.
Open the .bashrc file again (vim ~/.bashrc) and add the following at the end:
export SPARK_HOME=~/Downloads/spark-2.4.3-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
export PATH=$PATH:~/anaconda3/bin
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
export PATH=$PATH:$JAVA_HOME/jre/bin
Save the file and exit. Finally, load the .bashrc file again in the terminal by running:
source ~/.bashrc
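Once the new variables are loaded, spark-shell should resolve on the PATH. A quick check; on_path is our own wrapper around the standard command -v builtin:

```shell
# on_path CMD: succeed if CMD resolves on the current PATH
on_path() {
  command -v "$1" >/dev/null 2>&1
}

if on_path spark-shell; then
  echo "spark-shell found at: $(command -v spark-shell)"
else
  echo "spark-shell not on PATH; re-check the SPARK_HOME and PATH exports"
fi
```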
Finally, execute the commands below; with the --version flag, spark-shell prints the Spark version and exits (run spark-shell without the flag to launch the interactive Spark shell).
cd $SPARK_HOME
cd bin
spark-shell --version