Setting up Anaconda and Pyspark on M1 Mac

Steps to install Anaconda to run Pyspark projects.

Install Prerequisites

Install Homebrew or update it if already installed:

brew update

Install Python

brew install python@3.9

Install Java

brew install openjdk

Install Scala

brew install scala

Install Spark

brew install apache-spark

Install Anaconda from website or:

brew install --cask anaconda
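
With Anaconda installed, a dedicated environment keeps the project's dependencies isolated. This is a minimal sketch; the environment name pyspark39 and the packages are assumptions, so substitute whatever your project actually needs:

conda create -n pyspark39 python=3.9
conda activate pyspark39
pip install pyspark pytest   # assumption: match the pyspark version to the Homebrew apache-spark version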

Setup environment variables and path

I'm using zsh; if you're using bash instead, edit ~/.bashrc:

vim ~/.zshrc

Add the following to .zshrc:

export HOMEBREW_OPT="/opt/homebrew/opt"
export JAVA_HOME="$HOMEBREW_OPT/openjdk"
export SPARK_HOME="$HOMEBREW_OPT/apache-spark/libexec"
export PATH="$JAVA_HOME/bin:$SPARK_HOME/bin:$PATH"

Reload the shell with the new settings, or just restart the terminal:

source ~/.zshrc
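
To sanity-check the setup, print the versions and run a tiny local job through spark-submit. The /tmp/smoke_test.py path and script are just an illustration, not part of any project:

java -version
spark-submit --version

cat > /tmp/smoke_test.py <<'EOF'
from pyspark.sql import SparkSession

# start a local SparkSession and run a trivial job
spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
print(spark.range(100).count())  # should print 100
spark.stop()
EOF

spark-submit /tmp/smoke_test.py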

Errors

For some tests, I was getting:

Error 23:  Too many open files in system

You can check the kernel's current file limits:

sysctl kern.maxfiles
sysctl kern.maxfilesperproc

To resolve this error, I increased kern.maxfiles and kern.maxfilesperproc:

sudo sysctl -w kern.maxfiles=20480
sudo sysctl -w kern.maxfilesperproc=18000

These lines could be added to ~/.zshrc or ~/.bashrc once the appropriate values are determined for your project. The downside is that it requires sudo each time a shell starts. Alternatively, the limits can be made persistent through system configuration.
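
If you do put the sysctl calls in shell startup, a small guard avoids an unnecessary sudo prompt when the limit is already high enough. The values here are only examples:

# only raise the kernel limits if the current value is below the target
if [ "$(sysctl -n kern.maxfiles)" -lt 20480 ]; then
  sudo sysctl -w kern.maxfiles=20480
  sudo sysctl -w kern.maxfilesperproc=18000
fi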

Summary

I just wanted to see if it could be done on an M1 MacBook Air (7 cores, 8 GB of RAM).

Running the 306 tests took 10m07s, with two failures: one involving an ARIMA model and the other a datetime format assertion error.

For comparison, the same suite under WSL on a Dell 5550 with an Intel Core i9-10885H @ 2.40GHz and 32 GB of RAM:

time pytest -n 8 --dist=loadscope .
...
==================================== 306 passed, 526 warnings in 535.95s (0:08:55) =====================================

real    8m56.466s
user    13m29.008s
sys     37m3.325s
