Spark 2.0.1 on YARN 2.7.3 Configuration

* There are some basics you must understand before running a Spark application on Hadoop YARN.
(At first I just copied configurations from some site as-is, which produced an enormous amount of troubleshooting...)

* Caution: when running Spark on YARN, Spark properties are often omitted so that their default values apply,
but depending on the situation an omitted property can end up mapped to a Hadoop property value instead, so it is safer to set them all explicitly.
(e.g. when running Spark in yarn-client mode, if spark.yarn.am.memory (default 512m) is omitted and
yarn.app.mapreduce.am.resource.mb is set to 1536, the AM memory becomes 1536m.)
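As an illustration of setting everything explicitly, a spark-submit invocation might pin the AM, driver, and executor resources on the command line (the sizes here are examples, not recommendations; the examples jar path assumes the Spark 2.0.1 binary distribution layout):

```shell
# Pin resources explicitly so no value silently falls back to a
# Hadoop-side default (sizes are illustrative).
spark-submit \
  --master yarn \
  --deploy-mode client \
  --conf spark.yarn.am.memory=512m \
  --conf spark.driver.memory=2g \
  --conf spark.executor.memory=4g \
  --conf spark.executor.cores=2 \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.0.1.jar 100
```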

* Server specs
  •  Server 1 (hostname: server1): master & slave / total memory: 128G / total cores: 12
  •  Server 2 (hostname: server2): slave / total memory: 128G / total cores: 14
* If you copy the configuration below to both server1 and server2 as-is, the properties resolve correctly for both the master and slave roles.
* If the number of slave servers grows later, only the NodeManager-related properties need changing.
* A word of advice: studying terms such as ResourceManager, NodeManager, NameNode, DataNode, and the AM (ApplicationMaster) created whenever an app runs will pay off when handling Hadoop and Spark. Even if you install a Hadoop distribution such as Cloudera, you still end up handling the configuration below, so the study is worth it. There is no free lunch...

* Hadoop
masters
server1

slaves
server1
server2


core-site.xml

<configuration>
<!-- Hadoop basic config -->
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hduser/hadoop_store/tmp</value>
<description>A base for other temporary directories.(default:/tmp/hadoop-${user.name})</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://server1:54310</value>
<description>The name of the default file system.(default:file:///)</description>
</property>

<!-- Only Oozie Config -->
<property>
<name>hadoop.proxyuser.hduser.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.hduser.groups</name>
<value>*</value>
</property>
</configuration>


hdfs-site.xml
<configuration>
<!-- Dfs default Config -->
<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default block replication.(default:3)</description>
</property>
<property>
<name>dfs.secondary.http.address</name>
<value>server1:50090</value>
<description>SecondaryNameNode WebServer URL</description>
</property>

<!-- Namenode Config -->
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/hduser/hadoop_store/hdfs/namenode</value>
<description>Determines where on the local filesystem the DFS name node should store the name table(fsimage).
(default:file://${hadoop.tmp.dir}/dfs/name)</description>
</property>
<property>
<name>dfs.http.address</name>
<value>server1:50070</value>
<description>NameNode Admin URL</description>
</property>
<!-- Datanode Config -->
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/hduser/hadoop_store/hdfs/datanode</value>
<description>Determines where on the local filesystem an DFS data node should store its blocks.
(default:file://${hadoop.tmp.dir}/dfs/data)</description>
</property>
<!-- WebHDFS Config -->
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>
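One-time setup worth noting: the local directories referenced above must exist on every node, and the NameNode must be formatted before the first start. A sketch (run the format step on server1 only, once — formatting an existing NameNode wipes HDFS metadata):

```shell
# Create the local storage directories referenced in core-site.xml / hdfs-site.xml
mkdir -p /home/hduser/hadoop_store/tmp \
         /home/hduser/hadoop_store/hdfs/namenode \
         /home/hduser/hadoop_store/hdfs/datanode

# Run on server1 only, once, before the very first start-all.sh:
hdfs namenode -format
```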


yarn-site.xml
<configuration>
<!-- Configurations for NodeManager -->
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>102400</value>
<description>Resource i.e. available physical memory, in MB, for given NodeManager. (default:8192)</description>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>10</value>
<description>Number of vcores that can be allocated for containers. (default:8)</description>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>Whether virtual memory limits will be enforced for containers. (default:true)</description>
</property>
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
<description>Whether physical memory limits will be enforced for containers. (default:true)</description>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
<description>Maximum ratio by which virtual memory usage of tasks may exceed physical memory. (default:2.1)</description>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle,spark_shuffle</value>
<description>Auxiliary services run by the NodeManager; spark_shuffle is required for Spark dynamic allocation. (default: none)</description>
</property>
<property>
<name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
<description>Class implementing the spark_shuffle auxiliary service (Spark's external shuffle service, needed for dynamic allocation).</description>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/home/hduser/hadoop_store/nmlocal</value>
<description>Comma-separated list of paths on the local filesystem where intermediate data is written. (default:${hadoop.tmp.dir}/nm-local-dir)</description>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>/home/hduser/hadoop_store/nmlogs</value>
<description>Comma-separated list of paths on the local filesystem where logs are written. (default:${yarn.log.dir}/userlogs)</description>
</property>
<property>
<name>yarn.nodemanager.log.retain-seconds</name>
<value>10800</value>
<description>Default time (in seconds) to retain log files on the NodeManager. Only applicable if log-aggregation is disabled. (default:10800)</description>
</property>
<property>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/app-logs</value>
<description>HDFS directory where the application logs are moved on application completion. Need to set appropriate permissions.
Only applicable if log-aggregation is enabled. (default:/tmp/logs)</description>
<property>
<name>yarn.nodemanager.remote-app-log-dir-suffix</name>
<value>logs</value>
<description>Suffix appended to the remote log dir, ${yarn.nodemanager.remote-app-log-dir}/${user}/${thisParam}. (default:logs)</description>
</property>


<!-- Configurations for Yarn Scheduler -->
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
<description>Minimum limit of memory to allocate to each container request at the Resource Manager, in MBs. (default:1024)</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>204800</value>
<description>Maximum limit of memory to allocate to each container request at the Resource Manager, in MBs. (default:8192)</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
<description>The minimum allocation for every container request at the RM, in terms of virtual CPU cores. (default:1)</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>24</value>
<description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. (default:32)</description>
</property>


<!-- Auto allocation for ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>server1</value>
<description>The hostname of the RM. (default:0.0.0.0)</description>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>server1:8032</value>
<description>ResourceManager host:port for clients to submit jobs. (default:${yarn.resourcemanager.hostname}:8032)</description>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>server1:8030</value>
<description>ResourceManager host:port for ApplicationMasters to talk to Scheduler to obtain resources.
(default:${yarn.resourcemanager.hostname}:8030)</description>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>server1:8031</value>
<description>ResourceManager host:port for NodeManagers. (default:${yarn.resourcemanager.hostname}:8031)</description>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>server1:8088</value>
<description>ResourceManager web-ui host:port. (default:${yarn.resourcemanager.hostname}:8088)</description>
</property>

<!-- Configurations for Log -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>259200</value>
<description>How long to keep aggregation logs. Used by History Server.</description>
</property>
<property>
<name>yarn.log-aggregation.retain-check-interval-seconds</name>
<value>3600</value>
<description>Time between checks for aggregated log retention. Used by History Server.</description>
</property>
</configuration>
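Declaring the spark_shuffle aux-service above is not enough on its own: each NodeManager must also have Spark's YARN shuffle jar on its classpath, or it will fail to start the service. A sketch, assuming the Spark 2.0.1 binary distribution layout (the jar ships in the distribution's yarn/ directory):

```shell
# Put Spark's external shuffle service jar on the NodeManager classpath
cp $SPARK_HOME/yarn/spark-2.0.1-yarn-shuffle.jar \
   $HADOOP_HOME/share/hadoop/yarn/lib/

# Restart NodeManagers afterwards so the spark_shuffle service is picked up
```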


mapred-site.xml
<configuration>
<!-- Configurations for MapReduce Applications -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.map.cpu.vcores</name>
<value>1</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
<description>yarn.scheduler.minimum-allocation-mb * 2.
The amount of memory to request from the scheduler for each map task.(default:1024)</description>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1638m</value>
<description>mapreduce.map.memory.mb x 80%. </description>
</property>
<property>
<name>mapreduce.reduce.cpu.vcores</name>
<value>1</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>4096</value>
<description>yarn.scheduler.minimum-allocation-mb * 4.
The amount of memory to request from the scheduler for each reduce task.(default:1024)</description>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx3276m</value>
<description>mapreduce.reduce.memory.mb x 80%. </description>
</property>

<!-- MapReduce ApplicationMaster CONFIGURATION -->
<property>
<name>yarn.app.mapreduce.am.resource.cpu-vcores</name>
<value>1</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>1536</value>
<description>The amount of memory the MR AppMaster needs.(default:1536)</description>
</property>
<property>
<name>yarn.app.mapreduce.am.command-opts</name>
<value>-Xmx1024m</value>
</property>
<property>
<name>yarn.app.mapreduce.am.staging-dir</name>
<value>/yarn.staging</value>
<description>The staging dir used while submitting jobs.(default:/tmp/hadoop-yarn/staging)</description>
</property>

<!-- Configurations for MapReduce JobHistory Server -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>server1:10020</value>
<description>MapReduce JobHistory Server host:port. (default:10020)</description>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>server1:19888</value>
<description>MapReduce JobHistory Server Web UI host:port. (default:19888)</description>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx2048m</value>
<description>Java opts for the task tracker child processes. (default:-Xmx200m)</description>
</property>
</configuration>
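The java.opts values above follow the rule of thumb that the JVM -Xmx should be about 80% of the container size, leaving headroom for off-heap memory. The arithmetic, checked in shell:

```shell
# -Xmx ≈ 80% of the YARN container size (leave ~20% for off-heap use)
echo $(( 2048 * 80 / 100 ))   # 1638 -> -Xmx1638m for map tasks
echo $(( 4096 * 80 / 100 ))   # 3276 -> -Xmx3276m for reduce tasks
```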


capacity-scheduler.xml modifications (ref. http://tobby48.egloos.com/4411908)
<property>
<name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
<value>0.5</value>
<description>
Maximum percent of resources in the cluster which can be used to run
application masters i.e. controls number of concurrent running
applications.
</description>
</property>
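Since maximum-am-resource-percent bounds the share of cluster memory that ApplicationMasters may hold, it effectively caps the number of concurrently running applications. A rough estimate for this two-node cluster, using the numbers from yarn-site.xml and mapred-site.xml above:

```shell
# max concurrent apps ≈ (cluster memory × am-percent) / AM container size
# 2 NodeManagers × 102400 MB, 50% reserved for AMs, 1536 MB per MR AM:
echo $(( 2 * 102400 * 50 / 100 / 1536 ))   # 66
```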



* Spark

spark-defaults.conf


# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.
#####################################################################################################################################################################
# SPARK event & default settings
#####################################################################################################################################################################
spark.eventLog.enabled true
spark.eventLog.dir hdfs://server1:54310/spark-history
spark.eventLog.compress true
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.allowMultipleContexts true
spark.kryoserializer.buffer 512m
spark.kryoserializer.buffer.max 1024m

#####################################################################################################################################################################
# App Cluster mode
#####################################################################################################################################################################
# spark.driver.extraClassPath /home/hduser/xxx.jar
# spark.executor.extraClassPath /home/hduser/xxx.jar
# spark.yarn.jars hdfs://xxx.xxx.xxx.xxx:54310/user/hduser/xxx.jar

#####################################################################################################################################################################
# Memory
#####################################################################################################################################################################
spark.memory.fraction 0.6
spark.memory.storageFraction 0.5

#####################################################################################################################################################################
# Dynamic Resource
#####################################################################################################################################################################
# need
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.dynamicAllocation.minExecutors 3
spark.dynamicAllocation.maxExecutors 6
# spark.dynamicAllocation.initialExecutors 1
# optional
spark.dynamicAllocation.executorIdleTimeout 60s
spark.dynamicAllocation.schedulerBacklogTimeout 5s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout 5s

#####################################################################################################################################################################
# Tuning
#####################################################################################################################################################################
spark.driver.maxResultSize 0
spark.sql.broadcastTimeout 1200
spark.network.timeout 700
spark.sql.join.preferSortMergeJoin false

#####################################################################################################################################################################
# SPARK log settings
#####################################################################################################################################################################
spark.executor.extraJavaOptions -Dlog4j.configuration=file:/home/hduser/spark-2.0.1-bin-hadoop2.7/conf/log4j.properties
spark.driver.extraJavaOptions -Dlog4j.configuration=file:/home/hduser/spark-2.0.1-bin-hadoop2.7/conf/log4j.properties

#####################################################################################################################################################################
#YARN default settings (ignore if not using YARN)
#####################################################################################################################################################################
spark.yarn.submit.file.replication 1
spark.yarn.preserve.staging.files false
spark.yarn.scheduler.heartbeat.interval-ms 5000
spark.yarn.queue default
spark.yarn.containerLauncherMaxThreads 25
spark.yarn.submit.waitAppCompletion true
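With dynamicAllocation and the external shuffle service enabled above, a job can be submitted without a fixed --num-executors; YARN scales executors between minExecutors (3) and maxExecutors (6) as the backlog grows and shrinks. A sketch, again assuming the Spark 2.0.1 examples jar path:

```shell
# No --num-executors: dynamic allocation decides the executor count
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.0.1.jar 1000
```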



spark-env.sh
#!/usr/bin/env bash

#####################################################################################################################################################################
#SPARK global settings
#####################################################################################################################################################################
# GLOBAL
export JAVA_HOME=/home/hduser/jdk1.8.0_73
export SCALA_HOME=/home/hduser/scala-2.11.8
export HADOOP_HOME=/home/hduser/hadoop-2.7.3
export SPARK_HOME=/home/hduser/spark-2.0.1-bin-hadoop2.7
export SPARK_SCALA_VERSION=2.11

export SPARK_DAEMON_MEMORY=4g

# JOB HISTORY
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18088 -Dspark.history.retainedApplications=50 -Dspark.history.fs.logDirectory=hdfs://server1:54310/spark-history"

#####################################################################################################################################################################
#SPARK on YARN
#####################################################################################################################################################################
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
#export SPARK_JAR=hdfs://xxx.xxx.xxx.xxx:54310/user/hduser/share/lib/spark/spark-assembly-1.6.0-hadoop2.6.0.jar
#export SPARK_YARN_USER_ENV="JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH,LD_LIBRARY_PATH=$LD_LIBRARY_PATH"

#export SPARK_DIST_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)
#export SPARK_DIST_CLASSPATH="$SPARK_DIST_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*"
#####################################################################################################################################################################
#SPARK classpath settings(hadoop & spark)
#####################################################################################################################################################################



* Start Hadoop
start-all.sh
* Start the Hadoop job history server
mr-jobhistory-daemon.sh start historyserver
* Start the Spark history server
/home/hduser/spark-2.0.1-bin-hadoop2.7/sbin/start-history-server.sh
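Before the first application runs, the HDFS directories referenced by the configs above must exist: spark.eventLog.dir and the history server's logDirectory both point at /spark-history, and YARN log aggregation moves logs into yarn.nodemanager.remote-app-log-dir (/app-logs). One way to create them (run once, as the hduser account):

```shell
# Create the HDFS directories referenced by spark-defaults.conf and yarn-site.xml
hdfs dfs -mkdir -p /spark-history
hdfs dfs -mkdir -p /app-logs
hdfs dfs -chmod 1777 /app-logs   # writable by all apps, sticky bit set
```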


* Running jps on each server then shows (red: master role, blue: slave role):
  • Server 1: ResourceManager, SecondaryNameNode, NameNode, DataNode, NodeManager, JobHistoryServer, HistoryServer
  • Server 2: DataNode, NodeManager

* JobHistoryServer backs the Hadoop YARN job monitoring UI reachable through port 8088; HistoryServer is the Spark UI.

* NameNode and DataNode handle HDFS; the NodeManagers are where the YARN containers that actually run the Spark app are placed.

