Purpose

Set up the High Availability mechanism for the HDFS namenode and the Yarn resourcemanager through zookeeper.

  • 3-node architecture
    hadoop_day1_1.jpg
  • 5-node architecture
    hadoop_day1_2.jpg

Integration steps:

1.install & config Zookeeper
2.Hadoop with Zookeeper
3.HDFS Namenode HA by Zookeeper
4.Yarn Resourcemanager HA by Zookeeper
5.Initializing HA state in ZooKeeper
6.Start HA cluster

After the original active node goes down, see the recovery steps in:

1.restart failed active machine

For problems encountered so far, see:

1.Troubleshooting

install & config Zookeeper

Download ZooKeeper from the zookeeper download page and extract it. Under conf, copy zoo_sample.cfg to zoo.cfg.

  • Edit zoo.cfg
    Set the directory where the data is stored:
    dataDir=/opt/zookeeper-3.4.13/data  
    

    Add the zookeeper cluster servers:

    server.1=dmpn1:2888:3888
    server.2=dmpn2:2888:3888
    server.3=dmpn3:2888:3888
    

    Full example:

    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/opt/zookeeper-3.4.13/data
    clientPort=2181
    server.1=dmpn1:2888:3888
    server.2=dmpn2:2888:3888
    server.3=dmpn3:2888:3888
    
  • Create the myid file: under the dataDir configured in zoo.cfg, create a file named myid
    dataDir=/opt/zookeeper-3.4.13/data
    

    The content of myid follows the server.N entries in zoo.cfg:

    server.1=dmpn1:2888:3888
    server.2=dmpn2:2888:3888
    server.3=dmpn3:2888:3888
    

    Write 1 on the first machine, 2 on the second, and 3 on the third.
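
    A minimal sketch of creating the file, assuming the dataDir above; run it on each host with that host's own id:

    mkdir -p /opt/zookeeper-3.4.13/data
    echo 1 > /opt/zookeeper-3.4.13/data/myid   # 1 on dmpn1; use 2 on dmpn2 and 3 on dmpn3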

Hadoop with Zookeeper

Edit core-site.xml (/opt/hadoop-2.9.0/etc/hadoop):

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://dmpcluster</value>
</property>
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/opt/zookeeper-3.4.13/journaldata</value>
</property>
<property>
   <name>ha.zookeeper.quorum</name>
   <value>dmpn1:2181,dmpn2:2181,dmpn3:2181</value>
</property>

Note :

  • At least 3 JournalNodes should run, and always an odd number (3, 5, 7, ...); with N JournalNodes the system can tolerate at most (N-1)/2 of them failing. JournalNodes play a role somewhat like the secondary namenode in a non-HA setup.
  • In an HA cluster the standby Namenode already checkpoints the namespace state, so there is no need to run a Secondary NameNode, CheckpointNode, or BackupNode.
  • Automatic Namenode failover is driven by the ZKFC (DFSZKFailoverController). The ZKFC periodically reports the Namenode's health to ZooKeeper; its HealthMonitor watches the local Namenode, and its ActiveStandbyElector watches the ZooKeeper znode.
  • When a Namenode successfully becomes Active, ZooKeeper creates a znode holding some information about the current Active Namenode. When the Active Namenode goes down, that znode is deleted, which triggers the election of the next Active Namenode. ZooKeeper thus guarantees that at most one node can successfully create the znode and become the current Active Namenode.
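
To look at that znode directly, the ZooKeeper CLI can be used. A small sketch, assuming the quorum above and the /hadoop-ha/<nameservice> path that HDFS automatic failover uses by default:

zkCli.sh -server dmpn1:2181
ls /hadoop-ha/dmpcluster    # run at the zk prompt; while an active namenode is elected it should list ActiveBreadCrumb and ActiveStandbyElectorLock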

HDFS Namenode HA by Zookeeper

Edit hdfs-site.xml (/opt/hadoop-2.9.0/etc/hadoop):

<property>
  <name>dfs.nameservices</name>
  <value>dmpcluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.dmpcluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.dmpcluster.nn1</name>
  <value>dmpn1:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.dmpcluster.nn2</name>
  <value>dmpn2:8020</value>
</property>
<property>
  <name>dfs.namenode.http-address.dmpcluster.nn1</name>
  <value>dmpn1:50070</value>
</property>
<property>
  <name>dfs.namenode.http-address.dmpcluster.nn2</name>
  <value>dmpn2:50070</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://dmpn1:8485;dmpn2:8485;dmpn3:8485/dmpcluster</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.dmpcluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/miuser/.ssh/id_rsa</value>
</property>
<property>
   <name>dfs.ha.automatic-failover.enabled</name>
   <value>true</value>
</property>

Fencing was originally configured as below. Failover worked when the namenode process was killed with kill -9 {namenodePID}, but not when the whole machine was shut down:

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>

So it was changed to:

<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/bin/true)</value>
</property>

The dfs.namenode.shared.edits.dir parameter configures how the two namenodes keep their metadata in sync: the active namenode keeps writing edits, while the standby keeps reading them to update its own metadata. Here the Quorum Journal Manager is used:

<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://dmpn1:8485;dmpn2:8485;dmpn3:8485/dmpcluster</value>
</property>

Another option is to sync the metadata over NFS (Network File System), so that both namenodes read the same shared directory:

<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>file:///mnt/filer1/dfs/ha-name-dir-shared</value>
</property>

When dfs.ha.automatic-failover.enabled is set to true, start-dfs.sh or start-all.sh will also bring up the zkfc daemon (DFSZKFailoverController) automatically:

<property>
   <name>dfs.ha.automatic-failover.enabled</name>
   <value>true</value>
</property>

If it is not started automatically, it can be started by hand; it only needs to run on the two Namenode machines:

hadoop-daemon.sh start zkfc
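
To confirm the controller is actually running on both namenode hosts, check with jps:

jps | grep DFSZKFailoverController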

Yarn Resourcemanager HA by Zookeeper

Edit yarn-site.xml:

<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>dmpcluster</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>dmpn1</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>dmpn2</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address.rm1</name>
  <value>dmpn1:8088</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address.rm2</name>
  <value>dmpn2:8088</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address.rm1</name>
  <value>dmpn1:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address.rm2</name>
  <value>dmpn2:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
  <value>dmpn1:8031</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address.rm2</name>
  <value>dmpn2:8031</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address.rm1</name>
  <value>dmpn1:8033</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address.rm2</name>
  <value>dmpn2:8033</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>dmpn1:2181,dmpn2:2181,dmpn3:2181</value>
</property>
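
Once both resourcemanagers are running, their HA state can also be checked over the web addresses configured above; a sketch (note that, depending on the Hadoop version, a request to the standby RM may be redirected to the active one):

curl -s http://dmpn1:8088/ws/v1/cluster/info    # the returned JSON contains "haState": "ACTIVE" or "STANDBY"
curl -s http://dmpn2:8088/ws/v1/cluster/info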

Initializing HA state in ZooKeeper

Remember to start ZooKeeper first, otherwise the znode cannot be created.

hdfs zkfc -formatZK

If this step fails with an error like the one below, it is usually a firewall or /etc/hosts problem, so check those two things first:

Unable to start failover controller. Unable to connect to ZooKeeper quorum at node1:2181,node2:2181,node3:2181.
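
A quick sketch for checking both suspects from a namenode host, assuming the quorum hosts above (ruok is a ZooKeeper four-letter command that a healthy server answers with imok; it is enabled by default on 3.4.x):

getent hosts dmpn1 dmpn2 dmpn3                 # every quorum host should resolve via /etc/hosts or DNS
echo ruok | nc dmpn1 2181                      # should print imok if port 2181 is reachable
sudo firewall-cmd --zone=public --list-ports   # CentOS 7: confirm 2181, 2888 and 3888 are open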

Start HA cluster

1.Start zookeeper on every machine (dn1,dn2,dn3)

zkServer.sh start

2.Start the journalnode on every machine (dn1,dn2,dn3), otherwise hdfs namenode -format will fail because port 8485 is unreachable

hadoop-daemon.sh start journalnode

3.On dn1, format the namenode

hdfs namenode -format

4.Start the cluster on dn1

start-all.sh

5.On dn2, bootstrap the standby namenode

hdfs namenode -bootstrapStandby

6.Then start the standby namenode on dn2

hadoop-daemon.sh start namenode

7.Start the Yarn Resourcemanager on dn2

yarn-daemon.sh start resourcemanager
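
Once everything is up, a quick sanity check (the same haadmin/rmadmin commands are also listed in the command cheat sheet below):

jps                                  # each node should show its expected daemons (NameNode, DataNode, JournalNode, DFSZKFailoverController, ResourceManager, NodeManager)
hdfs haadmin -getServiceState nn1    # one of nn1/nn2 should be active and the other standby
yarn rmadmin -getServiceState rm1    # likewise for rm1/rm2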

restart failed active machine

Steps to restart and recover the previously failed active namenode (dn1):

1.Start zookeeper

./zkServer.sh start

2.Start the DFSZKFailoverController

hadoop-daemon.sh start zkfc

3.Start the journalnode

hadoop-daemon.sh start journalnode

4.Start the namenode

hadoop-daemon.sh start namenode

5.Start the datanode

hadoop-daemon.sh start datanode

6.Start the resourcemanager

yarn-daemon.sh start resourcemanager

7.Start the nodemanager

yarn-daemon.sh start nodemanager

8.Check the namenode active / standby states

hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

9.Check the resourcemanager active / standby states

yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2

10.Confirm that reading (cat) an HDFS file works

hdfs dfs -cat /temp/wordcount/workcount.txt

11.Test that spark-submit in yarn cluster mode works

spark-submit --class com.example.job.WordCount --master yarn --deploy-mode cluster wordcount.jar /temp/wordcount/workcount.txt /temp/data

Troubleshooting

[Problem] The HDFS data on the two HA namenodes is inconsistent, or after the active goes down and the standby takes over, hdfs cat on a file reports the following error:

2018-10-02 15:14:02 WARN  DFSClient:981 - Could not obtain block: BP-106148578-192.168.3.13-1538386246613:blk_1073741825_1001 file=/temp/wordcount/workcount.txt No live nodes contain current block Block locations: Dead nodes: . Throwing a BlockMissingException
2018-10-02 15:14:02 WARN  DFSClient:907 - DFS Read
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-106148578-192.168.3.13-1538386246613:blk_1073741825_1001 file=/temp/wordcount/workcount.txt
  at org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:984)
  at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:642)
  at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
  at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
  at java.io.DataInputStream.read(DataInputStream.java:149)
  at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:62)
  at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216

Check the metadata sync configuration between the two namenodes.
If JournalNodes are used, check with jps that they are running.
If NFS is used, confirm that both namenodes mount and read the same directory.

  • Check hdfs-site.xml
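
A short sketch of those checks, assuming the QJM setup and paths used above (the last line only applies to the NFS variant):

jps | grep JournalNode                          # should show up on dmpn1, dmpn2 and dmpn3
grep -A1 dfs.namenode.shared.edits.dir /opt/hadoop-2.9.0/etc/hadoop/hdfs-site.xml
mount | grep filer1                             # NFS variant only: both namenodes must see the same mount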

[Problem] During HA failover the datanodes seem to die, or errors like the following appear:

2018-09-27 19:18:29,634 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to add storage directory [DISK]file:/data/hadoop/dfs/data/
java.io.IOException: Incompatible clusterIDs in /data/hadoop/dfs/data: namenode clusterID = CID-92ddbfd6-6a07-454b-8848-22f9f15d2621; datanode clusterID = CID-4f03e59a-ea80-42a5-8a7d-a6db94c0cc19
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:760)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadStorageDirectory(DataStorage.java:293)

or

2018-09-27 19:18:17,387 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception loading fsimage
java.io.FileNotFoundException: /data/hadoop/dfs/name/current/VERSION (Permission denied)
        at java.io.RandomAccessFile.open0(Native Method)
        at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
        at org.apache.hadoop.hdfs.server.common.StorageInfo.readPropertiesFile(StorageInfo.java:244)
        at org.apache.hadoop.hdfs.server.namenode.NNStorage.readProperties(NNStorage.java:650)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:381)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:220)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1048)

The HA setup was out of sync: the standby namenode was probably formatted with namenode -format instead of being bootstrapped with bootstrapStandby, so the namenode and datanode cluster IDs (the VERSION files) no longer match.
1.Stop all services first, then clear the temporary data on every machine:

sudo rm -fr /data/hadoop/dfs/name/current
sudo rm -fr /data/hadoop/dfs/data/current
rm -fr /tmp/hadoop*

2.Start the journalnode

hadoop-daemon.sh start journalnode

3.Format the active namenode:

hdfs namenode -format

4.Start the services on the active machine

start-all.sh

5.On the standby, run bootstrapStandby

hdfs namenode -bootstrapStandby

6.Then start the standby namenode

./hadoop-daemon.sh start namenode

Changing the path of the hadoop temp directory

Add the desired path in core-site.xml (the default is /tmp/hadoop-${user.name}):

<property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop-2.9.0/tmp/hadoop/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
</property>
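
If a non-default path like the one above is used, it should exist and be writable by the account that runs the Hadoop daemons; a hedged sketch, assuming that account is miuser:

sudo mkdir -p /opt/hadoop-2.9.0/tmp
sudo chown -R miuser:miuser /opt/hadoop-2.9.0/tmp   # adjust the owner/group if the daemons run as a different account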

[Problem] The following error appears:

2018-11-27 12:10:38,505 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to add storage directory [DISK]file:/data/hadoop/dfs/data/
java.io.FileNotFoundException: /data/hadoop/dfs/data/in_use.lock (Permission denied)
        at java.io.RandomAccessFile.open0(Native Method)
        at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.tryLock(Storage.java:782)
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:754)
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:567)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadStorageDirectory(DataStorage.java:270)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadDataStorage(DataStorage.java:409)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:388)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:556)

[Fix]:
Delete in_use.lock:

sudo rm -fr /data/hadoop/dfs/data/in_use.lock
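
The Permission denied usually means the data directory was last written by a different account (for example a datanode previously started as root), so it may also be worth checking and fixing ownership; a hedged sketch, assuming the datanode should run as miuser:

ls -ld /data/hadoop/dfs/data                          # see which user currently owns the directory
sudo chown -R miuser:miuser /data/hadoop/dfs/data     # adjust to the account that actually runs the datanode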

[Problem] The datanode will not start

java.net.BindException: Problem binding to [0.0.0.0:50010] java.net.BindException: Address already in use; For more details see:  http://wiki.apache.org/hadoop/BindException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:824)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:735)
        at org.apache.hadoop.ipc.Server.bind(Server.java:561)
        at org.apache.hadoop.ipc.Server.bind(Server.java:533)
        at org.apache.hadoop.hdfs.net.TcpPeerServer.<init>(TcpPeerServer.java:52)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initDataXceiver(DataNode.java:1116)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:1369)

[Fix]: Under the original user (miuser), none of the following showed which process had bound port 50010:

netstat -anp | grep 50010
netstat -ano | grep 50010
ps -aux | grep 50010
pgrep -f datanode
pgrep -f journalnode

After switching to another user (yarn), a process occupying that port was found:

yarn@dmpn5:~>netstat -ano | grep 50010
tcp        0      0 0.0.0.0:50010           0.0.0.0:*               LISTEN      off (0.00/0/0)

Find the datanode-related PID:

yarn@dmpn5:~>pgrep -f datanode
159789

Kill the process occupying the port:

yarn@dmpn5:~>ps -aux | grep 159789
yarn      61224  0.0  0.0 112616   732 pts/1    S+   12:45   0:00 grep --color=auto 159789
root     159789  0.1  0.3 3039556 878360 ?      Sl   May15 464:44 /usr/lib/jvm/jdk1.8.0_151//bin/java -Dproc_datanode -Xmx1000m -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop-2.9.0/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/opt/hadoop-2.9.0 -Dhadoop.id.str=root -Dhadoop.root.logger=INFO,console -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true -Dhadoop.log.dir=/opt/hadoop-2.9.0/logs -Dhadoop.log.file=hadoop-root-datanode-dmpn5.log -Dhadoop.home.dir=/opt/hadoop-2.9.0 -Dhadoop.id.str=root -Dhadoop.root.logger=INFO,RFA -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -server -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=ERROR,RFAS -Dhadoop.security.logger=INFO,RFAS org.apache.hadoop.hdfs.server.datanode.DataNode

sudo kill -9 159789

If the external shuffle service is used, remember to:

Put spark-2.1.1-yarn-shuffle.jar into the yarn lib folder.

cp <SPARK_HOME>/yarn/spark-<version>-yarn-shuffle.jar /opt/hadoop-2.7.3/share/hadoop/yarn/lib/

[Problem] The namenode goes down

2019-04-09 19:12:29,121 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 17015 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [192.168.6.31:8485]
2019-04-09 19:12:30,122 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 18016 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [192.168.6.31:8485]
2019-04-09 19:12:31,123 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Waited 19017 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [192.168.6.31:8485]
2019-04-09 19:12:32,107 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [192.168.6.31:8485, 192.168.6.32:8485, 192.168.6.33:8485], stream=QuorumOutputStream starting at txid 104137))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
        at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
        at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
        at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:533)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:393)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
        at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:529)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:707)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.run(FSEditLogAsync.java:188)
        at java.lang.Thread.run(Thread.java:748)
2019-04-09 19:12:32,108 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 104137

This tends to happen during heavy HDFS operations or when there are network problems.

[Fix]: Edit hdfs-site.xml

<property>
        <name>dfs.qjournal.write-txns.timeout.ms</name>
        <value>60000</value>
</property>
<property>
        <name>dfs.qjournal.start-segment.timeout.ms</name>
        <value>60000</value>
</property>
<property>
        <name>dfs.qjournal.select-input-streams.timeout.ms</name>
        <value>60000</value>
</property>

Edit core-site.xml

<property>
    <name>ipc.client.connect.timeout</name>
    <value>60000</value>
</property>

Command cheat sheet

  • Check the status of all nodes
    hdfs dfsadmin -report
    
  • Check whether zookeeper is the leader or a follower
    zkServer.sh status
    
  • Check which ports are listening
    lsof -i -P -n | grep LISTEN
    sudo netstat -plten | grep LISTEN -> requires root
    
  • Start, stop, and check the firewall status (CentOS 7)
    systemctl status firewalld
    systemctl start firewalld
    systemctl stop firewalld
    
  • List the ports currently opened in the firewall
    sudo firewall-cmd --zone=public --list-ports
    
  • Open a firewall port and reload
    sudo firewall-cmd --zone=public --permanent --add-port=8088/tcp
    firewall-cmd --reload
    
  • Generate an RSA key and copy it to each machine (passwordless ssh login)
    su - muser
    ssh-keygen -t rsa
    ssh-copy-id -i ~/.ssh/id_rsa.pub muser@dn1
    ssh-copy-id -i ~/.ssh/id_rsa.pub muser@dn2
    ssh-copy-id -i ~/.ssh/id_rsa.pub muser@dn3
    
  • hdfs HA commands
    hdfs haadmin -failover nn1 nn2
    hdfs haadmin -transitionToActive --forcemanual nn1
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -transitionToActive --forcemanual --forceactive nn1
    hdfs haadmin -transitionToStandby --forcemanual nn2
    
  • yarn HA commands
    yarn rmadmin -getServiceState rm1
    yarn rmadmin -getServiceState rm2
    
  • yarn service commands
    yarn-daemon.sh start resourcemanager
    yarn-daemons.sh start nodemanager
    yarn resourcemanager -format-state-store
    
  • HDFS service commands
    hadoop-daemon.sh start namenode
    hadoop-daemon.sh start datanode
    hadoop-daemon.sh start journalnode
    hadoop-daemon.sh start zkfc
    

Actual settings used (for reference)

  • core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://dmpcluster</value>
    </property>
    <property>
        <name>dfs.journalnode.edits.dir</name>
        <value>/opt/zookeeper-3.4.13/journaldata</value>
    </property>
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>dmpn1:2181,dmpn2:2181,dmpn3:2181</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
    <property>
        <name>fs.trash.interval</name>
        <value>1440</value>
    </property>
</configuration>
  • hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
          <name>dfs.name.dir</name>
          <value>file:///data/hadoop/dfs/name</value>
          <final>true</final>
    </property>
    <property>
          <name>dfs.data.dir</name>
          <value>file:///data/hadoop/dfs/data</value>
          <final>true</final>
    </property>
    <property>
          <name>dfs.replication</name>
          <value>3</value>
    </property>
    <property>
          <name>dfs.blocksize</name>
          <value>134217728</value>
    </property>
    <property>
          <name>dfs.namenode.checkpoint.dir</name>
          <value>file:///data/def/namesecondary</value>
    </property>
    <property>
          <name>dfs.nameservices</name>
          <value>dmpcluster</value>
    </property>
    <property>
          <name>dfs.ha.namenodes.dmpcluster</name>
          <value>nn1,nn2</value>
    </property>
    <property>
          <name>dfs.namenode.rpc-address.dmpcluster.nn1</name>
          <value>dmpn1:8020</value>
    </property>
    <property>
          <name>dfs.namenode.rpc-address.dmpcluster.nn2</name>
          <value>dmpn2:8020</value>
    </property>
    <property>
          <name>dfs.namenode.http-address.dmpcluster.nn1</name>
          <value>dmpn1:50070</value>
    </property>
    <property>
          <name>dfs.namenode.http-address.dmpcluster.nn2</name>
          <value>dmpn2:50070</value>
    </property>
    <property>
          <name>dfs.namenode.shared.edits.dir</name>
          <value>qjournal://dmpn1:8485;dmpn2:8485;dmpn3:8485/dmpcluster</value>
    </property>
    <property>
          <name>dfs.client.failover.proxy.provider.dmpcluster</name>
          <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>
    <property>
          <name>dfs.ha.fencing.methods</name>
          <value>shell(/bin/true)</value>
    </property>
    <property>
          <name>dfs.ha.fencing.ssh.private-key-files</name>
          <value>/home/miuser/.ssh/id_rsa</value>
    </property>
    <property>
          <name>dfs.ha.automatic-failover.enabled</name>
          <value>true</value>
    </property>
</configuration>
  • yarn-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>dmpn1</value>
  </property>
<property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>3</value>
</property>
  <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle,spark_shuffle</value>
  </property>
  <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>24576</value>
  </property>
  <property>
      <name>yarn.scheduler.maximum-allocation-vcores</name>
      <value>18</value>
  </property>
  <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>8192</value>
  </property>
  <property>
      <name>yarn.nodemanager.resource.cpu-vcores</name>
      <value>6</value>
  </property>
  <property>
      <name>yarn.log-aggregation-enable</name>
      <value>true</value>
  </property>
  <property>
      <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
      <value>3600</value>
  </property>
  <property>
      <name>yarn.log-aggregation.retain-seconds</name>
      <value>1209600</value>
  </property>
  <property>
      <name>yarn.log-aggregation.retain-check-interval-seconds</name>
      <value>172800</value>
  </property>
  <property>
      <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
      <value>org.apache.spark.network.yarn.YarnShuffleService</value>
  </property>
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>dmpcluster</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>dmpn1</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>dmpn2</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address.rm1</name>
  <value>dmpn1:8088</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.address.rm2</name>
  <value>dmpn2:8088</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address.rm1</name>
  <value>dmpn1:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address.rm2</name>
  <value>dmpn2:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address.rm1</name>
  <value>dmpn1:8031</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address.rm2</name>
  <value>dmpn2:8031</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address.rm1</name>
  <value>dmpn1:8033</value>
</property>
<property>
  <name>yarn.resourcemanager.admin.address.rm2</name>
  <value>dmpn2:8033</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>dmpn1:2181,dmpn2:2181,dmpn3:2181</value>
</property>
</configuration>
  • zoo.cfg
tickTime=2000
initLimit=100
syncLimit=5
dataDir=/opt/zookeeper-3.4.13/data
clientPort=2181
server.1=dmpn1:2888:3888
server.2=dmpn2:2888:3888
server.3=dmpn3:2888:3888
  • Java Namenode HA client example (a minimal runnable class):
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsHaClient {
    public static void main(String[] args) throws IOException {
        // Point the client at the nameservice rather than a single namenode host,
        // so ConfiguredFailoverProxyProvider can switch between the two namenodes.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://nameservice1");
        conf.set("dfs.nameservices", "nameservice1");
        conf.set("dfs.ha.namenodes.nameservice1", "namenode76,namenode81");
        conf.set("dfs.namenode.rpc-address.nameservice1.namenode76", "darhhdlm1:8020");
        conf.set("dfs.namenode.rpc-address.nameservice1.namenode81", "darhhdlm2:8020");
        conf.set("dfs.client.failover.proxy.provider.nameservice1",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");

        // The FileSystem handle can now be used for HDFS operations against the HA cluster.
        FileSystem hdfs = FileSystem.get(conf);
    }
}
