Resolving HDFS DataNode Volume Failures

2021/08/13

The error message shown on the HDFS web UI clearly identifies the failing DataNode and the affected data directory.
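Before touching the configuration, it can be worth confirming the failure from the DataNode side as well. A minimal check, assuming the log directory matches the install prefix used later in this post (the log file name pattern and the exact message text vary by version, so both are assumptions):

# Search the DataNode log for volume failure messages
grep -i "failed volume" /usr/local/hadoop/logs/hadoop-*-datanode-*.log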

Follow the hot-swap-drive procedure from the official documentation:

https://hadoop.apache.org/docs/r2.7.5/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html#DataNode_Hot_Swap_Drive

First, log in to the host running the target DataNode (192.168.34.30) and edit the hdfs-site.xml configuration file:

vim /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Find the dfs.datanode.data.dir property (it may appear under the deprecated name dfs.data.dir) and remove the failed directory from its value:

<property>
  <name>dfs.data.dir</name>
-  <value>/disk/sata1/hdfs/data,/disk/sata2/hdfs/data,/disk/sata3/hdfs/data,/disk/sata4/hdfs/data,/disk/sata5/hdfs/data,/disk/sata6/hdfs/data,/disk/sata7/hdfs/data,/disk/sata8/hdfs/data,/disk/sata9/hdfs/data,/disk/sata10/hdfs/data,/disk/sata11/hdfs/data,/disk/sata12/hdfs/data</value>
+  <value>/disk/sata1/hdfs/data,/disk/sata2/hdfs/data,/disk/sata3/hdfs/data,/disk/sata4/hdfs/data,/disk/sata5/hdfs/data,/disk/sata6/hdfs/data,/disk/sata7/hdfs/data,/disk/sata8/hdfs/data,/disk/sata9/hdfs/data,/disk/sata10/hdfs/data,/disk/sata11/hdfs/data</value>
</property>

Then, on any node in the Hadoop cluster, run the admin command /usr/local/hadoop/bin/hdfs:

# Reload the configuration
hdfs dfsadmin -reconfig datanode 192.168.34.30:50020 start

# Check the reconfiguration status
hdfs dfsadmin -reconfig datanode 192.168.34.30:50020 status

Output like the following indicates the change was applied successfully:

SUCCESS: Change property dfs.datanode.data.dir
        From: "[DISK]file:/disk/sata1/hdfs/data/,[DISK]file:/disk/sata2/hdfs/data/,[DISK]file:/disk/sata3/hdfs/data/,[DISK]file:/disk/sata4/hdfs/data/,[DISK]file:/disk/sata5/hdfs/data/,[DISK]file:/disk/sata6/hdfs/data/,[DISK]file:/disk/sata7/hdfs/data/,[DISK]file:/disk/sata8/hdfs/data/,[DISK]file:/disk/sata9/hdfs/data/,[DISK]file:/disk/sata10/hdfs/data/,[DISK]file:/disk/sata11/hdfs/data/"
        To: "/disk/sata1/hdfs/data,/disk/sata2/hdfs/data,/disk/sata3/hdfs/data,/disk/sata4/hdfs/data,/disk/sata5/hdfs/data,/disk/sata6/hdfs/data,/disk/sata7/hdfs/data,/disk/sata8/hdfs/data,/disk/sata9/hdfs/data,/disk/sata10/hdfs/data,/disk/sata11/hdfs/data"

At this point the error on the HDFS web UI does not disappear (the failed-volume count is only cleared when the DataNode restarts and re-registers), but this does not affect the subsequent steps.
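If you want to check the count programmatically rather than via the web UI, the DataNode exposes a NumFailedVolumes metric over JMX. A sketch, assuming the Hadoop 2.x default DataNode web port 50075 (the exact bean name can vary by version):

# Query the DataNode JMX endpoint for the failed-volume count
curl -s 'http://192.168.34.30:50075/jmx?qry=Hadoop:service=DataNode,name=FSDatasetState*' | grep -i NumFailedVolumes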

After ops replaces the failed disk, repeat the same procedure: edit the configuration file to add the new disk back, then run reconfig again.
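A minimal sketch of that round trip, assuming the replacement disk shows up as /dev/sdl, reuses the old sata12 mount point, and the DataNode runs as the hadoop user (all three are assumptions about this environment):

# Format and mount the replacement disk, then recreate the data directory
mkfs.ext4 /dev/sdl
mount /dev/sdl /disk/sata12
mkdir -p /disk/sata12/hdfs/data
chown -R hadoop:hadoop /disk/sata12/hdfs

# After adding /disk/sata12/hdfs/data back to the value in hdfs-site.xml:
hdfs dfsadmin -reconfig datanode 192.168.34.30:50020 start
hdfs dfsadmin -reconfig datanode 192.168.34.30:50020 status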

If all goes well, the Failed Volumes error disappears.
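As an optional sanity check, the node's entry in the cluster report should show the restored capacity (the grep context count is a guess at the report's output layout):

# Confirm the DataNode's configured capacity includes the new disk
hdfs dfsadmin -report | grep -A 6 '192.168.34.30'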