
Fixing "Failed to fetch" Errors in the Hadoop Reduce Phase

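Recently, while running Hadoop 1.1, I hit jobs where map reached 100% but reduce stayed stuck at 0%, and in some cases the datanode would not start at all. The logs showed Failed to fetch messages along with "connection refused" errors. The Hadoop configuration looked fine, so I suspected the /etc/hosts file. The Hadoop wiki page on connection refused errors says roughly the following: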

Connection Refused

You get a ConnectionRefused Exception when there is a machine at the address specified, but there is no program listening on the specific TCP port the client is using, and there is no firewall in the way silently dropping TCP connection requests. If you do not know what a TCP connection request is, please consult the specification.

Unless there is a configuration error at either end, a common cause for this is the Hadoop service isn't running.

  1. Check the hostname the client is using is correct.
  2. Check the IP address the client gets for the hostname is correct.
  3. Check that there isn't an entry for your hostname mapped to 127.0.0.1 or 127.0.1.1 in /etc/hosts (Ubuntu is notorious for this)
  4. Check the port the client is using matches that the server is offering a service on.
  5. On the server, try a telnet localhost <port> to see if the port is open there.

  6. On the client, try a telnet <server> <port> to see if the port is accessible remotely.

  7. Try connecting to the server/port from a different machine, to see if it is just the single client misbehaving.

None of these are Hadoop problems, they are host, network and firewall configuration issues. As it is your cluster, only you can find out and track down the problem.
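Several of the checks in the list above (hostname resolution and the two telnet tests) can be approximated with a short script. This is only a rough sketch using Python's standard socket module. The function names resolve and port_open are made up for illustration, and it covers IPv4 only.

```python
import socket


def resolve(hostname):
    """Return the sorted IPv4 addresses the local resolver gives for hostname.

    An empty list means the name could not be resolved at all.
    """
    try:
        infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds.

    This is roughly what `telnet host port` tells you: False covers both
    "connection refused" and a timeout caused by a firewall dropping packets.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

On a cluster node, resolve(socket.gethostname()) shows which address the node's own name maps to. A result containing only a loopback address is the situation item 3 warns about. port_open("localhost", 50070) run on the server corresponds to item 5, and the same call with the server's hostname run from another machine corresponds to item 6.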

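The key point here: in /etc/hosts, each node's hostname must be bound to the IP address (or domain name, as listed in the slaves or masters file) that the node uses in the Hadoop configuration. For example:  <node IP address> hadoop01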

Also, if a node's hostname has never been changed, the hosts file will contain a binding between the hostname and 127.0.0.1:  127.0.0.1 localhost localhost.localdomain

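Use the hostname command to check the machine's name. It may be localhost.localdomain or localhost, in which case that binding needs to be commented out. With that change the problem was solved and jobs ran normally. However, I still could not view files on HDFS at hostname:50070 (the "Browse the filesystem" link would not open).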
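The hosts-file check described above can also be done programmatically. The following is a minimal sketch, not a complete hosts-file parser. The function name loopback_bindings is hypothetical, and the sample text uses a made-up documentation address (192.0.2.10) for hadoop01.

```python
import ipaddress


def loopback_bindings(hosts_text, hostname):
    """Return (line_number, line) pairs that map hostname to a loopback address."""
    hits = []
    for number, raw in enumerate(hosts_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        try:
            is_loopback = ipaddress.ip_address(address).is_loopback
        except ValueError:
            continue  # first field is not a valid IP address; skip the line
        if is_loopback and hostname in names:
            hits.append((number, raw))
    return hits


# Hypothetical hosts content for illustration only.
sample = (
    "127.0.0.1 localhost localhost.localdomain\n"
    "127.0.1.1 hadoop01\n"
    "192.0.2.10 hadoop01\n"
)
print(loopback_bindings(sample, "hadoop01"))  # [(2, '127.0.1.1 hadoop01')]
```

To check a real node, pass in the contents of /etc/hosts together with the output of the hostname command. Any line the function returns is a candidate for commenting out.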
