error-creating-vxlan-interface-file-exists

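Background: a customer runs a three-node Docker Swarm cluster. The customer enabled the built-in firewalld firewall, which changed the iptables rules and in turn caused containers to fail to start.

After the containers were started, the following errors were reported: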

Aug 2 16:14:51 BDTP-DAW-P01 firewalld[22631]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w2 -t nat -C POSTROUTING -m ipvs --ipvs -d 192.168.88.0/24 -j SNAT --to-source 192.168.88.2' failed: iptables: No chain/target/match by that name.
Aug 2 16:14:56 BDTP-DAW-P01 firewalld[22631]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w2 -t nat -C POSTROUTING -m ipvs --ipvs -d 192.168.88.0/24 -j SNAT --to-source 192.168.88.2' failed: iptables: No chain/target/match by that name.
Aug 2 16:15:01 BDTP-DAW-P01 firewalld[22631]: WARNING: COMMAND_FAILED: '/usr/sbin/iptables -w2 -t nat -C POSTROUTING -m ipvs --ipvs -d 192.168.88.0/24 -j SNAT --to-source 192.168.88.2' failed: iptables: No chain/target/match by that name.
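
As a rough way to see whether the firewall change removed the chains Docker normally keeps in the nat table, the sketch below scans `iptables -t nat -S` output for the `DOCKER` and `DOCKER-INGRESS` chains. The function name `missing_docker_chains` is hypothetical, and the assumption that these two chains are the ones worth checking comes from how Docker usually sets up its rules, not from the logs above.

```shell
#!/bin/sh
# Read `iptables -t nat -S` output on stdin and print each expected
# Docker chain that is NOT declared there (one name per line).
# Assumption: DOCKER and DOCKER-INGRESS are the chains of interest.
missing_docker_chains() {
    rules=$(cat)
    for chain in DOCKER DOCKER-INGRESS; do
        # `iptables -S` declares a chain as a line of the form "-N <name>".
        printf '%s\n' "$rules" | grep -qx -- "-N $chain" || echo "$chain"
    done
}
```

It could be run as root with `iptables -t nat -S | missing_docker_chains`; empty output would mean both chains are still present.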

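We then stopped the firewall and flushed all iptables rules. When we started the containers again, machine 2 kept reporting the following errors: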

Aug 2 17:29:21 BDTP-DAW-P02 dockerd: time="2019-08-02T17:29:21.649764914+08:00" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint_count e6f0fahm2gmv4ozdg2i8yrlyj], retrying...."
Aug 2 17:29:21 BDTP-DAW-P02 dockerd: time="2019-08-02T17:29:21.649905556+08:00" level=error msg="fatal task error" error="subnet sandbox join failed for \"192.168.87.0/24\": error creating vxlan interface: file exists" module=node/agent/taskmanager node.id=y8xarqdvtk4gf5yyipumse0wz service.id=r97knj4dzde7d5chfxr3l2a3l task.id=u4sjye1iteaeke4ax11zjuxn2
Aug 2 17:29:21 BDTP-DAW-P02 dockerd: time="2019-08-02T17:29:21.650089555+08:00" level=error msg="failed adding service binding for 82106d486e24761022819aaa74588fd8e98cf879cca52a1578751db11669f662 epRec:{zk1.1.lucwiqk4h09ryewbxo9y3yqnj zk1 nklecxnyyey6a37r4ljb0xmsv 192.168.87.10 192.168.87.11 [] [] [a0d2f1160996] false} err:network e6f0fahm2gmv4ozdg2i8yrlyj not found"
Aug 2 17:29:21 BDTP-DAW-P02 dockerd: time="2019-08-02T17:29:21.650168000+08:00" level=error msg="failed adding service binding for 88a3a30832ac2823b4c114d56f83a81ca113c2eab7883c1d88fa0c304394f480 epRec:{zk3.1.jevlod1t48whjr0poklsq8rxa zk3 mdpwqbczckavrbreaqfaw7zjb 192.168.87.15 192.168.87.16 [] [] [96fdc0e07bac] false} err:network e6f0fahm2gmv4ozdg2i8yrlyj not found"
Aug 2 17:29:21 BDTP-DAW-P02 dockerd: time="2019-08-02T17:29:21.650222574+08:00" level=warning msg="rmServiceBinding handleEpTableEvent zk1 82106d486e24761022819aaa74588fd8e98cf879cca52a1578751db11669f662 aborted c.serviceBindings[skey] !ok"
Aug 2 17:29:21 BDTP-DAW-P02 dockerd: time="2019-08-02T17:29:21.650257330+08:00" level=warning msg="rmServiceBinding handleEpTableEvent zk3 88a3a30832ac2823b4c114d56f83a81ca113c2eab7883c1d88fa0c304394f480 aborted c.serviceBindings[skey] !ok"
Aug 2 17:29:21 BDTP-DAW-P02 dockerd: time="2019-08-02T17:29:21.688268767+08:00" level=warning msg="failed to deactivate service binding for container zk2.1.ti2ctqtognde3hbkqidune1s6" error="No such container: zk2.1.ti2ctqtognde3hbkqidune1s6" module=node/agent node.id=y8xarqdvtk4gf5yyipumse0wz
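
To pull the affected subnets out of a noisy dockerd log, a small filter like the one below can help. It is only a sketch: the function name `failed_vxlan_subnets` is hypothetical, and it assumes the error text looks like the lines shown above.

```shell
#!/bin/sh
# Read dockerd log lines on stdin and print the distinct subnets that
# appear in "error creating vxlan interface: file exists" messages.
failed_vxlan_subnets() {
    grep 'error creating vxlan interface: file exists' \
        | grep -oE 'join failed for [^0-9]*[0-9]+(\.[0-9]+){3}/[0-9]+' \
        | grep -oE '[0-9]+(\.[0-9]+){3}/[0-9]+' \
        | sort -u
}
```

On a host where dockerd logs to the journal (an assumption; on other systems the log may be in a file such as the syslog), it could be fed with `journalctl -u docker --since today | failed_vxlan_subnets`.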

root@BDTP-DAW-P01:/root# docker service ps zk2 --no-trunc
ID NAME IMAGE NODE DESIRED STATE CURRENT STATE ERROR PORTS
yi6dvo6ygpbh20ca52hauuqjj zk2.1 registry.datapipeline.com/dp_zookeeper:5.2.1 BDTP-DAW-P02 Ready Rejected 3 seconds ago "subnet sandbox join failed for "192.168.88.0/24": error creating vxlan interface: file exists"
17mnybhe40apzdrowlf24qzb1 \_ zk2.1 registry.datapipeline.com/dp_zookeeper:5.2.1 BDTP-DAW-P02 Shutdown Rejected 8 seconds ago "subnet sandbox join failed for "192.168.88.0/24": error creating vxlan interface: file exists"
2ze9voq3mqinohoq7fpb8erhz \_ zk2.1 registry.datapipeline.com/dp_zookeeper:5.2.1 BDTP-DAW-P02 Shutdown Rejected 13 seconds ago "subnet sandbox join failed for "192.168.88.0/24": error creating vxlan interface: file exists"
fevz2qphrb86va2jfkqi9oljj \_ zk2.1 registry.datapipeline.com/dp_zookeeper:5.2.1 BDTP-DAW-P02 Shutdown Rejected 18 seconds ago "subnet sandbox join failed for "192.168.88.0/24": error creating vxlan interface: file exists"
27thiv4zvxqcwk2yjsdnuiiyz \_ zk2.1 registry.datapipeline.com/dp_zookeeper:5.2.1 BDTP-DAW-P02 Shutdown Rejected 23 seconds ago "subnet sandbox join failed for "192.168.88.0/24": error creating vxlan interface: file exists"

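We were stuck at this point for a long time. We tried the following, and none of it helped:

1 Stopped Docker on all three machines and restarted it

2 Flushed all iptables rules on all three machines and stopped firewalld

3 Rebuilt the Docker Swarm cluster and re-applied the node labels

4 Changed the network address of the docker0 interface

5 Changed the network address of the Docker Swarm cluster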

Searching Google and Baidu then showed that this is a known bug. The GitHub issue is:

https://github.com/docker/libnetwork/issues/1765

After analyzing and understanding the issue, we concluded that the following procedure should work:
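
The exact commands are not listed here. One workaround commonly suggested for this error, shown purely as an assumption, is to look for leftover vxlan interfaces in the host network namespace and delete them while Docker is stopped. The sketch below only prints the `ip link delete` commands instead of running them. The function names are hypothetical, and the `vx-` name prefix is an assumption about how the overlay driver names its interfaces.

```shell
#!/bin/sh
# List interfaces whose names start with "vx-" under a sysfs-style
# directory (default: /sys/class/net), one name per line.
list_vxlan_links() {
    net_dir="${1:-/sys/class/net}"
    for path in "$net_dir"/vx-*; do
        # An unmatched glob stays literal, so skip names that do not exist.
        [ -e "$path" ] || [ -L "$path" ] || continue
        basename "$path"
    done
}

# Print, but do not run, a delete command for each leftover interface,
# so the list can be reviewed before anything is removed.
print_cleanup_commands() {
    list_vxlan_links "$1" | while read -r link; do
        echo "ip link delete $link"
    done
}
```

Calling `print_cleanup_commands` with no argument inspects `/sys/class/net`. Each printed command would need root to run, and `ip -d link show <name>` can be used first to confirm the interface really is a vxlan device.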

We stopped Docker on all three machines and applied the procedure above, but it still had no effect. We then tried the following:

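1 Stopped all Docker Swarm services

2 Deleted the entire /var/run/docker directory

3 Reinstalled Docker and deleted the /data/docker data directory

4 Made machine 2 the Docker Swarm leader and started the services on machine 2

5 Changed the containers' mount directories

Still no effect; the error persisted.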

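At that point we were ready to play our last card, rebooting the machines, but in the end we did not do it.

Instead we tried one final approach: changing the name of the Docker Swarm overlay network. That solved the problem.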

docker network create --driver overlay newname
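
If the services are managed directly with `docker service` rather than through a stack file, moving them to the renamed network might look like the sketch below. This is an assumption about one way to do it, not a record of the steps taken here. `run`, `switch_overlay`, and `DRY_RUN` are hypothetical names; with `DRY_RUN=1` the commands are printed instead of executed.

```shell
#!/bin/sh
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create an overlay network under a new name, then attach each service
# to it and detach the service from the old network.
# Usage: switch_overlay OLD_NET NEW_NET SERVICE...
switch_overlay() {
    old_net="$1"
    new_net="$2"
    shift 2
    run docker network create --driver overlay "$new_net"
    for svc in "$@"; do
        run docker service update --network-add "$new_net" --network-rm "$old_net" "$svc"
    done
}
```

For example, `DRY_RUN=1; switch_overlay oldname newname zk1 zk2 zk3` prints the four commands for review before anything changes. `oldname` stands in for the existing network's name, which is not given above.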