情况

项目中有地方使用了docker sdk,需要远程发命令去到agent上,agent接入了docker sdk,agent收到对应的下发停止命令,然后执行停止容器

但是无论如何怎么执行,都会执行停止失败,但是最后去宿主机上看,都是执行成功的,每次都是返回这样的错误

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

但是执行create start都是正常的,就是restart和stop都是报上面的错误,

排查过程

但是有个例外就是docker server是这个版本的时候是正常的

Server: Docker Engine - Community
 Engine:
  Version:          18.09.9
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.11.13
  Git commit:       039a7df
  Built:            Thu Feb 28 06:40:58 2019
  OS/Arch:          linux/amd64
  Experimental:     false

不正常的是下面的

Server: Docker Engine - Community
 Engine:
  Version:          19.03.11
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.10
  Git commit:       42e35e61f3
  Built:            Mon Jun  1 09:12:26 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

然后让运维降级

Server: Docker Engine - Community
 Engine:
  Version:          18.09.9
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.11.13
  Git commit:       039a7df
  Built:            Wed Sep  4 16:22:32 2019
  OS/Arch:          linux/amd64
  Experimental:     false

发现还是出现上面的情况

然后抓包
my

POST /v1.39/containers/b1f8bb98a25d/stop?t=20 HTTP/1.1
Host: 172.16.5.245:2375
User-Agent: Go-http-client/1.1
Content-Length: 0
Content-Type: text/plain

有时候还没数据返回
HTTP/1.1 204 No Content
Api-Version: 1.39
Docker-Experimental: false
Ostype: linux
Server: Docker/18.09.9 (linux)
Date: Wed, 01 Jul 2020 04:35:35 GMT

注意图片倒数第5行,ip 192.168.1.133是我程序ip 172.16.5.245是docker server,主动发出fin包,说明这个tcp连接是我程序端主动断开,然后docker server收到fin包后,回了http response.时长是20秒

再观察直接使用docker stop 容器
server

POST /v1.39/containers/b1f8bb98a25d/stop HTTP/1.1
Host: 172.16.5.245:2375
User-Agent: Docker-Client/19.03.11 (linux)
Content-Length: 0
Content-Type: text/plain

HTTP/1.1 204 No Content
Api-Version: 1.39
Docker-Experimental: false
Ostype: linux
Server: Docker/18.09.9 (linux)
Date: Wed, 01 Jul 2020 04:35:35 GMT

docker stop是首先先连接 然后发送pin请求,然后再发送post stop,最后还是172.16.5.245docker server主动发出fin包,时长是10秒

开始是怀疑docker sdk有问题,看了一下stop的源码
目录github.com/docker/docker/client/request.go

func (cli *Client) ContainerStop(ctx context.Context, containerID string, timeout *time.Duration) error {
    query := url.Values{}
    if timeout != nil {
        query.Set("t", timetypes.DurationToSecondsString(*timeout))
    }
    resp, err := cli.post(ctx, "/containers/"+containerID+"/stop", query, nil, nil)
    ensureReaderClosed(resp)
    return err
}

func (cli *Client) post(ctx context.Context, path string, query url.Values, obj interface{}, headers map[string][]string) (serverResponse, error) {
    body, headers, err := encodeBody(obj, headers)
    if err != nil {
        return serverResponse{}, err
    }
    return cli.sendRequest(ctx, http.MethodPost, path, query, body, headers)
}

func (cli *Client) sendRequest(ctx context.Context, method, path string, query url.Values, body io.Reader, headers headers) (serverResponse, error) {
    req, err := cli.buildRequest(method, cli.getAPIPath(ctx, path, query), body, headers)
    if err != nil {
        return serverResponse{}, err
    }
    resp, err := cli.doRequest(ctx, req)
    if err != nil {
        return resp, errdefs.FromStatusCode(err, resp.statusCode)
    }
    err = cli.checkResponseErr(resp)
    return resp, errdefs.FromStatusCode(err, resp.statusCode)
}

func (cli *Client) doRequest(ctx context.Context, req *http.Request) (serverResponse, error) {
    serverResp := serverResponse{statusCode: -1, reqURL: req.URL}

    req = req.WithContext(ctx)
    resp, err := cli.client.Do(req)
...后面略
}

func ensureReaderClosed(response serverResponse) {
    if response.body != nil {
        // Drain up to 512 bytes and close the body to let the Transport reuse the connection
        io.CopyN(ioutil.Discard, response.body, 512)
        response.body.Close()
    }
}

分析源码
这个函数 ContainerStop(控制http请求超时, 容器id, 传递给docker server超时时间)

在看看服务端处理
目录github.com/docker/docker/api/server/router/container/container_routers.go

func (s *containerRouter) postContainersStop(ctx context.Context, w http.ResponseWriter, r *http.Request, vars map[string]string) error {
    if err := httputils.ParseForm(r); err != nil {
        return err
    }

    var seconds *int
    if tmpSeconds := r.Form.Get("t"); tmpSeconds != "" {
        valSeconds, err := strconv.Atoi(tmpSeconds)
        if err != nil {
            return err
        }
        seconds = &valSeconds
    }

    if err := s.backend.ContainerStop(vars["name"], seconds); err != nil {
        return err
    }
    w.WriteHeader(http.StatusNoContent)

    return nil
}

重点是ContainerStop这个function
目录github.com/docker/docker/daemon/stop.go

func (daemon *Daemon) ContainerStop(name string, timeout *int) error {
    container, err := daemon.GetContainer(name)
    if err != nil {
        return err
    }
    if !container.IsRunning() {
        return containerNotModifiedError{running: false}
    }
    if timeout == nil {
        stopTimeout := container.StopTimeout()
        timeout = &stopTimeout
    }
    if err := daemon.containerStop(container, *timeout); err != nil {
        return errdefs.System(errors.Wrapf(err, "cannot stop container: %s", name))
    }
    return nil
}

// containerStop sends a stop signal, waits, sends a kill signal.
func (daemon *Daemon) containerStop(container *containerpkg.Container, seconds int) error {
    if !container.IsRunning() {
        return nil
    }

    stopSignal := container.StopSignal()
    // 1. Send a stop signal
    if err := daemon.killPossiblyDeadProcess(container, stopSignal); err != nil {
        // While normally we might "return err" here we're not going to
        // because if we can't stop the container by this point then
        // it's probably because it's already stopped. Meaning, between
        // the time of the IsRunning() call above and now it stopped.
        // Also, since the err return will be environment specific we can't
        // look for any particular (common) error that would indicate
        // that the process is already dead vs something else going wrong.
        // So, instead we'll give it up to 2 more seconds to complete and if
        // by that time the container is still running, then the error
        // we got is probably valid and so we force kill it.
        ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
        defer cancel()

        if status := <-container.Wait(ctx, containerpkg.WaitConditionNotRunning); status.Err() != nil {
            logrus.Infof("Container failed to stop after sending signal %d to the process, force killing", stopSignal)
            if err := daemon.killPossiblyDeadProcess(container, 9); err != nil {
                return err
            }
        }
    }

    // 2. Wait for the process to exit on its own
    ctx := context.Background()
    if seconds >= 0 {
        var cancel context.CancelFunc
        ctx, cancel = context.WithTimeout(ctx, time.Duration(seconds)*time.Second)
        defer cancel()
    }

    if status := <-container.Wait(ctx, containerpkg.WaitConditionNotRunning); status.Err() != nil {
        logrus.Infof("Container %v failed to exit within %d seconds of signal %d - using the force", container.ID, seconds, stopSignal)
        // 3. If it doesn't, then send SIGKILL
        if err := daemon.Kill(container); err != nil {
            // Wait without a timeout, ignore result.
            <-container.Wait(context.Background(), containerpkg.WaitConditionNotRunning)
            logrus.Warn(err) // Don't return error because we only care that container is stopped, not what function stopped it
        }
    }

    daemon.LogContainerEvent(container, "stop")
    return nil
}

容器停止大概3步

  1. 发送停止信号
  2. 等待进程自己退出
  3. 如果不退出,就发SIGKILL信号 就是kill -9

其中container.Wait这个function的超时是sdk client传过来的,看下他的方法

func (s *State) Wait(ctx context.Context, condition WaitCondition) <-chan StateStatus {
    s.Lock()
    defer s.Unlock()

    if condition == WaitConditionNotRunning && !s.Running {
        // Buffer so we can put it in the channel now.
        resultC := make(chan StateStatus, 1)

        // Send the current status.
        resultC <- StateStatus{
            exitCode: s.ExitCode(),
            err:      s.Err(),
        }

        return resultC
    }

    // If we are waiting only for removal, the waitStop channel should
    // remain nil and block forever.
    var waitStop chan struct{}
    if condition < WaitConditionRemoved {
        waitStop = s.waitStop
    }

    // Always wait for removal, just in case the container gets removed
    // while it is still in a "created" state, in which case it is never
    // actually stopped.
    waitRemove := s.waitRemove

    resultC := make(chan StateStatus)

    go func() {
        select {
        case <-ctx.Done():
            // Context timeout or cancellation.
            resultC <- StateStatus{
                exitCode: -1,
                err:      ctx.Err(),
            }
            return
        case <-waitStop:
        case <-waitRemove:
        }

        s.Lock()
        result := StateStatus{
            exitCode: s.ExitCode(),
            err:      s.Err(),
        }
        s.Unlock()

        resultC <- result
    }()

    return resultC
}

其中里面有个等待停止的信号,如果没收到就用sdk client提交过来的超时时间
这里在看回调用docker sdk的地方

timeOut := time.Duration(20) * time.Second
ctx, cancel := context.WithTimeout(context.Background(), timeOut)
    defer cancel()
cli.ContainerStop(ctx, pubParam.Id, &timeOut)

如果改成这样就正常了

timeOut := time.Duration(20) * time.Second
//比timeOut小就可以
stopTimeout := time.Duration(10) * time.Second
ctx, cancel := context.WithTimeout(context.Background(), timeOut)
    defer cancel()
cli.ContainerStop(ctx, containerd, &stopTimeout)

结果

出现的原因就是请求设置的httpclient超时和传给docker server停止容器的超时时间是一样的,这样会导致docker 使用你的超时时间去等待容器停止,容器停止后给你回包,发现这个时候你已经主动发fin包要断tcp连接了,只要将传给docker server停止容器的超时时间小于你httpclient的超时时间,数据就能正常返回,sdk就能正常处理

场景

最近项目中使用了golang.org/x/sys 这个包,但是编译的时候出现

go: github.com/golang/[email protected]: parsing go.mod: unexpected module path "golang.org/x/sys"

当时立马查看了go.mod

发现是这样的

replace (
golang.org/x/sys => github.com/golang/sys v0.0.0-20200602225109-6fdc65e7d980
)

我在项目中是这样使用

import (
  "golang.org/x/sys"
)

这个错误是说加载模块出错,明明替换了的,怎么会加载出错,第一反应是马上去看golang.org/x/sys/go.mod

module golang.org/x/sys

go 1.12

那表示我写的没错,那我写的没错,是不是别的地方写错了,然后全局搜索github.com/golang/sys,发现有地方是直接使用了github.com/golang/sys,然后将他改成golang.org/x/sys,然后编译通过。

问题

测试环境正常,正常环境使用了阿里云的SLB后出现报错:

Error on AMQP connection <0.19464.506>: enotconn (socket is not connected)

频率大概是1分钟一条

原因总结

tcp监控检查有2种方式

方式1

aliyunSLB->RabbitMQ: SYN
RabbitMQ->aliyunSLB: SYN,ACK
aliyunSLB->RabbitMQ: ACK
aliyunSLB->RabbitMQ: RST,ACK

方式2

HAProxy->RabbitMQ: SYN
RabbitMQ->HAProxy: SYN,ACK
HAProxy->RabbitMQ: RST,ACK

区别

两者其实存在一定区别:根据 TCP/IP 协议中的描述,被动打开连接的一方,在收到三次握手的最后一个 ACK 时才进入 ESTABLISHED 状态,进而才能够在用户态通过 accept 取走;对比上面两种实现方式,aliyunSLB 的实现方式为先完成三次握手,再通过 RST 强行终止连接,RabbitMQ 看到的情况为:有客户端向自己建立了连接,又莫名其妙的断开;而 HAProxy 的实现方式为通过 RST 直接终止三次握手,RabbitMQ 在业务层面不会感知到这种行为;

阿里云给出的方案

正常的TCP三次握手,LVS节点服务器在收到后端ECS返回的SYN+ACK数据包后,会进一步发送ACK数据包,随后立即发送RST数据包中断TCP连接。
该实现机制可能会导致后端ECS认为相关TCP连接出现异常(非正常退出),并在业务软件如Java连接池等日志中抛出相应的错误信息,如Connection reset by peer。

解决方案:

TCP监听采用HTTP方式进行健康检查。
在后端ECS配置了获取客户端真实IP后,忽略来自前述负载均衡服务地址段相关访问导致的连接错误。

阿里云SLB健康检查流程

影响范围

目前看来,该问题会造成

  • 日志中出现大量连接相关错误信息;
  • 占用部分可用的并发连接数量;
  • 浪费一定的 Erlang VM 调度处理时间(处理这种不必要逻辑所浪费的 reduction 时间片);
  • (在相关进程的创建销毁过程中)可能会造成一定程度的内存消耗(和 GC 处理方式有关系);

源码分析

在 rabbit_reader.erl 中


...
%% 正式开始 TCP + AMQP 协议处理
start_connection(Parent, HelperSup, Deb, Sock) ->
    ...
    %% 获取当前 TCP 连接两端的 ip 和 port 信息,拼接成连接信息字符串
    Name = case rabbit_net:connection_string(Sock, inbound) of
               {ok, Str}         -> Str;
               %% 在获取时,触发 socket 错误,认为当前连接已不存在
               %% 注意:这里没有输出异常日志
               {error, enotconn} -> rabbit_net:fast_close(Sock),
                                    exit(normal);
               {error, Reason}   -> socket_error(Reason),
                                    rabbit_net:fast_close(Sock),
                                    exit(normal)
           end,
    ...
    %% 输出异常日志的地方
    %% 关键:上面 rabbit_net:connection_string 中调用的就是 rabbit_net:socket_ends
    %% 同样的代码上面没有报错,而此处会报错,说明在两段临近代码的执行间发生了 TCP 连接断开!
    {PeerHost, PeerPort, Host, Port} =
        socket_op(Sock, fun (S) -> rabbit_net:socket_ends(S, inbound) end),
    ...
    done.
...
socket_op(Sock, Fun) ->
    case Fun(Sock) of
        {ok, Res}       -> Res;
        {error, Reason} -> %% 输出错误日志
                           socket_error(Reason),
                           %% 关闭 TCP 连接
                           rabbit_net:fast_close(Sock),
                           %% 正常退出 rabbit_reader 进程
                           exit(normal)
    end.
...
socket_error(Reason) when is_atom(Reason) ->
    log(error, "Error on AMQP connection ~p: ~s~n",
        [self(), rabbit_misc:format_inet_error(Reason)]);
...

在 rabbit_net.erl 中

...
connection_string(Sock, Direction) ->
    %% 获取 socket 连接两端 ip 和 port 信息
    case socket_ends(Sock, Direction) of
        {ok, {FromAddress, FromPort, ToAddress, ToPort}} ->
            {ok, rabbit_misc:format(
                   "~s:~p -> ~s:~p",
                   [maybe_ntoab(FromAddress), FromPort,
                    maybe_ntoab(ToAddress),   ToPort])};
        Error ->
            Error
    end.

socket_ends(Sock, Direction) ->
    {From, To} = sock_funs(Direction),
    %% 获取 tcp 通信两端的 ip 和 port
    case {From(Sock), To(Sock)} of
        {{ok, {FromAddress, FromPort}}, {ok, {ToAddress, ToPort}}} ->
            {ok, {rdns(FromAddress), FromPort,
                  rdns(ToAddress),   ToPort}};
        {{error, _Reason} = Error, _} ->
            Error;
        {_, {error, _Reason} = Error} ->
            Error
    end.
...

最终解决方案

最后我们的解决方案是
在SLB里用http做健康检查
curl -i -u guest:guest http://localhost:15672/api/vhosts
请求这个api接口,如果接口返回200,表示服务存在

因为15672和5672是同一个进程监听的,5672就是rabbitMQ的端口,所以使用此方案可以解决上述的错误问题

参考了这篇文章线上 RabbitMQ 问题排查,最终解决了问题

需求场景

健康检查的http请求过于频繁,不希望将这部分日志打印出来

改动

r := gin.New()
// r.Use(Logger())//不使用包提供的日志
//自定义输出日志
r.Use(gin.LoggerWithConfig(gin.LoggerConfig{
    Formatter: func(param gin.LogFormatterParams) string {
        //健康检查不输出
        if param.Path == "/check/health" {
            return ""
        }
        var statusColor, methodColor, resetColor string
        if param.IsOutputColor() {
            statusColor = param.StatusCodeColor()
            methodColor = param.MethodColor()
            resetColor = param.ResetColor()
        }

        if param.Latency > time.Minute {
            // Truncate in a golang < 1.8 safe way
            param.Latency = param.Latency - param.Latency%time.Second
        }
        return fmt.Sprintf("[GIN] %v |%s %3d %s| %13v | %15s |%s %-7s %s %#v\n%s",
            param.TimeStamp.Format("2006/01/02 - 15:04:05"),
            statusColor, param.StatusCode, resetColor,
            param.Latency,
            param.ClientIP,
            methodColor, param.Method, resetColor,
            param.Path,
            param.ErrorMessage,
        )
    },
    Output: gin.DefaultWriter,
}))

情况

每次需要打字都要先把charles关了才可以

解决方案

和普通手机一样设置网络代理,需要多加一个:
对以下网址不使用代理:
10.0.2.2,mumu.nie.netease.com,graph.facebook.com
否则开启Charles后,请求这几个地址的网络会一直处在connection中,卡死模拟器及键盘输入。