docker api stop 超时问题跟踪处理

情况

项目中有地方使用了docker sdk,需要远程发命令去到agent上,agent接入了docker sdk,agent收到对应的下发停止命令,然后执行停止容器

但是无论如何怎么执行,都会执行停止失败,但是最后去宿主机上看,都是执行成功的,每次都是返回这样的错误

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

但是执行create start都是正常的,就是restart和stop都是报上面的错误,

排查过程

但是有个例外就是docker server是这个版本的时候是正常的

Server: Docker Engine - Community
 Engine:
  Version:          18.09.9
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.11.13
  Git commit:       039a7df
  Built:            Thu Feb 28 06:40:58 2019
  OS/Arch:          linux/amd64
  Experimental:     false

不正常的是下面的

Server: Docker Engine - Community
 Engine:
  Version:          19.03.11
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.10
  Git commit:       42e35e61f3
  Built:            Mon Jun  1 09:12:26 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

然后让运维降级

Server: Docker Engine - Community
 Engine:
  Version:          18.09.9
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.11.13
  Git commit:       039a7df
  Built:            Wed Sep  4 16:22:32 2019
  OS/Arch:          linux/amd64
  Experimental:     false

发现还是出现上面的情况

然后抓包
my

POST /v1.39/containers/b1f8bb98a25d/stop?t=20 HTTP/1.1
Host: 172.16.5.245:2375
User-Agent: Go-http-client/1.1
Content-Length: 0
Content-Type: text/plain

有时候还没数据返回
HTTP/1.1 204 No Content
Api-Version: 1.39
Docker-Experimental: false
Ostype: linux
Server: Docker/18.09.9 (linux)
Date: Wed, 01 Jul 2020 04:35:35 GMT

注意图片倒数第5行,ip 192.168.1.133是我程序ip 172.16.5.245是docker server,主动发出fin包,说明这个tcp连接是我程序端主动断开,然后docker server收到fin包后,回了http response.时长是20秒

再观察直接使用docker stop 容器
server

POST /v1.39/containers/b1f8bb98a25d/stop HTTP/1.1
Host: 172.16.5.245:2375
User-Agent: Docker-Client/19.03.11 (linux)
Content-Length: 0
Content-Type: text/plain

HTTP/1.1 204 No Content
Api-Version: 1.39
Docker-Experimental: false
Ostype: linux
Server: Docker/18.09.9 (linux)
Date: Wed, 01 Jul 2020 04:35:35 GMT

docker stop是首先先连接 然后发送pin请求,然后再发送post stop,最后还是172.16.5.245docker server主动发出fin包,时长是10秒

开始是怀疑docker sdk有问题,看了一下stop的源码
目录github.com/docker/docker/client/request.go

func (cli *Client) ContainerStop(ctx context.Context, containerID string, timeout *time.Duration) error {
    query := url.Values{}
    if timeout != nil {
        query.Set("t", timetypes.DurationToSecondsString(*timeout))
    }
    resp, err := cli.post(ctx, "/containers/"+containerID+"/stop", query, nil, nil)
    ensureReaderClosed(resp)
    return err
}

func (cli *Client) post(ctx context.Context, path string, query url.Values, obj interface{}, headers map[string][]string) (serverResponse, error) {
    body, headers, err := encodeBody(obj, headers)
    if err != nil {
        return serverResponse{}, err
    }
    return cli.sendRequest(ctx, http.MethodPost, path, query, body, headers)
}

func (cli *Client) sendRequest(ctx context.Context, method, path string, query url.Values, body io.Reader, headers headers) (serverResponse, error) {
    req, err := cli.buildRequest(method, cli.getAPIPath(ctx, path, query), body, headers)
    if err != nil {
        return serverResponse{}, err
    }
    resp, err := cli.doRequest(ctx, req)
    if err != nil {
        return resp, errdefs.FromStatusCode(err, resp.statusCode)
    }
    err = cli.checkResponseErr(resp)
    return resp, errdefs.FromStatusCode(err, resp.statusCode)
}

func (cli *Client) doRequest(ctx context.Context, req *http.Request) (serverResponse, error) {
    serverResp := serverResponse{statusCode: -1, reqURL: req.URL}

    req = req.WithContext(ctx)
    resp, err := cli.client.Do(req)
...后面略
}

func ensureReaderClosed(response serverResponse) {
    if response.body != nil {
        // Drain up to 512 bytes and close the body to let the Transport reuse the connection
        io.CopyN(ioutil.Discard, response.body, 512)
        response.body.Close()
    }
}

分析源码
这个函数 ContainerStop(控制http请求超时, 容器id, 传递给docker server超时时间)

在看看服务端处理
目录github.com/docker/docker/api/server/router/container/container_routers.go

func (s *containerRouter) postContainersStop(ctx context.Context, w http.ResponseWriter, r *http.Request, vars map[string]string) error {
    if err := httputils.ParseForm(r); err != nil {
        return err
    }

    var seconds *int
    if tmpSeconds := r.Form.Get("t"); tmpSeconds != "" {
        valSeconds, err := strconv.Atoi(tmpSeconds)
        if err != nil {
            return err
        }
        seconds = &valSeconds
    }

    if err := s.backend.ContainerStop(vars["name"], seconds); err != nil {
        return err
    }
    w.WriteHeader(http.StatusNoContent)

    return nil
}

重点是ContainerStop这个function
目录github.com/docker/docker/daemon/stop.go

func (daemon *Daemon) ContainerStop(name string, timeout *int) error {
    container, err := daemon.GetContainer(name)
    if err != nil {
        return err
    }
    if !container.IsRunning() {
        return containerNotModifiedError{running: false}
    }
    if timeout == nil {
        stopTimeout := container.StopTimeout()
        timeout = &stopTimeout
    }
    if err := daemon.containerStop(container, *timeout); err != nil {
        return errdefs.System(errors.Wrapf(err, "cannot stop container: %s", name))
    }
    return nil
}

// containerStop sends a stop signal, waits, sends a kill signal.
func (daemon *Daemon) containerStop(container *containerpkg.Container, seconds int) error {
    if !container.IsRunning() {
        return nil
    }

    stopSignal := container.StopSignal()
    // 1. Send a stop signal
    if err := daemon.killPossiblyDeadProcess(container, stopSignal); err != nil {
        // While normally we might "return err" here we're not going to
        // because if we can't stop the container by this point then
        // it's probably because it's already stopped. Meaning, between
        // the time of the IsRunning() call above and now it stopped.
        // Also, since the err return will be environment specific we can't
        // look for any particular (common) error that would indicate
        // that the process is already dead vs something else going wrong.
        // So, instead we'll give it up to 2 more seconds to complete and if
        // by that time the container is still running, then the error
        // we got is probably valid and so we force kill it.
        ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
        defer cancel()

        if status := <-container.Wait(ctx, containerpkg.WaitConditionNotRunning); status.Err() != nil {
            logrus.Infof("Container failed to stop after sending signal %d to the process, force killing", stopSignal)
            if err := daemon.killPossiblyDeadProcess(container, 9); err != nil {
                return err
            }
        }
    }

    // 2. Wait for the process to exit on its own
    ctx := context.Background()
    if seconds >= 0 {
        var cancel context.CancelFunc
        ctx, cancel = context.WithTimeout(ctx, time.Duration(seconds)*time.Second)
        defer cancel()
    }

    if status := <-container.Wait(ctx, containerpkg.WaitConditionNotRunning); status.Err() != nil {
        logrus.Infof("Container %v failed to exit within %d seconds of signal %d - using the force", container.ID, seconds, stopSignal)
        // 3. If it doesn't, then send SIGKILL
        if err := daemon.Kill(container); err != nil {
            // Wait without a timeout, ignore result.
            <-container.Wait(context.Background(), containerpkg.WaitConditionNotRunning)
            logrus.Warn(err) // Don't return error because we only care that container is stopped, not what function stopped it
        }
    }

    daemon.LogContainerEvent(container, "stop")
    return nil
}

容器停止大概3步

  1. 发送停止信号
  2. 等待进程自己退出
  3. 如果不退出,就发SIGKILL信号 就是kill -9

其中container.Wait这个function的超时是sdk client传过来的,看下他的方法

func (s *State) Wait(ctx context.Context, condition WaitCondition) <-chan StateStatus {
    s.Lock()
    defer s.Unlock()

    if condition == WaitConditionNotRunning && !s.Running {
        // Buffer so we can put it in the channel now.
        resultC := make(chan StateStatus, 1)

        // Send the current status.
        resultC <- StateStatus{
            exitCode: s.ExitCode(),
            err:      s.Err(),
        }

        return resultC
    }

    // If we are waiting only for removal, the waitStop channel should
    // remain nil and block forever.
    var waitStop chan struct{}
    if condition < WaitConditionRemoved {
        waitStop = s.waitStop
    }

    // Always wait for removal, just in case the container gets removed
    // while it is still in a "created" state, in which case it is never
    // actually stopped.
    waitRemove := s.waitRemove

    resultC := make(chan StateStatus)

    go func() {
        select {
        case <-ctx.Done():
            // Context timeout or cancellation.
            resultC <- StateStatus{
                exitCode: -1,
                err:      ctx.Err(),
            }
            return
        case <-waitStop:
        case <-waitRemove:
        }

        s.Lock()
        result := StateStatus{
            exitCode: s.ExitCode(),
            err:      s.Err(),
        }
        s.Unlock()

        resultC <- result
    }()

    return resultC
}

其中里面有个等待停止的信号,如果没收到就用sdk client提交过来的超时时间
这里在看回调用docker sdk的地方

timeOut := time.Duration(20) * time.Second
ctx, cancel := context.WithTimeout(context.Background(), timeOut)
    defer cancel()
cli.ContainerStop(ctx, pubParam.Id, &timeOut)

如果改成这样就正常了

timeOut := time.Duration(20) * time.Second
//比timeOut小就可以
stopTimeout := time.Duration(10) * time.Second
ctx, cancel := context.WithTimeout(context.Background(), timeOut)
    defer cancel()
cli.ContainerStop(ctx, containerd, &stopTimeout)

结果

出现的原因就是请求设置的httpclient超时和传给docker server停止容器的超时时间是一样的,这样会导致docker 使用你的超时时间去等待容器停止,容器停止后给你回包,发现这个时候你已经主动发fin包要断tcp连接了,只要将传给docker server停止容器的超时时间小于你httpclient的超时时间,数据就能正常返回,sdk就能正常处理

添加新评论