情况
项目中有地方使用了docker sdk,需要远程发命令去到agent上,agent接入了docker sdk,agent收到对应的下发停止命令,然后执行停止容器
但是无论如何怎么执行,都会执行停止失败,但是最后去宿主机上看,都是执行成功的,每次都是返回这样的错误
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
但是执行create start都是正常的,就是restart和stop都是报上面的错误,
排查过程
但是有个例外就是docker server是这个版本的时候是正常的
Server: Docker Engine - Community
Engine:
Version: 18.09.9
API version: 1.39 (minimum version 1.12)
Go version: go1.11.13
Git commit: 039a7df
Built: Thu Feb 28 06:40:58 2019
OS/Arch: linux/amd64
Experimental: false
不正常的是下面的
Server: Docker Engine - Community
Engine:
Version: 19.03.11
API version: 1.40 (minimum version 1.12)
Go version: go1.13.10
Git commit: 42e35e61f3
Built: Mon Jun 1 09:12:26 2020
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.2.13
GitCommit: 7ad184331fa3e55e52b890ea95e65ba581ae3429
runc:
Version: 1.0.0-rc10
GitCommit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
docker-init:
Version: 0.18.0
GitCommit: fec3683
然后让运维降级
Server: Docker Engine - Community
Engine:
Version: 18.09.9
API version: 1.39 (minimum version 1.12)
Go version: go1.11.13
Git commit: 039a7df
Built: Wed Sep 4 16:22:32 2019
OS/Arch: linux/amd64
Experimental: false
发现还是出现上面的情况
然后抓包
POST /v1.39/containers/b1f8bb98a25d/stop?t=20 HTTP/1.1
Host: 172.16.5.245:2375
User-Agent: Go-http-client/1.1
Content-Length: 0
Content-Type: text/plain
有时候还没数据返回
HTTP/1.1 204 No Content
Api-Version: 1.39
Docker-Experimental: false
Ostype: linux
Server: Docker/18.09.9 (linux)
Date: Wed, 01 Jul 2020 04:35:35 GMT
注意图片倒数第5行,ip 192.168.1.133是我程序ip 172.16.5.245是docker server,主动发出fin包,说明这个tcp连接是我程序端主动断开,然后docker server收到fin包后,回了http response.时长是20秒
再观察直接使用docker stop 容器
POST /v1.39/containers/b1f8bb98a25d/stop HTTP/1.1
Host: 172.16.5.245:2375
User-Agent: Docker-Client/19.03.11 (linux)
Content-Length: 0
Content-Type: text/plain
HTTP/1.1 204 No Content
Api-Version: 1.39
Docker-Experimental: false
Ostype: linux
Server: Docker/18.09.9 (linux)
Date: Wed, 01 Jul 2020 04:35:35 GMT
docker stop是首先先连接 然后发送pin请求,然后再发送post stop,最后还是172.16.5.245docker server主动发出fin包,时长是10秒
开始是怀疑docker sdk有问题,看了一下stop的源码
目录github.com/docker/docker/client/request.go
func (cli *Client) ContainerStop(ctx context.Context, containerID string, timeout *time.Duration) error {
query := url.Values{}
if timeout != nil {
query.Set("t", timetypes.DurationToSecondsString(*timeout))
}
resp, err := cli.post(ctx, "/containers/"+containerID+"/stop", query, nil, nil)
ensureReaderClosed(resp)
return err
}
func (cli *Client) post(ctx context.Context, path string, query url.Values, obj interface{}, headers map[string][]string) (serverResponse, error) {
body, headers, err := encodeBody(obj, headers)
if err != nil {
return serverResponse{}, err
}
return cli.sendRequest(ctx, http.MethodPost, path, query, body, headers)
}
func (cli *Client) sendRequest(ctx context.Context, method, path string, query url.Values, body io.Reader, headers headers) (serverResponse, error) {
req, err := cli.buildRequest(method, cli.getAPIPath(ctx, path, query), body, headers)
if err != nil {
return serverResponse{}, err
}
resp, err := cli.doRequest(ctx, req)
if err != nil {
return resp, errdefs.FromStatusCode(err, resp.statusCode)
}
err = cli.checkResponseErr(resp)
return resp, errdefs.FromStatusCode(err, resp.statusCode)
}
func (cli *Client) doRequest(ctx context.Context, req *http.Request) (serverResponse, error) {
serverResp := serverResponse{statusCode: -1, reqURL: req.URL}
req = req.WithContext(ctx)
resp, err := cli.client.Do(req)
...后面略
}
func ensureReaderClosed(response serverResponse) {
if response.body != nil {
// Drain up to 512 bytes and close the body to let the Transport reuse the connection
io.CopyN(ioutil.Discard, response.body, 512)
response.body.Close()
}
}
分析源码
这个函数 ContainerStop(控制http请求超时, 容器id, 传递给docker server超时时间)
在看看服务端处理
目录github.com/docker/docker/api/server/router/container/container_routers.go
func (s *containerRouter) postContainersStop(ctx context.Context, w http.ResponseWriter, r *http.Request, vars map[string]string) error {
if err := httputils.ParseForm(r); err != nil {
return err
}
var seconds *int
if tmpSeconds := r.Form.Get("t"); tmpSeconds != "" {
valSeconds, err := strconv.Atoi(tmpSeconds)
if err != nil {
return err
}
seconds = &valSeconds
}
if err := s.backend.ContainerStop(vars["name"], seconds); err != nil {
return err
}
w.WriteHeader(http.StatusNoContent)
return nil
}
重点是ContainerStop这个function
目录github.com/docker/docker/daemon/stop.go
func (daemon *Daemon) ContainerStop(name string, timeout *int) error {
container, err := daemon.GetContainer(name)
if err != nil {
return err
}
if !container.IsRunning() {
return containerNotModifiedError{running: false}
}
if timeout == nil {
stopTimeout := container.StopTimeout()
timeout = &stopTimeout
}
if err := daemon.containerStop(container, *timeout); err != nil {
return errdefs.System(errors.Wrapf(err, "cannot stop container: %s", name))
}
return nil
}
// containerStop sends a stop signal, waits, sends a kill signal.
func (daemon *Daemon) containerStop(container *containerpkg.Container, seconds int) error {
if !container.IsRunning() {
return nil
}
stopSignal := container.StopSignal()
// 1. Send a stop signal
if err := daemon.killPossiblyDeadProcess(container, stopSignal); err != nil {
// While normally we might "return err" here we're not going to
// because if we can't stop the container by this point then
// it's probably because it's already stopped. Meaning, between
// the time of the IsRunning() call above and now it stopped.
// Also, since the err return will be environment specific we can't
// look for any particular (common) error that would indicate
// that the process is already dead vs something else going wrong.
// So, instead we'll give it up to 2 more seconds to complete and if
// by that time the container is still running, then the error
// we got is probably valid and so we force kill it.
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
defer cancel()
if status := <-container.Wait(ctx, containerpkg.WaitConditionNotRunning); status.Err() != nil {
logrus.Infof("Container failed to stop after sending signal %d to the process, force killing", stopSignal)
if err := daemon.killPossiblyDeadProcess(container, 9); err != nil {
return err
}
}
}
// 2. Wait for the process to exit on its own
ctx := context.Background()
if seconds >= 0 {
var cancel context.CancelFunc
ctx, cancel = context.WithTimeout(ctx, time.Duration(seconds)*time.Second)
defer cancel()
}
if status := <-container.Wait(ctx, containerpkg.WaitConditionNotRunning); status.Err() != nil {
logrus.Infof("Container %v failed to exit within %d seconds of signal %d - using the force", container.ID, seconds, stopSignal)
// 3. If it doesn't, then send SIGKILL
if err := daemon.Kill(container); err != nil {
// Wait without a timeout, ignore result.
<-container.Wait(context.Background(), containerpkg.WaitConditionNotRunning)
logrus.Warn(err) // Don't return error because we only care that container is stopped, not what function stopped it
}
}
daemon.LogContainerEvent(container, "stop")
return nil
}
容器停止大概3步
- 发送停止信号
- 等待进程自己退出
- 如果不退出,就发SIGKILL信号 就是kill -9
其中container.Wait这个function的超时是sdk client传过来的,看下他的方法
func (s *State) Wait(ctx context.Context, condition WaitCondition) <-chan StateStatus {
s.Lock()
defer s.Unlock()
if condition == WaitConditionNotRunning && !s.Running {
// Buffer so we can put it in the channel now.
resultC := make(chan StateStatus, 1)
// Send the current status.
resultC <- StateStatus{
exitCode: s.ExitCode(),
err: s.Err(),
}
return resultC
}
// If we are waiting only for removal, the waitStop channel should
// remain nil and block forever.
var waitStop chan struct{}
if condition < WaitConditionRemoved {
waitStop = s.waitStop
}
// Always wait for removal, just in case the container gets removed
// while it is still in a "created" state, in which case it is never
// actually stopped.
waitRemove := s.waitRemove
resultC := make(chan StateStatus)
go func() {
select {
case <-ctx.Done():
// Context timeout or cancellation.
resultC <- StateStatus{
exitCode: -1,
err: ctx.Err(),
}
return
case <-waitStop:
case <-waitRemove:
}
s.Lock()
result := StateStatus{
exitCode: s.ExitCode(),
err: s.Err(),
}
s.Unlock()
resultC <- result
}()
return resultC
}
其中里面有个等待停止的信号,如果没收到就用sdk client提交过来的超时时间
这里在看回调用docker sdk的地方
timeOut := time.Duration(20) * time.Second
ctx, cancel := context.WithTimeout(context.Background(), timeOut)
defer cancel()
cli.ContainerStop(ctx, pubParam.Id, &timeOut)
如果改成这样就正常了
timeOut := time.Duration(20) * time.Second
//比timeOut小就可以
stopTimeout := time.Duration(10) * time.Second
ctx, cancel := context.WithTimeout(context.Background(), timeOut)
defer cancel()
cli.ContainerStop(ctx, containerd, &stopTimeout)
结果
出现的原因就是请求设置的httpclient超时和传给docker server停止容器的超时时间是一样的,这样会导致docker 使用你的超时时间去等待容器停止,容器停止后给你回包,发现这个时候你已经主动发fin包要断tcp连接了,只要将传给docker server停止容器的超时时间小于你httpclient的超时时间,数据就能正常返回,sdk就能正常处理