Go并发模式：管道和显式取消

引言

Go并发原语使得构建流式数据管道，高效利用I/O和多核变得简单。这篇文章介绍了几个管道例子，重点指出在操作失败时的细微差别，并介绍了优雅处理失败的技术。

什么是管道？

Go没有正式的管道定义。管道只是众多并发程序的一类。一般的，一个管道就是一些列的由channel连接起来的阶段。每个阶段都有执行相同逻辑的goroutine。在每个阶段中，goroutine

从channel读取上游数据
在数据上执行一些操作，通常会产生新的数据
通过channel将数据发往下游

每个阶段都可以有任意个输入channel和输出channel，除了第一个和最有一个channel（只有输入channel或只有输出channel）。第一个步骤通常叫数据源或者生产者，最后一个叫做存储池或者消费者。

我们先从一个简单的管道例子来解释这些概念和技术，稍后我们会介绍一个更为复杂的例子。

数字的平方

假设管道有三个阶段。

第一步，gen函数,是一个将数字列表转换到一个channel中的函数。Gen函数启动了一个goroutine，将数字发送到channel，并在所有数字都发送完后关闭channel。

 
 
  
  func gen(nums ...int) <-chan int { 
  
      out := make(chan int) 
  
      go func() { 
  
          for _, n := range nums { 
  
              out <- n 
  
          } 
  
          close(out) 
  
      }() 
  
      return out 
  
  }

第二个阶段，sq，从上面的channel接收数字，并返回一个包含所有收到数字的平方的channel。在上游channel关闭后，这个阶段已经往下游发送完所有的结果，然后关闭输出channel：

  
 
   
  func sq(in <-chan int) <-chan int { 
   
      out := make(chan int) 
   
      go func() { 
   
          for n := range in { 
   
              out <- n * n 
   
          } 
   
          close(out) 
   
      }() 
   
      return out 
   
  }

main函数建立这个管道，并执行第一个阶段，从第二个阶段接收结果并逐个打印，直到channel被关闭。

 
 
  
  func main() { 
  
      // Set up the pipeline. 
  
      c := gen(2, 3) 
  
      out := sq(c) 
  
    
  
      // Consume the output. 
  
      fmt.Println(<-out) // 4 
  
      fmt.Println(<-out) // 9 
  
  }

因为sq对输入channel和输出channel拥有相同的类型，我们可以任意次的组合他们。我们也可以像其他阶段一样，将main函数重写成一个循环遍历。

 
 
  
  func main() { 
  
      // Set up the pipeline and consume the output. 
  
      for n := range sq(sq(gen(2, 3))) { 
  
          fmt.Println(n) // 16 then 81 
  
      } 
  
  }

扇出扇入（Fan-out, fan-in）

多个函数可以从同一个channel读取数据，直到这个channel关闭，这叫扇出。这是一种多个工作实例分布式地协作以并行利用CPU和I/O的方式。

一个函数可以从多个输入读取并处理数据，直到所有的输入channel都被关闭。这个函数会将所有输入channel导入一个单一的channel。这个单一的channel在所有输入channel都关闭后才会关闭。这叫做扇入。

我们可以设置我们的管道执行两个sq实例，每一个实例都从相同的输入channel读取数据。我们引入了一个新的函数，merge，来扇入结果:

 
 
  
  func main() { 
  
      in := gen(2, 3) 
  
    
  
      // Distribute the sq work across two goroutines that both read from in. 
  
      c1 := sq(in) 
  
      c2 := sq(in) 
  
    
  
      // Consume the merged output from c1 and c2. 
  
      for n := range merge(c1, c2) { 
  
          fmt.Println(n) // 4 then 9, or 9 then 4 
  
      } 
  
  }

merge函数为每一个输入channel启动一个goroutine，goroutine将数据拷贝到同一个输出channel。这样就将多个channel转换成一个channel。一旦所有的output goroutine启动起来，merge就启动另一个goroutine，在所有输入拷贝完毕后关闭输出channel。
向一个关闭了的channel发送数据会触发异常，所以在调用close之前确认所有的发送动作都执行完毕很重要。sync.WaitGroup类型为这种同步提供了一种简便的方法:

 
 
  
  func merge(cs ...<-chan int) <-chan int { 
  
      var wg sync.WaitGroup 
  
      out := make(chan int) 
  
    
  
      // Start an output goroutine for each input channel in cs.  output 
  
      // copies values from c to out until c is closed, then calls wg.Done. 
  
      output := func(c <-chan int) { 
  
          for n := range c { 
  
              out <- n 
  
          } 
  
          wg.Done() 
  
      } 
  
      wg.Add(len(cs)) 
  
      for _, c := range cs { 
  
          go output(c) 
  
      } 
  
    
  
      // Start a goroutine to close out once all the output goroutines are 
  
      // done.  This must start after the wg.Add call. 
  
      go func() { 
  
          wg.Wait() 
  
          close(out) 
  
      }() 
  
      return out 
  
  }

停止的艺术

我们所有的管道函数都遵循一种模式：

发送者在发送完毕时关闭其输出channel。
接收者持续从输入管道接收数据直到输入管道关闭。

这种模式使得每一个接收函数都能写成一个range循环，保证所有的goroutine在数据成功发送到下游后就关闭。

但是在真实的案例中，并不是所有的输入数据都需要被接收处理。有些时候是故意这么设计的：接收者可能只需要数据的子集就够了；或者更一般的，因为输入数据有错误而导致接收函数提早退出。上面任何一种情况下，接收者都不应该继续等待后续的数据到来，并且我们希望上游函数停止生成后续步骤已经不需要的数据。

在我们的管道例子中，如果一个阶段无法消费所有的输入数据，那些发送这些数据的goroutine就会一直阻塞下去：

 
 
  
      // Consume the first value from output. 
  
      out := merge(c1, c2) 
  
      fmt.Println(<-out) // 4 or 9 
  
      return
  
      // Since we didn't receive the second value from out, 
  
      // one of the output goroutines is hung attempting to send it. 
  
  }

这是一种资源泄漏：goroutine会占用内存和运行时资源。goroutine栈持有的堆引用会阻止GC回收资源。而且goroutine不能被垃圾回收，必须主动退出。

我们必须重新设计管道中的上游函数，在下游函数无法接收所有输入数据时退出。一种方法就是让输出channel拥有一定的缓存。缓存可以存储一定数量的数据。如果缓存空间足够，发送操作就会马上返回:

 
 
  
  c := make(chan int, 2) // buffer size 2 
  
  c <- 1  // succeeds immediately 
  
  c <- 2  // succeeds immediately 
  
  c <- 3  // blocks until another goroutine does <-c and receives 1

如果在channel创建时就知道需要发送数据的数量，带缓存的channel会简化代码。例如，我们可以重写gen函数，拷贝一系列的整数到一个带缓存的channel而不是创建一个新的goroutine：

 
 
  
  func gen(nums ...int) <-chan int { 
  
      out := make(chan int, len(nums)) 
  
      for _, n := range nums { 
  
          out <- n 
  
      } 
  
      close(out) 
  
      return out 
  
  }

反过来我们看管道中被阻塞的goroutine，我们可以考虑为merge函数返回的输出channel增加一个缓存：

 
 
  
  func merge(cs ...<-chan int) <-chan int { 
  
      var wg sync.WaitGroup 
  
      out := make(chan int, 1) // enough space for the unread inputs 
  
      // ... the rest is unchanged ...

虽然这样可以避免了程序中goroutine的阻塞，但这是很烂的代码。选择缓存大小为1取决于知道merge函数接收数字的数量和下游函数消费数字的数量。这是很不稳定的：如果我们向gen多发送了一个数据，或者下游函数少消费了数据，我们就又一次阻塞了goroutine。

然而，我们需要提供一种方式，下游函数可以通知上游发送者下游要停止接收数据。

#p#

显式取消

当main函数决定在没有从out接收所有的数据而要退出时，它需要通知上游的goroutine取消即将发送的数据。可以通过向一个叫做done的channel发送数据来实现。因为有两个潜在阻塞的goroutine，main函数会发送两个数据：

 
 
  
  func main() { 
  
      in := gen(2, 3) 
  
    
  
      // Distribute the sq work across two goroutines that both read from in. 
  
      c1 := sq(in) 
  
      c2 := sq(in) 
  
    
  
      // Consume the first value from output. 
  
      done := make(chan struct{}, 2) 
  
      out := merge(done, c1, c2) 
  
      fmt.Println(<-out) // 4 or 9 
  
    
  
      // Tell the remaining senders we're leaving. 
  
      done <- struct{}{} 
  
      done <- struct{}{} 
  
  }

对发送goroutine而言，需要将发送操作替换为一个select语句，要么out发生发送操作，要么从done接收数据。done的数据类型是空的struct，因为其值无关紧要：仅仅表示out需要取消发送操作。output 继续在输入channel循环执行，因此上游函数是不会阻塞的。（接下来我们会讨论如何让循环提早退出）

 
 
  
  func merge(done <-chan struct{}, cs ...<-chan int) <-chan int { 
  
      var wg sync.WaitGroup 
  
      out := make(chan int) 
  
    
  
      // Start an output goroutine for each input channel in cs.  output 
  
      // copies values from c to out until c is closed or it receives a value 
  
      // from done, then output calls wg.Done. 
  
      output := func(c <-chan int) { 
  
          for n := range c { 
  
              select { 
  
              case out <- n: 
  
              case <-done: 
  
              } 
  
          } 
  
          wg.Done() 
  
      } 
  
      // ... the rest is unchanged ...

这种方法有一个问题：每一个下游函数需要知道潜在可能阻塞的上游发送者的数量，以发送响应的信号让其提早退出。跟踪这些数量是无趣的而且很容易出错。

我们需要一种能够让未知或无界数量的goroutine都能够停止向下游发送数据的方法。在Go中，我们可以通过关闭一个channel实现。因为从一个关闭了的channel执行接收操作总能马上成功，并返回相应数据类型的零值。

这意味着main函数仅通过关闭done就能实现将所有的发送者解除阻塞。关闭操作是一个高效的对发送者的广播信号。我们扩展管道中所有的函数接受done作为一个参数，并通过defer来实现相应channel的关闭操作。因此，无论main函数在哪一行退出都会通知上游退出。

 
 
  
  func main() { 
  
      // Set up a done channel that's shared by the whole pipeline, 
  
      // and close that channel when this pipeline exits, as a signal 
  
      // for all the goroutines we started to exit. 
  
      done := make(chan struct{}) 
  
      defer close(done) 
  
    
  
      in := gen(done, 2, 3) 
  
    
  
      // Distribute the sq work across two goroutines that both read from in. 
  
      c1 := sq(done, in) 
  
      c2 := sq(done, in) 
  
    
  
      // Consume the first value from output. 
  
      out := merge(done, c1, c2) 
  
      fmt.Println(<-out) // 4 or 9 
  
    
  
      // done will be closed by the deferred call. 
  
  }

现在每一个管道函数在done被关闭后就可以马上返回了。merge函数中的output可以在接收管道的数据消费完之前返回，因为output函数知道上游发送者sq会在done关闭后停止产生数据。同时，output通过defer语句保证wq.Done会在所有退出路径上调用。

 
 
  
  func merge(done <-chan struct{}, cs ...<-chan int) <-chan int { 
  
      var wg sync.WaitGroup 
  
      out := make(chan int) 
  
    
  
      // Start an output goroutine for each input channel in cs.  output 
  
      // copies values from c to out until c or done is closed, then calls 
  
      // wg.Done. 
  
      output := func(c <-chan int) { 
  
          defer wg.Done() 
  
          for n := range c { 
  
              select { 
  
              case out <- n: 
  
              case <-done: 
  
                  return
  
              } 
  
          } 
  
      } 
  
      // ... the rest is unchanged ...

类似的，sq也可以在done关闭后马上返回。sq通过defer语句使得任何退出路径都能关闭其输出channel out。

 
 
  
  func sq(done <-chan struct{}, in <-chan int) <-chan int { 
  
      out := make(chan int) 
  
      go func() { 
  
          defer close(out) 
  
          for n := range in { 
  
              select { 
  
              case out <- n * n: 
  
              case <-done: 
  
                  return
  
              } 
  
          } 
  
      }() 
  
      return out 
  
  }

管道构建的指导思想如下：

每一个阶段在所有发送操作完成后关闭输出channel。
每一个阶段持续从输入channel接收数据直到输入channel被关闭或者生产者被解除阻塞（译者：生产者退出）。

管道解除生产者阻塞有两种方法：要么保证有足够的缓存空间存储将要被生产的数据，要么显式的通知生产者消费者要取消接收数据。

树形摘要

让我们来看一个更为实际的管道。

MD5是一个信息摘要算法，对于文件校验非常有用。命令行工具md5sum很有用，可以打印一系列文件的摘要值。

 
 
  
  % md5sum *.go 
  
  d47c2bbc28298ca9befdfbc5d3aa4e65  bounded.go 
  
  ee869afd31f83cbb2d10ee81b2b831dc  parallel.go 
  
  b88175e65fdcbc01ac08aaf1fd9b5e96  serial.go

我们的例子程序和md5sum类似，但是接受一个单一的文件夹作为参数，打印该文件夹下每一个普通文件的摘要值，并按路径名称排序。

 
 
  
  % go run serial.go . 
  
  d47c2bbc28298ca9befdfbc5d3aa4e65  bounded.go 
  
  ee869afd31f83cbb2d10ee81b2b831dc  parallel.go 
  
  b88175e65fdcbc01ac08aaf1fd9b5e96  serial.go

我们程序的main函数调用一个工具函数MD5ALL，该函数返回一个从路径名称到摘要值的哈希表，然后排序并输出结果：

 
 
  
  func main() { 
  
      // Calculate the MD5 sum of all files under the specified directory, 
  
      // then print the results sorted by path name. 
  
      m, err := MD5All(os.Args[1]) 
  
      if err != nil { 
  
          fmt.Println(err) 
  
          return
  
      } 
  
      var paths []string 
  
      for path := range m { 
  
          paths = append(paths, path) 
  
      } 
  
      sort.Strings(paths) 
  
      for _, path := range paths { 
  
          fmt.Printf("%x  %s\n", m[path], path) 
  
      } 
  
  }

MD5ALL是我们讨论的核心。在 serial.go中，没有采用任何并发，仅仅遍历文件夹，读取文件并求出摘要值。

 
 
  
  // MD5All reads all the files in the file tree rooted at root and returns a map 
  
  // from file path to the MD5 sum of the file's contents.  If the directory walk 
  
  // fails or any read operation fails, MD5All returns an error. 
  
  func MD5All(root string) (map[string][md5.Size]byte, error) { 
  
      m := make(map[string][md5.Size]byte) 
  
      err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error { 
  
          if err != nil { 
  
              return err 
  
          } 
  
          if info.IsDir() { 
  
              return nil 
  
          } 
  
          data, err := ioutil.ReadFile(path) 
  
          if err != nil { 
  
              return err 
  
          } 
  
          m[path] = md5.Sum(data) 
  
          return nil 
  
      }) 
  
      if err != nil { 
  
          return nil, err 
  
      } 
  
      return m, nil 
  
  }

#p#

并行摘要求值

在parallel.go中，我们将MD5ALL分成两阶段的管道。第一个阶段，sumFiles，遍历文件夹，每个文件一个goroutine进行求摘要值，然后将结果发送一个数据类型为result的channel中：

 
 
  
  type result struct { 
  
      path string 
  
      sum  [md5.Size]byte
  
      err  error 
  
  }

sumFiles 返回两个channel：一个用于生成结果，一个用于filepath.Walk返回错误。Walk函数为每一个普通文件启动一个goroutine，然后检查done，如果done被关闭，walk马上就会退出。

 
 
  
  func sumFiles(done <-chan struct{}, root string) (<-chan result, <-chan error) { 
  
      // For each regular file, start a goroutine that sums the file and sends 
  
      // the result on c.  Send the result of the walk on errc. 
  
      c := make(chan result) 
  
      errc := make(chan error, 1) 
  
      go func() { 
  
          var wg sync.WaitGroup 
  
          err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error { 
  
              if err != nil { 
  
                  return err 
  
              } 
  
              if info.IsDir() { 
  
                  return nil 
  
              } 
  
              wg.Add(1) 
  
              go func() { 
  
                  data, err := ioutil.ReadFile(path) 
  
                  select { 
  
                  case c <- result{path, md5.Sum(data), err}: 
  
                  case <-done: 
  
                  } 
  
                  wg.Done() 
  
              }() 
  
              // Abort the walk if done is closed. 
  
              select { 
  
              case <-done: 
  
                  return errors.New("walk canceled") 
  
              default: 
  
                  return nil 
  
              } 
  
          }) 
  
          // Walk has returned, so all calls to wg.Add are done.  Start a 
  
          // goroutine to close c once all the sends are done. 
  
          go func() { 
  
              wg.Wait() 
  
              close(c) 
  
          }() 
  
          // No select needed here, since errc is buffered. 
  
          errc <- err 
  
      }() 
  
      return c, errc 
  
  }

MD5All 从c中接收摘要值。MD5All 在遇到错误时提前退出，通过defer关闭done。

 
 
  
  func MD5All(root string) (map[string][md5.Size]byte, error) { 
  
      // MD5All closes the done channel when it returns; it may do so before 
  
      // receiving all the values from c and errc. 
  
      done := make(chan struct{}) 
  
      defer close(done) 
  
    
  
      c, errc := sumFiles(done, root) 
  
    
  
      m := make(map[string][md5.Size]byte) 
  
      for r := range c { 
  
          if r.err != nil { 
  
              return nil, r.err 
  
          } 
  
          m[r.path] = r.sum 
  
      } 
  
      if err := <-errc; err != nil { 
  
          return nil, err 
  
      } 
  
      return m, nil 
  
  }

有界并行

parallel.go中实现的MD5ALL，对每一个文件启动了一个goroutine。在一个包含大量大文件的文件夹中，这会导致超过机器可用内存的内存分配。（译者注：即发生OOM）

我们可以通过限制读取文件的并发度来避免这种情况发生。在bounded.go中，我们通过创建一定数量的goroutine读取文件。现在我们的管道现在有三个阶段：遍历文件夹，读取文件并计算摘要值，收集摘要值。

第一个阶段，walkFiles，输出文件夹中普通文件的文件路径：

 
 
  
  func walkFiles(done <-chan struct{}, root string) (<-chan string, <-chan error) { 
  
      paths := make(chan string) 
  
      errc := make(chan error, 1) 
  
      go func() { 
  
          // Close the paths channel after Walk returns. 
  
          defer close(paths) 
  
          // No select needed for this send, since errc is buffered. 
  
          errc <- filepath.Walk(root, func(path string, info os.FileInfo, err error) error { 
  
              if err != nil { 
  
                  return err 
  
              } 
  
              if info.IsDir() { 
  
                  return nil 
  
              } 
  
              select { 
  
              case paths <- path: 
  
              case <-done: 
  
                  return errors.New("walk canceled") 
  
              } 
  
              return nil 
  
          }) 
  
      }() 
  
      return paths, errc 
  
  }

中间的阶段启动一定数量的digester goroutine，从paths接收文件名称，并向c发送result结构:

 
 
  
  func digester(done <-chan struct{}, paths <-chan string, c chan<- result) { 
  
      for path := range paths { 
  
          data, err := ioutil.ReadFile(path) 
  
          select { 
  
          case c <- result{path, md5.Sum(data), err}: 
  
          case <-done: 
  
              return
  
          } 
  
      } 
  
  }

和前一个例子不同，digester并不关闭其输出channel，因为输出channel是共享的，多个goroutine会向同一个channel发送数据。MD5All 会在所有的digesters 结束后关闭响应的channel。

 
 
  
  // Start a fixed number of goroutines to read and digest files. 
  
  c := make(chan result) 
  
  var wg sync.WaitGroup 
  
  const numDigesters = 20
  
  wg.Add(numDigesters) 
  
  for i := 0; i < numDigesters; i++ { 
  
      go func() { 
  
          digester(done, paths, c) 
  
          wg.Done() 
  
      }() 
  
  } 
  
  go func() { 
  
      wg.Wait() 
  
      close(c) 
  
  }()

我们也可以让每一个digester创建并返回自己的输出channel，但如果这样的话，我们需要额外的goroutine来扇入这些结果。
最后一个阶段从c中接收所有的result数据，并从errc中检查错误。这种检查不能在之前的阶段做，因为在这之前，walkFiles 可能被阻塞不能往下游发送数据：

 
 
  
      m := make(map[string][md5.Size]byte) 
  
      for r := range c { 
  
          if r.err != nil { 
  
              return nil, r.err 
  
          } 
  
          m[r.path] = r.sum 
  
      } 
  
      // Check whether the Walk failed. 
  
      if err := <-errc; err != nil { 
  
          return nil, err 
  
      } 
  
      return m, nil 
  
  }

结论

这篇文章介绍了如果用Go构建流式数据管道的技术。在这样的管道中处理错误有点取巧，因为管道中每一个阶段可能被阻塞不能往下游发送数据，下游阶段可能已经不关心输入数据。我们展示了关闭channel如何向所有管道启动的goroutine广播一个done信号，并且定义了正确构建管道的指导思想。

深入阅读：

• Go并发模式（视频）展示了Go并发原语的基本概念和几个实现的方法

• 高级Go并发模式（视频）包含几个更为复杂的Go并发原语的使用，尤其是select

• Douglas McIlroy的Squinting at Power Series论文展示了类似Go的并发模式如何为复杂的计算提供优雅的支持。

当前标题：Go并发模式：管道和显式取消
浏览路径：http://www.shufengxianlan.com/qtweb/news33/341683.html

网站建设、网络推广公司-创新互联，是专注品牌与效果的网站制作，网络营销seo公司；服务项目有等

声明：本网站发布的内容（图片、视频和文字）以用户投稿、用户转载内容为主，如果涉及侵权请尽快告知，我们将会在第一时间删除。文章观点不代表本网站立场，如需处理请联系客服。电话：028-86922220；邮箱：631063699@qq.com。内容未经允许不得转载，或转载时需注明来源：创新互联

猜你还喜欢下面的内容