Explorations in Go: A dupe checker in Go and Ruby

Posted on by in Everything Else

I've recently started exploring the new(ish) programming language "Go". Go is intended to be a systems programming language and offers speed and low-level API along with some sweet features and the beginnings of a great standard library.

At Carbon Five we do most of our work in Ruby, JavaScript, Objective-C, and lately, node.js. I've really been enjoying Objective-C but realistically half the value is in the standard library, which is not public, which precludes its use in any kind of server environment to which we're likely to deploy (i.e. non-Mac). I also spent some time this year contributing C code to an open source project and remembering why I don't program in C, given the choice.

I'm exploring Go as a possible solution for times when I want to get close-to-the-metal and really control what's going on, but without having to reinvent the wheel. Example: having to write your own collection frameworks. Blehft.

I wrote a dupe checker in Ruby for a project recently and thought I'd write a Go port as an experiment. (In case you're curious about the motivation: tons of files get moved from a tree to a flat namespace, leading to collisions -- and confusion.)

I've written walkthroughs of the Ruby and Go versions below - please check it out.

The code

Source code is here: https://github.com/joncooper/go/tree/master/dupe

Argument parsing

Ruby

Go

Longer than the Ruby idiom, for sure, but still a breeze compared to, say, doing in in straight C.

The 'flag' package that I've used here gets us some niceties, too, such as the ability to print a help message:

Filesystem traversal

Ruby

Reading Ruby code is a pleasure. Except for one mildly gnarly (yet transparent to Rubyists) bit, this is pretty darn self-documenting.

Go

Even you ignore the fact that Go syntax looks a bit wonky compared to mainstream languages, this is still a bit confusing. It is, however, quite concise compared to a C version, again because of the kick-ass Go standard library.

I use Go's filepath.Walk in place of Ruby's Find.find. The function signature looks like this:

To use this function, we need something that implements the filepath.Visitor interface:

This interface defines two functions, which, if implemented, mean that the implementor implements the Visitor interface. Say that ten times fast. Or maybe don't. The point is that in Go, you don't make a claim that any given type implements an interface. You just do so, and then the (static) type inference system can perform safety (and sanity) checks at compile-time.

Aside aside, the func VisitDir simply returns a boolean letting the caller know whether or not to traverse into the directory in question, and the func VisitFile does something with a file found during the traversal.

http://golang.org/pkg/path/filepath/#Walk

MD5 hashing files

Ruby

Easy.

Go

Pretty nice for a language aimed at systems programming.

One could of course do this in C by linking an external library. (You could also just exec 'md5sum'). The point is that MD5 is part of the Go standard library, so you don't have to.

Take note also of the optional precondition to the if conditional. This helps clarify the typical POSIX C monkey business like:

(Note: this code is unrelated to the Go MD5OfFile(), it's just an example of how error handling is done in C.)

Printing results

Ruby

Two aspects of Ruby that I really like are on display here: string interpolation and functional operations on collections.

Interpolating variables or code into strings with "#{}" is an example of the "make common tasks easy" mentality that makes programming in Ruby so nice.

I am of two minds regarding the common use of functional operations on collections in Ruby, for example:

For me, this code is extremely legible, and expresses my intent clearly. "Choose only the elements of this collection where the fullpaths array is >= length 2."

On the other hand, it is clearly less performant than directly iterating over the collection, and if you do a lot of this kind of thing on big collections, performance starts to drag.

Go

I don't love the way that this looks compared to Ruby, but it is in the same ballpark in terms of expressivity. And it is vastly less horrible than doing this in C, even if you had a good implementation of a map and an iterator to work with, which you probably wouldn't.

Three quick notes: 1. the 'range' operator is an iterator; 2. assigning to _ means "throw this away"; and 3. applying the range operator to an array returns (index, value).

Benchmarking

I've applied a super craptastic benchmarking technique here: I just run each version three times.

On my home directory (49k-ish files):

Ruby

Go

Conclusion

I'm definitely going to be writing more Go in the future. The syntax has grown on me and I like the standard library and language features quite a bit. I'm interested to do a quick port of some UNIX C stuff in order to see what an improvement it makes in terms of intelligibility and maintainability; that's actually a much more fair comparison than a Ruby app.


Feedback

  Comments: 24

  1. Jesper Louis Andersen


    Try running the two things with /usr/bin/time -v and see how much of that run time is IO.


    • A profile of the Go code shows that most of the time is spent on IO. The majority of the time is in syscall.Syscall and scanblock:
      (pprof) top5
      Total: 53 samples
      18 34.0% 34.0% 18 34.0% syscall.Syscall
      8 15.1% 49.1% 11 20.8% scanblock

      Running one command after the other makes it even more obvious. The first run is slow, the second returns instantly because everything is cached.

      (the command ‘itime’ is just /usr/bin/time -f ‘%Uu %Ss %er %MkB %I inputs %C’)

      Clear your cache:
      $ sudo -i ‘echo 3 > /proc/sys/vm/drop_caches’
      $ itime ./dupe ~/Downloads/
      Total duped files found: 10
      0.08u 0.36s 43.69r 10304kB 124832 inputs ./dupe /home/bill/Downloads/
      $ itime ruby dupe.rb ~/Downloads/
      Total duped files found: 10
      0.06u 0.05s 0.49r 14240kB 3496 inputs ruby dupe.rb /home/bill/Downloads/

      Clear the cache again to prove to yourself that it works in reverse:
      $ sudo -i ‘echo 3 > /proc/sys/vm/drop_caches’
      $ itime ruby dupe.rb ~/Downloads/
      Total duped files found: 10
      0.16u 0.33s 43.30r 14224kB 126144 inputs ruby dupe.rb /home/bill/Downloads/
      $ itime ./dupe ~/Downloads/
      Total duped files found: 10
      0.04u 0.03s 0.12r 9904kB 2320 inputs ./dupe /home/bill/Downloads/

      The point of this article wasn’t to see whether Go or Ruby runs faster. Jon was trying to show how easily a person can whip up a quick program in either language. For the brevity, clarity, and ease of use that Ruby gets you, I see no reason to go with anything else. IO-intensive tasks demand a programming language that’s comfortable and easy. Clearly, Go is not that language.


      • Thanks, Bill, appreciate the detail.

        My point was just as you say: to explore how hard it is to do something that I’d normally do in Ruby in Go instead. I didn’t start with a performance benchmark in mind. That’s why I referred to it as a “craptastic benchmark”. :)


  2. Have you looked at the D programming language?
    I think there’s a big intersection between Go and D, but I suspect D is a more robust language that offers better defaults and more flexibility.

    I’ve only done the most bare amount of programming in either of the two languages, though; so don’t take my word for it!


    • By “robust” you mean bloated with every feature and paradigm ever conceived? While the various D compilers only implement various buggy subsets of the language.

      You have a very strange definition of “robust”

  3. Jacek Masiulaniec


    Many type declarations are redundant, the assignment operator can take care of them.

    These lines could be shortened:

    var md5sum hash.Hash = md5.New()
    md5sum.Write(contents)
    return md5sum.Sum()

    to:

    return md5.New().Write(contents).Sum()

    Also, loading entire file to memory can be wasteful; avoid ioutil.ReadFile for large files.

  4. Jacek Masiulaniec


    (I haven’t tested the code, but you get the general idea.)


  5. Your go program does not build with the latest version of go. Here is a fixed one:

    package main

    import (
    “crypto/md5″
    “flag”
    “fmt”
    “io/ioutil”
    “log”
    “os”
    “path”
    “path/filepath”
    )

    var verbose = flag.Bool(“verbose”, false, “Print the list of duplicate files.”)

    func MD5OfFile(path string) []byte {
    contents, err := ioutil.ReadFile(path)
    if err != nil {
    log.Fatal(err)
    }
    s := md5.New()
    s.Write(contents)
    return s.Sum()
    }

    func PrintResults(pathsByName map[string][]string) {
    dupes := 0
    for name, paths := range pathsByName {
    if len(paths) 0 {
    rootDir = flag.Arg(0)
    }
    paths := FindDupes(rootDir)
    PrintResults(paths)
    }


  6. You don’t need to put parens around the condition of “if”s in Go. If you run your Go code through gofmt (“gofmt -w -s filename” fixes in place and simplifies where it can), it’ll take care of that and all other formatting for you.

    Also, I’d write MD5OfFile like this:

    func MD5OfFile(fullpath string) []byte {
    if f, e := os.Open(fullpath); e == nil {
    defer f.Close()
    hash := md5.New()
    if _, e := io.Copy(hash, f); e == nil {
    return hash.Sum()
    }
    }
    return nil
    }

    Longer, and possibly slower, but works nicely on files of any size.

    Also possibly worth noting: the filepath.Walk api has changed to be a little more sane, though it isn’t part of the release version of Go yet.

    And, it’s a matter of taste, but if you’re going to use flag.Arg(0), probably best to check flag.NArg(). However, it may be nicer overall to do:

    if args := flag.Args(); len(args) > 0 {
    rootDir = args[0]
    }


    • Also, replace uses of “println” with fmt.Println. println is more of a bootstrapping function; it isn’t quite as smart, and it may write to stderr.


      • Thanks for the feedback; I will update the code in the repository, although I’m not sure I’ll update the blog post.


  7. Am I running the wrong version of Ruby?

    $ ruby dupe.rb
    dupe.rb:21:in `print_results': undefined method `select!’ for # (NoMethodError)
    from dupe.rb:38


    • Apparently, I am. select! doesn’t show up in Ruby until version 1.9.2.

      To get the code to work (if anyone has an older version of Ruby), change the offending line to:
      @full_paths_by_filename.delete_if { |filename, fullpaths| fullpaths.length < 2 }


  8. Nice intro Jon. I’ve been wanting to check out this language. I’m currently making my way through http://go-tour.appspot.com/


  9. Maybe a better way to cumpute the hash of a file in Go using a “pipe” :

    hash := md5.New()
    file, _ := os.Open(path) // check the error if you want
    io.Copy(hash, file) // Just pipe the file into the hasher !
    return hash.Sum() //et voilĂ 


  10. I’ve spent a few days learning about Go as well. It’s funny in the way that I had to take a second look at Ruby to like it. And with Go I also disregarded at first. But now I think it has gained momentum and made me to appreciate it more. I’ve watched videos, read some of the documentation and followed blogs and so on covering Go. The Go Tour has made me to get wet with the exercises.

    I can’t say I’m ready to jump ship from Ruby. Javascript just doesn’t feel as nice even with Node. The only nice thing is the excellent V8 engine. :-) Go is cool but we need the “Ruby batteries” included in it so we can program prototyping in it with faster iterations.


  11. I may have to include it in my upcoming ISP, a comparison of algorithms implemented in a selection of languages to see how language choice should influence algorithm implementation. One example is don’t recurse in Python (were it me I’d say don’t use Python at all…). I have one language I like (C), several I don’t like (Python, Perl, PHP) and one I’m on the fence about so far (Ruby). Any guess how Go will compare for speed for algorithms or operations on structures one might find in say Cormen, etc?


  12. I ported your program to Factor, which I found to be pretty clean (and faster than both the Go and Ruby examples).

    http://re-factor.blogspot.com/2012/01/duplicate-files.html


  13. are those code links broken perhaps? or is it missing some inlines?

Your feedback