Lessons from Building Inside a Monorepo

Published January 6, 2022

For the last seven years, all my code has lived in a monorepo. This article documents some lessons I've learned building software that way. The emphasis is on build tools, not on the cost-benefit tradeoffs of monorepos in general.

Workflow

My monorepo has a single, monolithic Makefile at its root that describes all the repository's outputs, the scripts that produce them, and the inputs they need. (It's not literally a Makefile, but for the sake of this article we can pretend.)

A program (specific to this repository) called gen-make produces the Makefile. I run gen-make manually whenever I add or remove files from the tree, or change dependencies between packages. It walks the repository, locates files of interest, analyzes dependencies between projects, and produces a Makefile describing what it found. gen-make takes 2 seconds to run.

When I change any file in the repository, I run make (let's pretend) to rebuild the entire repository. A typical no-op build takes about 300 ms. That's the overhead for checking which of 45,000 files have changed (including file checksums if necessary). Beyond that overhead, a build takes however long go build or clang would take if I had run them manually; usually less than 1 second.

I love that I can make a small change somewhere and automatically rebuild everything it touched. It's especially helpful when updating libraries. If my change breaks a consumer, I know it right away.

I never build individual outputs. I always build the entire repository. This wouldn't be possible if I had to rely on go build or clang to check dependencies. The monolithic Makefile has a global view, letting it skip huge chunks of the build with just a few stat(2)[1] calls. It's amazing how much duplicate effort is performed when running two builds independently.
A global view avoids most of that.

I also love how this build process works gracefully for every output I care about: executables, libraries, static HTML files, man pages, etc. A few changes to gen-make define a new output format.

Finally, file checksums (not part of make(1)[2]) are vital. I can run scripts, like date(1)[3] or git fetch, and only perform builds if the output changes, not the timestamp.

Problem: configuration is too global

Unfortunately, gen-make has too much responsibility. When I first created gen-make, every project was the same: run go build to produce a single binary. That's not the case anymore. Just within the Go ecosystem, some subprojects need go generate[4], some want go run[5], some produce multiple executables, and some output JavaScript via gopherjs[6]. Of course, there are also C projects, OCaml projects, static site generators, scripts that check the Internet for updated data files, etc. This is how build systems grow into unwieldy monsters.

No matter how much effort I put into simplifying gen-make, its domain is inherently too complex, so its code remains complex. Configuration for one project sometimes interferes with configuration for another project because gen-make acts like shared, mutable state for them both. That leads to exponential complexity.

Wishlist: separate configuration for each subproject. I still want the benefits of a global Makefile, but each subproject should define its own needs. Maybe each subproject has its own gen-make script whose output is merged into the main Makefile. Presumably subprojects with a standard layout would call out to gen-make-go or gen-make-c from their local gen-make. This layout would avoid interference between Go rules and C rules, for example.
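
The wishlist above can be sketched roughly as follows. This is a minimal illustration, not how gen-make actually works: the Rule type, the go_rules and c_rules helpers (standing in for gen-make-go and gen-make-c), the bin/ output directory, and the main.go / main.c file names are all assumptions made up for the example. Each subproject contributes only its own rules, and the merge step refuses to let two subprojects claim the same output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One build rule: an output, the inputs it needs, and the command that makes it."""
    output: str
    inputs: tuple
    command: str


def go_rules(subdir, binary):
    """Hypothetical stand-in for gen-make-go: one binary from a standard Go layout."""
    return [Rule(f"bin/{binary}", (f"{subdir}/main.go",),
                 f"go build -o bin/{binary} ./{subdir}")]


def c_rules(subdir, binary):
    """Hypothetical stand-in for gen-make-c: one binary from a single C file."""
    return [Rule(f"bin/{binary}", (f"{subdir}/main.c",),
                 f"clang -o bin/{binary} {subdir}/main.c")]


def merge(fragments):
    """Merge per-subproject rule lists into one global, sorted list.

    fragments maps a subproject name to the rules it produced. Two
    subprojects defining the same output is treated as an error rather
    than letting one silently override the other.
    """
    owners = {}
    merged = []
    for name, rules in fragments.items():
        for rule in rules:
            if rule.output in owners:
                raise ValueError(
                    f"{rule.output} defined by both {owners[rule.output]} and {name}")
            owners[rule.output] = name
            merged.append(rule)
    return sorted(merged, key=lambda r: r.output)


def render(rules):
    """Render rules in Makefile syntax (recipe lines begin with a tab)."""
    lines = []
    for rule in rules:
        lines.append(f"{rule.output}: {' '.join(rule.inputs)}")
        lines.append(f"\t{rule.command}")
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    fragments = {
        "cmd/hello": go_rules("cmd/hello", "hello"),
        "tools/cat": c_rules("tools/cat", "cat"),
    }
    print(render(merge(fragments)))
```

Because each helper only returns data, a subproject with unusual needs could build its Rule list by hand without touching the Go or C helpers, which is the isolation the wishlist asks for.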

Problem: forgetting to run gen-make

To nobody's surprise, I often forget to run gen-make when I add a new file or change dependencies. Sometimes I notice this immediately because make insists that there's nothing to build. Other times I notice it the next day when I run gen-make for something unrelated and see the Makefile change unexpectedly.

Wishlist: automatically run gen-make if doing so would change Makefile. Since it takes 2 seconds, it's too slow to run it on every build (like I did when the monorepo was small).

Recursive make is tempting here (pause to let redo[7] fans have their moment). If the semantics were rich enough, a local gen-make script could produce a local Makefile that regenerates itself when directories change (new file, removed file). Then you could merge each subproject's Makefile into a global Makefile. On a given build, most gen-make scripts wouldn't run, so this should be fast.

Problem: subtle dependencies

Building Hello World with go build depends on the obvious things like hello.go, but it also depends on /usr/local/bin/go and the absence (or presence) of go.mod in the current directory or its ancestors. Depending on how tools change between releases, these dependencies can change in unexpected ways. I try to teach gen-make about these dependencies, but I miss one often enough to notice.

Wishlist: use ktrace(1)[8] and kdump(1)[9] to see everything that a build touches or considers touching. If any of those system calls might answer differently, the build needs to run again. I have a prototype of this and it works better than I expected.

Problem: go.mod, package.json, et al.

It's common practice to have a single, standardized file describing all third-party libraries.
When this file changes, a tool like npm downloads the updated libraries to a local cache. Unfortunately, a small change to package.json (for example) often triggers a rebuild of everything in the repository.

Wishlist: let a Makefile say something like: if this file is being rebuilt, wait for it to finish, but I don't actually care about its content.

Because my make tool does file checksums, I can fake it with an empty file package.json.done that's created after downloading third-party libraries. A build rule lists package.json.done as an input. The build rule for package.json.done lists package.json as an input. This causes downstream builds to wait until new libraries are available but doesn't trigger a rebuild by itself. This feels like a hack, but maybe it's cleaner than adding these semantics to a build tool.

1: https://man.openbsd.org/OpenBSD-7.0/stat.2
2: https://man.openbsd.org/OpenBSD-7.0/make.1
3: https://man.openbsd.org/OpenBSD-7.0/date.1
4: https://go.dev/blog/generate
5: https://pkg.go.dev/cmd/go/internal/run
6: https://github.com/gopherjs/gopherjs
7: https://redo.readthedocs.io/en/latest/
8: https://man.openbsd.org/OpenBSD-7.0/ktrace.1
9: https://man.openbsd.org/OpenBSD-7.0/kdump.1
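
The package.json.done trick described in the last section can be sketched with a toy, in-memory build tool. This is only an illustration under stated assumptions: the Builder class, the bundle.js output, and the file contents are hypothetical, and the real tool works on files on disk rather than a dictionary. The point it shows is that a rule reruns only when the checksum of one of its inputs changes, so an always-empty stamp file forces ordering without forcing a rebuild.

```python
import hashlib


class Builder:
    """Toy build tool: a rule reruns only when an input's checksum changes."""

    def __init__(self, files):
        self.files = dict(files)  # path -> bytes; stands in for the file system
        self.rules = {}           # output -> (inputs, action)
        self.last = {}            # output -> input checksums seen at the last build
        self.log = []             # outputs that were actually rebuilt, in order

    def rule(self, output, inputs, action):
        self.rules[output] = (list(inputs), action)

    def build(self, target):
        if target not in self.rules:
            return  # plain source file: nothing to build
        inputs, action = self.rules[target]
        for path in inputs:
            self.build(path)  # bring every input up to date first
        sums = [hashlib.sha256(self.files[path]).hexdigest() for path in inputs]
        if self.last.get(target) != sums:
            self.files[target] = action(self.files)
            self.last[target] = sums
            self.log.append(target)


def demo():
    b = Builder({"package.json": b'{"version": 1}', "app.js": b"main()"})
    # Pretend to download libraries, then write the empty stamp file.
    b.rule("package.json.done", ["package.json"], lambda files: b"")
    # The bundle waits on the stamp but never lists package.json itself.
    b.rule("bundle.js", ["app.js", "package.json.done"],
           lambda files: b"/* bundle */" + files["app.js"])

    b.build("bundle.js")                      # first build: everything runs
    b.files["package.json"] = b'{"version": 2}'
    b.build("bundle.js")                      # only the stamp rule reruns
    b.files["app.js"] = b"main(1)"
    b.build("bundle.js")                      # only the bundle reruns
    return b.log


if __name__ == "__main__":
    print(demo())
```

In the second build the stamp rule runs again, but its output is still empty, so the checksum seen by bundle.js is unchanged and the bundle is left alone.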