In the beginning, there was no need to build anything. We manipulated the toggle switches on the front panel of a mainframe, and the code either ran or it didn’t. Creating software was very straightforward. Maybe not easy, but certainly uncomplicated.
Today, every line of code could pass through a dozen or more steps. A software developer’s—or should we say “CPU instruction artist’s”?—magical keystrokes are transformed to create machine executables. This process, often called the build, is so elaborate that even medium-sized teams may dedicate one or more full-time engineers to managing the complicated dance.
How did we get here? Well, over the years, software developers solved problems by adding more steps to the build. Is the code too slow? Add an optimization process. Is the program suddenly breaking? Add some unit tests. Are there blind spots in the collection of unit tests? Add meta-tests that test the tests to see if they’re sufficiently testy.
All this clever innovation adds up. From time to time, it makes sense to pause and examine the state of the modern build pipeline. What’s new in the field? Are we still doing it right? Or is the old mechanism still working well enough that we don’t need to change anything? In the interest of advancing the field, here are nine new ways to build that may or may not be worth adding to your future build pipelines—or may even replace them completely.
1JPM: The native build language
In the past, Java developers relied on popular build tools like Ant, Maven, or Gradle. These handled the dependencies on standard libraries, executed various tests, and offered a nice, centralized mechanism for organizing all the steps in the process. Using them just meant spelling out the necessary steps in either XML (for Ant or Maven) or Groovy or Kotlin (for Gradle).
1JPM tosses all that aside. Instead of mastering a completely different syntax and language, 1JPM lets you write pure Java. That’s right, the build instructions for a Java program are written in Java itself.
1JPM is essentially a wrapper around Maven, so using it gives you access to all the infrastructure built up around that ecosystem. 1JPM’s developers realized that there were so many plugins and repositories for Maven that it would be foolish to try to duplicate them. Instead, 1JPM simply transforms your single Java build file into a Maven build file, the classic pom.xml.
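The underlying idea can be sketched in plain Java: describe the project with ordinary fields, then translate that description into a pom.xml that Maven can consume. (This is a hypothetical sketch; the class, field, and method names below are illustrative, not 1JPM’s actual API.)

```java
// Hypothetical sketch of the 1JPM idea: describe the build in plain Java,
// then emit the equivalent Maven pom.xml. Names are illustrative only.
public class Build {
    static String groupId = "com.example";
    static String artifactId = "demo-app";
    static String version = "1.0.0";
    // Dependencies in the familiar group:artifact:version form
    static String[] dependencies = { "org.junit.jupiter:junit-jupiter:5.10.2" };

    public static String generatePom() {
        StringBuilder pom = new StringBuilder("<project>\n")
                .append("  <modelVersion>4.0.0</modelVersion>\n")
                .append("  <groupId>").append(groupId).append("</groupId>\n")
                .append("  <artifactId>").append(artifactId).append("</artifactId>\n")
                .append("  <version>").append(version).append("</version>\n")
                .append("  <dependencies>\n");
        for (String dep : dependencies) {
            String[] parts = dep.split(":");
            pom.append("    <dependency>\n")
               .append("      <groupId>").append(parts[0]).append("</groupId>\n")
               .append("      <artifactId>").append(parts[1]).append("</artifactId>\n")
               .append("      <version>").append(parts[2]).append("</version>\n")
               .append("    </dependency>\n");
        }
        return pom.append("  </dependencies>\n</project>\n").toString();
    }

    public static void main(String[] args) {
        // Print the pom.xml that a build-translation step would write to disk
        System.out.print(generatePom());
    }
}
```

The appeal is that the whole description lives in one language: the same IDE, type checker, and refactoring tools that work on your application code also work on your build.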
Use a notebook instead
Notebook formats are among the most popular ways to deliver code for data science, and serve as a replacement for traditional builds in some settings. The notebook format mixes regular text with blocks of code and access to data in a way that lets the reader absorb the message of the text while clicking on the code to run it. Readers can’t complain that the author is hiding any information or failing to adequately describe any data analysis steps because everything is bundled together in one place.
There are now several good notebook formats supporting various languages. The original format, the IPython Notebook (.ipynb), is still one of the most common underlying formats for Jupyter notebook websites. It led to other formats that imitate the style of bundling rich text, interactive graphics, and raw code—formats like R Markdown (.rmd) built around the R language, Quarto (.qmd) if you want to combine R and Python, and MyST Markdown (.md), which folds R and Python code into regular markdown files.
There are even some meta formats like JupyterLab (.jlpb) and Observable Notebook (.ipynb), which bundle together multiple notebooks into websites for collecting data and observations.
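Under the hood, these formats are fairly simple. A .ipynb file, for instance, is just JSON: a list of cells, each marked as markdown or code, plus some metadata. A minimal skeleton looks something like this:

```json
{
  "nbformat": 4,
  "nbformat_minor": 5,
  "metadata": {},
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": ["# The analysis, explained in prose"]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": ["print('and the code that backs it up')"]
    }
  ]
}
```

Because the format is plain JSON, build tooling can diff, lint, or strip the output cells of a notebook just like any other source file.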
The success of notebooks has encouraged some communities to adapt Jupyter notebooks for their own language by creating a separate kernel that manages the work of compiling and running the code. Kernels now exist for many languages popular in the scientific community, including MATLAB, Julia, Java, Scala, Scheme, and Octave.
Low-code build automation
The buzzwords “low code” and “no code” are usually aimed at non-programmers who don’t want to write code at all. Sadly, the tools behind them rarely deliver the level of no-sweat magic that such users really want. But even if it’s difficult to abstract away all the complexity of writing good software, some low-code platforms are finding fans in developers’ cubicles. It turns out these tools can automate much of the build pipeline, so developers can concentrate on crafting the essence of the work.
REI3 is just one example of a low-code approach to building web applications with just a few lines. It’s meant to support the kind of work that’s just a bit too complex for a spreadsheet. The open source system distributed under the MIT license can be installed locally or in the cloud. Users, who are often developers, can then create applications to juggle the data. When functions need to be written in code, there’s a built-in editor that acts like an IDE. The system handles the entire process of building and distributing applications.
This approach is now pretty common among the different data analysis and system management tools distributed for the enterprise marketplace. Businesses can spin up data-storage back ends, then let users and developers build out the analytical dashboards necessary for keeping the business running.
Java as a scripting language
Long ago, Java was a compiled language for big projects and JavaScript was its tiny cousin, glued onto the web browser for doing a bit of quick automation. Over the years, smart developers have evolved JavaScript into a vast ecosystem that supports big projects with just-in-time compilation. And at the same time, Java has gone in the opposite direction, imitating a scripting language.
It’s now possible to write a Java program in a single file, call it X.java, and then all it takes to run it is the command-line incantation java X.java. The Java platform infrastructure does everything else. All the compilation and other building is done under the hood. Programming this way is so simplified that developers working on smaller programs don’t even have to bother with build configuration.
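A minimal example (the file name Hello.java is arbitrary): since JDK 11, the java launcher compiles and runs a single source file in one step, with no explicit javac invocation and no build file.

```java
// Hello.java — run directly with `java Hello.java` (JDK 11+);
// compilation happens invisibly, in memory.
public class Hello {
    static String greet(String name) {
        return "Hello, " + name + "!";
    }

    public static void main(String[] args) {
        System.out.println(greet(args.length > 0 ? args[0] : "world"));
    }
}
```

For quick scripts and teaching examples, this removes the entire edit-compile-run ceremony that Java was once known for.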
Cost estimates in the build pipeline
How much will this code cost to run? How much hardware will it need? Developers aren’t traditionally attuned to questions like these because they concentrate on metrics like execution time and RAM consumption. Even though costs can balloon when success strikes and hordes of users come knocking, developers are often busy adding extra layers of customization and configuration just to make the code easier to maintain. How to pay for it all is another department’s concern.
Now, tools like Infracost, Finout, and LimeOps are helping devops teams keep costs under control by making financial estimates available all along the build pipeline. For example, a team might use one of these tools to estimate how much a pull request would save—or cost. Having this data built in at the level of design and architectural decisions helps everyone make better choices.
Documentation in the build pipeline
In the dark ages of programming, documentation was kind of an afterthought, something to do when all of the heavy lifting and bitbanging was done. Lately, like cost estimates, documentation is becoming part of the build process itself.
Some languages offer specific formats like JavaDoc or PyDoc that define a rigid structure for the comment sections. Comments in these formats can automatically be turned into documentation pages on websites.
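For instance, a JavaDoc comment follows a rigid structure of @param and @return tags, which the javadoc tool can turn into HTML reference pages automatically as part of the build:

```java
/** Utility math helpers, documented in the standard JavaDoc format. */
public class MathUtil {
    /**
     * Clamps a value into the inclusive range [lo, hi].
     *
     * @param value the value to clamp
     * @param lo    the lower bound of the range
     * @param hi    the upper bound of the range
     * @return value if it lies within [lo, hi], otherwise the nearest bound
     */
    public static int clamp(int value, int lo, int hi) {
        return Math.max(lo, Math.min(hi, value));
    }

    public static void main(String[] args) {
        System.out.println(clamp(15, 0, 10));
    }
}
```

Running `javadoc MathUtil.java` produces browsable documentation with no extra writing, which is exactly why build pipelines increasingly include a documentation step.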
Other languages are going further. Some data scientists who craft R or Python use tools like Sweave, Pweave, or knitr to write their final presentation and code at the same time. The idea is to bundle the data analysis software with the data analysis text so that the tables, graphs, and figures are generated at build time, directly from the raw data. The build pipeline first executes the R or Python code and then glues these results into some LaTeX that will be typeset into a finished document. When the process works well, anyone coming along later will find both the raw data and the raw text, so they can completely understand the science.
DSLs for simpler builds
While the people who built domain-specific languages didn’t think they were working on the build pipeline, they’ve somehow simplified it anyway. DSLs often require many fewer libraries because much of the complexity for particular tasks has been absorbed by the definition of the language itself. In some cases, semi-standard tests are also included.
In some cases, the DSL is part of a larger tool system or integrated development environment that can simplify the build process even further. Some good examples include game development languages like UnrealScript, financial analysis foundations like QuantLib, or robot optimization software like the Robot Operating System.
Locally generated libraries
In the past, programmers were happy to let someone else build their libraries and other components. They would just download a binary and be done with it.
But as more stories emerged about libraries corrupted by backdoors or other malware, some of the most careful and paranoid teams started adding an extra step to the build process: rebuilding third-party libraries locally from source. Sure, strange things could still be hiding in that code, but at least they are more likely to be spotted. Nowadays, these teams quietly rebuild all of their third-party dependencies to mitigate the danger of a compromised codebase.
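Rebuilding locally also gives a team something to compare against: when a build is reproducible, the digest of the locally rebuilt artifact should match the digest of the published one. A minimal sketch of that comparison step using the JDK’s MessageDigest (JDK 17+ for HexFormat; the jar paths passed on the command line are hypothetical):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;

public class VerifyArtifact {
    /** Returns the SHA-256 digest of the given bytes as lowercase hex. */
    public static String sha256(byte[] bytes) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        return HexFormat.of().formatHex(md.digest(bytes));
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 2) {
            // Hypothetical usage: the downloaded jar vs. the locally rebuilt one
            byte[] downloaded = Files.readAllBytes(Path.of(args[0]));
            byte[] rebuilt = Files.readAllBytes(Path.of(args[1]));
            System.out.println(sha256(downloaded).equals(sha256(rebuilt))
                    ? "OK: artifacts match"
                    : "WARNING: digests differ");
        } else {
            // Demo on in-memory bytes when no files are supplied
            System.out.println(sha256("demo".getBytes()));
        }
    }
}
```

A mismatch doesn’t prove tampering (timestamps and tool versions can perturb a build), but it flags exactly the artifacts worth a closer look.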
GenAI in the build pipeline
One of the most surprising and humbling developments is how successful large language models are at creating code. All you need to do is present a programming query, and an LLM can produce very good results written in syntactically correct Python, Java, or any of a dozen other major languages.
It’s not yet as obvious how AIs can help with the build pipeline. In the last few weeks, I’ve been iterating on several applications while asking various LLMs to write the code. While they’re often able to do up to 95% of a task perfectly, they still get several things wrong. When I point out the problem, the LLMs respond very politely, “You’re absolutely right …” If they know it after I point it out, why didn’t they know it beforehand? Why couldn’t they finish the last 5% of the job?
That’s a question for the future. For now, build engineers are finding other ways to use LLMs. Some are summarizing code to produce better high-level documentation. Some are using natural language search to ask an AI companion where a bug started. Others are trusting LLMs to refactor their code to improve reusability and maintenance. One of the most common applications is creating better and more comprehensive test cases.
LLMs are still evolving, and we’re still understanding how well they can reason and where they are likely to fail. We’re discovering just how much context they can absorb and how they can improve our code. They will add more and more to the build process, but it will be some time before those improvements appear. Until then, we’re going to need to manage how the parts come together. In other words, we humans will still have a job maintaining the build pipeline.