Packages in the time of generative AI

Sean Gillies

2024-01-10 15:58

What does it mean? Where is this all headed? Does anybody know? Let's start from where we are. What do we know?

What do large language models and code generation mean for authors and maintainers of software packages and libraries? Where are we headed? I'm the author and maintainer of several packages that generative AI (dropping the scare quotes to save keystrokes) will recommend to you in the geospatial domain, and I don't have the answers to those questions. The sober analysis that I've read, like in Thoughtworks articles, tends to be written for engineering leaders building B2B software, and I haven't seen any insights specifically about packages or packaging.

Companies building and selling generative AI claim, of course, that the technology changes absolutely everything. Microsoft proposes to add a CoPilot button to keyboards, for example. That's how seriously they'd like us to take it. I'm less gung-ho. The technology stands to change some things, but what and by how much? I'm going to poke at the question a bit on my blog from the perspective of an open source package author and maintainer.

I haven't used any tools for generating code from a prompt, at my day job or in open source work. I like autocompletion of identifiers when working in an IDE, for sure, but I don't use anything more fancy than that. In starting to consider the impact of generative AI on packages, I'm not thinking of any particular model or product. I don't really care which of them are better or worse than others, or how much better ChatGPT 4.5 will be compared to last year's ChatGPT 4. I'm more interested in how developers will use the tech and how it may or may not change their expectations. I'm not enthusiastic or optimistic about any of this, but it seems dumb to put my head in the sand and ignore what's happening in the industry.

For better or worse, many developers and analysts seem cautiously to wildly enthusiastic about tools like ChatGPT and CoPilot. Let’s assume for a few minutes that this isn't only a bubble and that the interest in AI code generation continues and becomes a mindset: generated code is good enough for production, and artisanal code is something to be written sparingly.

I read a post recently, linked for other reasons, that among other things expressed a wish for AI code generation:

I think AI code generation has the potential to replace software libraries in many cases. When you use a third-party library, you’re importing code that someone else has written to serve a particular function: parsing an RSS feed, say, or sorting an array of variables in a particular way. With code generation, you can prompt the engine to add RSS parsing or array sorting code, and it’ll appear as if by magic. And, unlike a library, that code will be written for you, and may be a better fit for how you want the software you’re writing to work.

Will code generation replace packages entirely? For simple things like left padding a string, it's likely, and probably for the best. More complicated tasks, like opening a raster dataset and slicing it, won't be generated without library dependencies any time soon, I think. If you ask ChatGPT for code to open and slice a raster dataset using Python, you’ll get fairly concise code that uses rasterio, not thousands of lines of plausibly original geospatial format driver code.

If ChatGPT, CoPilot, et al don't suggest a package, does it even exist anymore to developers who are using generative AI tools all the time? How will

Can it replace convenience functions? By that I mean the non-essential code that package users have traditionally loved to use and to contribute to open source projects. Yes, possibly. For example, given the prompt

Generate code to determine the minimum integer or floating point data type required to represent rasterio values without losing information.

within the context of some other Python raster data prompting, ChatGPT yields the following code.

import rasterio
import numpy as np

# Replace 'your_raster_file.tif' with the path to your raster file
raster_file = 'your_raster_file.tif'

# Open the raster file
with rasterio.open(raster_file) as src:
    # Read raster values into a NumPy array
    raster_values = src.read(1, masked=True)  # Change the band number if needed

    # Find the minimum and maximum values in the raster
    min_val = np.nanmin(raster_values)
    max_val = np.nanmax(raster_values)

    # Determine the minimum data type required based on the range of values
    if np.issubdtype(raster_values.dtype, np.floating):
        min_dtype = np.float32 if min_val >= np.finfo(np.float32).min and max_val <= np.finfo(np.float32).max else np.float64
    else:
        min_dtype = np.uint8 if min_val >= 0 and max_val <= np.iinfo(np.uint8).max else \
                    np.uint16 if min_val >= 0 and max_val <= np.iinfo(np.uint16).max else \
                    np.uint32 if min_val >= 0 and max_val <= np.iinfo(np.uint32).max else \
                    np.int16 if min_val >= np.iinfo(np.int16).min and max_val <= np.iinfo(np.int16).max else \
                    np.int32 if min_val >= np.iinfo(np.int32).min and max_val <= np.iinfo(np.int32).max else np.float64

    print(f"The minimum data type required to represent raster values without losing information: {min_dtype}")

This is not a terrible take on rasterio’s get_minimum_dtype() (https://github.com/rasterio/rasterio/blob/main/rasterio/dtypes.py#L147). It is presumably derived from rasterio’s code to some extent.

It looks like developers will be able to generate convenience functions for themselves, and they won’t have to maintain them because they can “summon” them as needed and literally paste them into their projects. These functions might not have to “live” anywhere anymore.
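As a rough illustration of what a pasted-in helper of this kind might look like once it is detached from any file I/O, here is a minimal sketch that depends only on NumPy. The name smallest_dtype is hypothetical, this is not rasterio's implementation, and the float32 round-trip check is an assumption about what "without losing information" should mean for floating point data.

```python
import numpy as np


# Hypothetical helper; not part of rasterio or NumPy.
def smallest_dtype(values):
    """Return the name of the smallest NumPy dtype that holds all values."""
    arr = np.asanyarray(values)
    if arr.size == 0:
        raise ValueError("cannot determine a dtype for an empty array")

    if arr.dtype.kind in "iu":
        # Use exact Python ints so comparisons never overflow or round.
        lo, hi = int(arr.min()), int(arr.max())
        # Prefer unsigned types for non-negative data, signed otherwise.
        if lo >= 0:
            candidates = (np.uint8, np.uint16, np.uint32, np.uint64)
        else:
            candidates = (np.int8, np.int16, np.int32, np.int64)
        for candidate in candidates:
            info = np.iinfo(candidate)
            if info.min <= lo and hi <= info.max:
                return np.dtype(candidate).name

    if arr.dtype.kind == "f":
        # Assumption: float32 is acceptable only if every value survives a
        # round trip through float32 unchanged (NaNs compare as equal).
        with np.errstate(over="ignore"):
            as32 = arr.astype(np.float32)
        same = np.array_equal(
            as32.astype(np.float64), arr.astype(np.float64), equal_nan=True
        )
        return "float32" if same else "float64"

    raise TypeError(f"unsupported dtype: {arr.dtype}")
```

Called with an array read from a raster band, for example smallest_dtype(raster_values), it returns a dtype name such as "uint8" instead of printing it, which makes it easier to test after pasting.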

Maintaining non-essential code in a package is a drag. I’d rather be working on higher-impact problems. Will ChatGPT/CoPilot allow me to deprecate and remove convenience functions at a faster pace?

If I yank a convenience function from rasterio, will developers be able to trivially replace it in their session using code generation? Do tools for this already exist? Would prompt-able error messages help? Or are they not needed?
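To make the idea of a prompt-able error message a little more concrete, here is one purely speculative sketch: a stand-in left behind for a removed function, which raises an error carrying a prompt that a developer could paste into a code generation tool. RemovedFunctionError, removed_with_prompt, and the message format are all hypothetical; nothing like this exists in rasterio, and I don't know of an established convention for it.

```python
# Hypothetical sketch; these names are illustrative, not an existing API.
class RemovedFunctionError(AttributeError):
    """Raised when a caller uses a function that has been removed."""


def removed_with_prompt(name, prompt):
    """Build a stand-in for a removed function whose error carries a prompt."""

    def stub(*args, **kwargs):
        raise RemovedFunctionError(
            f"{name}() has been removed from this package. "
            f"To generate a replacement, try this prompt: {prompt}"
        )

    stub.__name__ = name
    return stub


# Illustration only: the stand-in reuses the prompt quoted earlier in the post.
get_minimum_dtype = removed_with_prompt(
    "get_minimum_dtype",
    "Generate code to determine the minimum integer or floating point data "
    "type required to represent rasterio values without losing information.",
)
```

Calling get_minimum_dtype(values) would then fail loudly with text that doubles as a starting prompt. Whether that is any more useful than an ordinary deprecation notice is exactly the open question.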

I must say that I resent having to figure the implications of this stuff out for myself while the executives selling this technology are getting ready to print money for themselves and their investors.