Packages in the age of generative AI

What do “AI” and code generation mean for authors and maintainers of software packages and libraries? I'm the author and maintainer of several geospatial packages that generative AI (dropping the scare quotes to save keystrokes) will recommend to you. Generative AI companies claim that the technology changes everything. Microsoft proposes to add a Copilot button to keyboards, for example. I'm less gung-ho. Certainly the technology stands to change some things, but what and by how much? I'm going to poke at the question a bit on my blog from the perspective of an open source package author and maintainer.

For better or worse, many developers and analysts seem anywhere from cautiously to wildly enthusiastic about tools like Copilot. Let’s assume for a few minutes that this isn't only a bubble and that the interest in AI code generation continues and becomes a predominant mindset.

I read a post recently, linked for other reasons, that among other things expressed a wish for AI code generation:

I think AI code generation has the potential to replace software libraries in many cases. When you use a third-party library, you’re importing code that someone else has written to serve a particular function: parsing an RSS feed, say, or sorting an array of variables in a particular way. With code generation, you can prompt the engine to add RSS parsing or array sorting code, and it’ll appear as if by magic. And, unlike a library, that code will be written for you, and may be a better fit for how you want the software you’re writing to work.

Will code generation replace packages entirely? For super simple stuff like “left pad”, very likely. For more complicated tasks like opening a raster dataset and slicing it, not any time soon, I think. If you ask ChatGPT to do so using Python, you’ll get code that uses rasterio.

But can it replace convenience functions, the non-essential code that package users have traditionally loved to use and to contribute to open source projects? Yes, possibly. For example, given the prompt

Generate code to determine the minimum integer or floating point data type required to represent rasterio values without losing information.

within the context of some other Python raster data prompting, ChatGPT yields the following code.

import rasterio
import numpy as np

# Replace 'your_raster_file.tif' with the path to your raster file
raster_file = 'your_raster_file.tif'

# Open the raster file
with rasterio.open(raster_file) as src:
    # Read raster values into a NumPy array
    raster_values = src.read(1, masked=True)  # Change the band number if needed

    # Find the minimum and maximum values in the raster
    min_val = np.nanmin(raster_values)
    max_val = np.nanmax(raster_values)

    # Determine the minimum data type required based on the range of values
    if np.issubdtype(raster_values.dtype, np.floating):
        min_dtype = np.float32 if min_val >= np.finfo(np.float32).min and max_val <= np.finfo(np.float32).max else np.float64
    else:
        min_dtype = np.uint8 if min_val >= 0 and max_val <= np.iinfo(np.uint8).max else \
                    np.uint16 if min_val >= 0 and max_val <= np.iinfo(np.uint16).max else \
                    np.uint32 if min_val >= 0 and max_val <= np.iinfo(np.uint32).max else \
                    np.int16 if min_val >= np.iinfo(np.int16).min and max_val <= np.iinfo(np.int16).max else \
                    np.int32 if min_val >= np.iinfo(np.int32).min and max_val <= np.iinfo(np.int32).max else np.float64

    print(f"The minimum data type required to represent raster values without losing information: {min_dtype}")
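For comparison, here is a minimal sketch of the same range check written as a small standalone function of the kind a package might ship. It is pure Python, handles integer values only, and is not rasterio's or NumPy's implementation. The name `minimum_int_dtype_name` and the candidate ordering are assumptions made for illustration. Unlike the generated chain above, it also tries int8 and the 64-bit integer types instead of falling back to float64.

```python
# Hypothetical helper for illustration; not part of rasterio or NumPy.
# Candidate integer types, narrowest first, unsigned before signed.
CANDIDATES = []
for bits in (8, 16, 32, 64):
    CANDIDATES.append((f"uint{bits}", 0, 2**bits - 1))
    CANDIDATES.append((f"int{bits}", -(2 ** (bits - 1)), 2 ** (bits - 1) - 1))


def minimum_int_dtype_name(values):
    """Return the name of the narrowest integer type that can hold all values."""
    values = list(values)
    if not values:
        raise ValueError("no values given")
    if not all(isinstance(v, int) for v in values):
        raise TypeError("this sketch handles integer values only")

    lo, hi = min(values), max(values)
    for name, type_min, type_max in CANDIDATES:
        # A type qualifies when its range covers the whole span of the data.
        if type_min <= lo and hi <= type_max:
            return name
    raise ValueError("values do not fit in any 64-bit integer type")
```

The returned names match NumPy's dtype names, so a result can be passed to `np.dtype`. Floating point data would need a separate precision check, which this sketch leaves out.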