How Can AI Tools Actually Improve Development Workflows?
Looking back at the short history of AI development tools to see how they have actually improved our workflows, and how they still could.

Back in 2021, I wrote one of my first articles ever. Directly inspired by my use of a brand-new tool, it was called “A Deep Dive into GitHub Copilot”. Copilot, which had come out a month before I wrote that article, was absolutely fascinating to me. A machine capable of accurately anticipating the next lines of a program was a dream come true.
Copilot was a precursor to the AI boom itself, which really kicked off the following year. For many developers, it was their first practical interaction with a large language model of any kind. The tool, built on OpenAI’s “Codex” model (a descendant of GPT-3), blew the competition out of the water. I remember using Tabnine and other tools before it, but Copilot completely stole my attention. Since then, between Copilot and the AI tools that followed, things have absolutely blown up.
It’s hard to go a day without some new tool being released that promises to revolutionize the development landscape. Unfortunately, many of these tools haven’t panned out. While Copilot initially drew interest from a large share of developers, a growing number now find it counterproductive.
Even so, more developers than ever are using AI tooling. In the past year, 62% of developers used AI tools in their workflow, up from 44% the year prior. This prompts important questions about the future of development and about how this popular new technology should be integrated. Almost three years after its release, Copilot and the tools that followed have taught us several important lessons.
Types of tools
Before we can break down the lessons learned over the past few years, we need to understand the state of the landscape. What kinds of tools are there? What purposes do they serve? To start, I’ll just run through a few types of language-model-integrated development tool:
IDEs: Integrated Development Environments, structured around language model integrations across the board
Git clients: AI integrations into Git version control
Assistants: Chat-based tools for working with language models conversationally
Shell Assistants: Same thing as assistants, but in the shell
Agents: Autonomous programs which can perform more complex tasks through a combination of reasoning and tool use
App Generators: Pipelines which take a description of an application, and create it end-to-end
UI Generators: Pipelines which take a description of a UI component, and create it end-to-end
Snippet Generators: Translations and generations of more discrete pieces of data
Documentation Tooling: Tools for generating or maintaining documentation for your project so you don’t have to
Code Generation: Tools for actually doing software development… maybe
Search Tools: Tools for doing complex code-search through a repository
Testing Tools: Tools for testing your project
Foundation Models: The big boys, built by big companies, upon which other tools are built
So, there are many types of tool that you may encounter as a developer in 2024, either “enhanced” by or completely based on language models. It makes sense, as well, that developers would notice new automation opportunities within their workflows, and attempt to make them real. Of these many categories, however, I’d say that there are a few key approaches to tooling that have come from the AI boom, independent of specific workflow attachment.
First, we have inline completion tools. These tools, such as Copilot and its alternatives, seek to improve developer productivity by speeding up the lowest-level operation: insertion. Writing code is the fundamental additive operation of software development. If new logic is required, developers must write new code to perform that logic, and this is where Copilot does its work.
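To make that “insertion” framing concrete, here’s an illustrative sketch (not actual Copilot output) of the kind of repetitive code these tools tend to fill in well: you type the signature, and the tool proposes the boilerplate body.

```python
from dataclasses import dataclass

@dataclass
class User:
    id: int
    name: str
    email: str

# You write the signature (and maybe a comment); an inline completion tool
# typically proposes the repetitive body that follows.
def user_to_dict(user: User) -> dict:
    return {
        "id": user.id,
        "name": user.name,
        "email": user.email,
    }
```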
Second, we have conversational tools. ChatGPT is probably the best-known name in the space, but Anthropic’s Claude and Google’s Gemini are also popular examples. Essentially, these tools attempt to be a sort of oracle that you can ask for answers to your questions about the universe, or about code. These interfaces can be enhanced by helpful features such as retrieval-augmented generation, or other integrations with editor tooling.
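For anyone who has only used these tools through the chat window, here’s a minimal sketch of what the same conversational interaction looks like programmatically, using the OpenAI Python SDK (the model name and prompts are just examples):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask a code question conversationally; the "system" message steers the tone.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name; any chat model works here
    messages=[
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "What does the regex ^\\d{4}-\\d{2}-\\d{2}$ match?"},
    ],
)
print(response.choices[0].message.content)
```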
The third major type of tool is the agent. Agents are all the rage at the moment, and while they’re not extremely advanced yet, they are the area that I think shows the most promise. Agents are capable of executing multi-step tasks by combining reasoning with tool use, allowing for more complex overall task completion. There are accuracy and architectural pitfalls still to work out, but as the designs mature, we can expect significant utility from these tools.
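To give a rough idea of what “reasoning and tool use” means in practice, here’s a minimal sketch of the loop most agents are built around. The tools, the `call_model` stand-in, and the action format are all hypothetical simplifications; real agents layer planning, memory, and error handling on top of this.

```python
# Minimal sketch of an agent loop: the model chooses a tool, we execute it,
# and the observation is appended to the context until the model finishes.
# `call_model` is a stand-in for whatever LLM API the agent is built on.

def run_tests() -> str:
    # Placeholder tool: a real agent would shell out to the test runner.
    return "2 failed: test_login, test_signup"

def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

TOOLS = {"run_tests": run_tests, "read_file": read_file}

def run_agent(task: str, call_model, max_steps: int = 10) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        # The model is asked to return either ("tool", name, argument)
        # or ("finish", final_answer), given the history so far.
        action = call_model("\n".join(history), list(TOOLS))
        if action[0] == "finish":
            return action[1]
        _, name, argument = action
        observation = TOOLS[name](argument) if argument else TOOLS[name]()
        history.append(f"{name}({argument or ''}) -> {observation}")
    return "Stopped: step limit reached without finishing"
```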
(For a comprehensive and maintained list of AI tools, check out this GitHub Repository)
Cognitive Atrophy and Complements
Across the entire development workflow, tool developers have found ways to integrate language models, in the hopes of creating something more useful than before. However, to what degree has that been achieved? Not only that, but what lessons have been learned over the course of the past few years about successful integration of AI tools into workflows?
Well, let’s start with Copilot. Released as a technical preview in June 2021, Copilot was initially met with both skepticism and excitement. Developers like me found the tool very useful and a significant time-saver when writing code. It was sometimes inaccurate, but for boilerplate and repetitive tasks, it was extremely helpful. However, there were significant concerns.
First, Copilot had been trained on open-source code, which could be considered unethical depending on how well software license compliance was handled. And even in open-source codebases, copyrighted code could still be present. Whether it was code included illegally or code used legally under a specific license, it was possible that Copilot had been trained on problematic code. In that case, if a developer used the tool and added copyrighted code to their codebase, legal issues could ensue.
Of course, that’s a legal technicality. We’re more interested in the productivity implications. Copilot, while a positive force in productivity for many, has given rise to a phenomenon known as the “Copilot Pause”. Essentially, a user of an application like Copilot might anticipate the completion and suspend their thinking while waiting for the tool to write the code for them. To some, this can be easily avoided by just… not doing that.
However, to others, this is evidence of the formation of a mental crutch, whereby a developer offloads cognitive effort to Copilot. While this is perhaps the point of the tool, there are reasonable fears about whether this is a positive direction for the industry. Some report a sort of coding muscle atrophy after significant use of the tool, where skills are lost over time due to a lack of use. This problem isn’t exclusive to Copilot either; any tool that automates something which previously required a specific skill can be expected to produce a similar outcome.
This isn’t necessarily bad, especially if the new outcome is a reasonable replacement for what we had before. If a tool like Copilot can produce good code, then why should we require that it be human-made? More on that later. It seems that we must keep in mind that leaning too heavily on tools like Copilot is likely to lead to some sort of cognitive atrophy. However, that might be an issue with mindset, not the tool itself. Perhaps these tools are best thought of as complements, rather than replacements.
ChatGPT, for example, can’t replace developers when writing code, because the interface is inherently conversational. It can generate code, for sure, but it’s then up to the developer to actually utilize it and put it where it needs to go. Of course, many will simply copy and paste without any validation in between, but at least there is a step in between: to get the code, you first have to ask for it. The conversational interface requires some thinking, which offsets the atrophy a bit.
It’s also worth noting that thinking of language models as complementary is a good idea anyway. Prompt engineers know this: a language model’s performance is directly connected to the quality and content of the prompt. When it navigates its semantic space, it does so based on the tokens in the input sequence, so you can coerce specific behavior from it. The more you require the language model to assume, the more it will assume without question. So one way to quickly improve a language model’s performance is to give it clearer directions, which can often be done best through conversation.
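As a quick illustration (these prompts are made up for the example), compare how much a vague request leaves to the model’s assumptions versus a specific one:

```python
# Illustrative only: the same request with and without clear direction.
# The vague prompt forces the model to assume the language, the error
# handling, and the output format; the specific prompt removes those guesses.

vague_prompt = "Write a function that parses dates."

specific_prompt = (
    "Write a Python function parse_date(s: str) -> datetime.date that accepts "
    "dates in ISO 8601 format (YYYY-MM-DD), raises ValueError on any other "
    "input, and includes a docstring with two usage examples."
)
```

The second prompt costs a few extra seconds to write, but every constraint it states is one less thing the model has to guess on your behalf.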
However, we shouldn’t discredit tools like Copilot. This need for a point of reference suggests that having additional code ready to go can drastically improve the accuracy of the tool. Inline LLM suggestions might be best when you simply want a recurring pattern repeated and you have reference points within Copilot’s context window. In cases where you want to be more instructive, you can turn to a tool like ChatGPT, which lets you more selectively provide context for a generation.
Meanwhile, tools like the now-infamous Devin have drawn widespread criticism for their significant error rates. A tool like Devin is an agent, and it’s certainly impressive: it can complete GitHub tickets, implement features, and so on. However, this is where we run into one of the fundamental issues with agents as a whole: the accuracy rate.
An agent with a 90% accuracy rate sounds pretty good, right? Well, unfortunately, that 10% failure rate means you will have to verify every single output from the model to catch the failures. Even as accuracy improves, a 99%-accurate model still requires a validation step because of the 1 time in 100 that it fails. The accuracy itself isn’t the problem; human developers aren’t always accurate in one shot either.
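A quick back-of-the-envelope calculation shows why even high per-task accuracy doesn’t remove the need for review: assuming tasks fail independently, the chance that a whole batch comes back clean drops off quickly.

```python
# Chance that every task in a batch succeeds, assuming each task succeeds
# independently with the given per-task accuracy.
def all_succeed(accuracy: float, tasks: int) -> float:
    return accuracy ** tasks

for accuracy in (0.90, 0.99):
    for tasks in (10, 100):
        print(f"{accuracy:.0%} accurate, {tasks} tasks all correct: "
              f"{all_succeed(accuracy, tasks):.1%}")

# 90% accurate, 10 tasks all correct: 34.9%
# 90% accurate, 100 tasks all correct: 0.0%
# 99% accurate, 10 tasks all correct: 90.4%
# 99% accurate, 100 tasks all correct: 36.6%
```

Even at 99% per-task accuracy, a hundred independent tasks have only about a one-in-three chance of all being correct, so a human check on each output remains necessary.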
A junior developer is often seen as a hire that carries risk. Since a junior is less experienced, they are given less responsibility at first and must earn more through experience. A language model doesn’t really have the ability to adapt and become better at a specific task, or to take responsibility for its actions. Thus, a language model’s work must pass through a human review stage before significant changes are implemented, which can be quite tedious for the reviewer. The time saved in code generation might be offset by the review stage, since the human didn’t take part in the initial creation.
It remains to be seen whether a tool like Devin will actually be adopted in corporate environments, and what that might look like. In the meantime, I predict that the best LLM-based tools will emphasize complementing humans over replacing them. At least for now.
Responsible Use
Given the lessons that we’ve learned, how can individuals like you and me best use AI tools in our workflows? Well, as with all automation, it comes down to balancing convenience and quality. For example, code that we allow an LLM to generate is limited by the performance of that specific model. There is a fine line to walk with language models, and finding that balance may be difficult.
For example, you could allow the LLM to generate code to complete a requirement, and then simply review it later. In that case, you’ve offloaded the effort of generating the code to another system, in the name of convenience. While you may review the code and determine whether it fits the requirements, your cognitive scope has now been narrowed to the model’s suggestion. There are potential solutions to the original problem that you’ll now never stumble across, because you skipped the brainstorming process.
That might be a benefit, as it means you don’t have to spend as much time traversing the mental space yourself. Additionally, you’d have a point of reference for brainstorming, through which a better answer might be found. However, we have to keep in mind that language models are static and inherently lack innovative potential. They are approximations of language, so we would expect them to reinforce patterns based on their prominence in the training data, not necessarily their quality.
The tradeoff for convenience and offloaded effort is a loss of quality and thoroughness. What seems worth it for the sake of convenience may become critically inconvenient down the line if the result is of poor quality. As developers, and going further, as software engineers, balancing tradeoffs is our job. So I’d say that to use AI tools responsibly, it’s best to understand them as automations and abstractions, and to balance that critical tradeoff to maximize both productivity and quality.
For my part, I’ve really enjoyed using the pre-1.0 Zed editor, and I’ll probably end up writing about it in the future. The language model integration is simple, yet convenient and intuitive. It feels like a more organic extension of the editor’s functionality than a completely new feature, which I like. I would recommend checking it out.
Author’s Note
I’ve found this topic fascinating, as this is obviously an area of endless curiosity. Software development is set to see some strange times due to advancements in technology, and we’ll probably see a lot of downsides to new trends. However, finding real value in these things is a worthwhile pastime. If you haven’t already, check out last week’s post on the past, present, and future of software development.
As always, thank you so much for reading, and goodbye :)
Notice to Subscribers
Just a heads up, I’m going on vacation next week! Unfortunately, this means that there will be no post on August 19th, but you can expect one the week after, on August 26th! You can, however, expect Byte Sized posts as usual :)
Credits
Thumbnail:
Bilal Azhar at https://substack.com/@intelligenceimaginarium
Music: Track - Feeling Good by Pufino, Source - https://freetouse.com/music, Free Music No Copyright (Safe)