Motivation

There were two main motivations fuelling this project: first, to see how reliable ChatGPT is when generating code from English prompts, and second, to see whether Leetcode problems will become an antiquated means of judging ability during job interviews if ChatGPT can answer any Leetcode problem instantly.

Prompt

The exact prompt that was fed to ChatGPT is as follows:
"Using <language>, code a solution for the following prompt:
<problem_content>
Begin your solution with:
<problem_header>"
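
As a minimal sketch of how this template could be filled in for each problem (the names `PROMPT_TEMPLATE` and `build_prompt` are illustrative assumptions, not taken from the project's code):

```python
# Hypothetical helper: the template text mirrors the prompt quoted above,
# but the constant and function names are assumptions for illustration.
PROMPT_TEMPLATE = (
    "Using {language}, code a solution for the following prompt:\n"
    "{problem_content}\n"
    "Begin your solution with:\n"
    "{problem_header}"
)


def build_prompt(language: str, problem_content: str, problem_header: str) -> str:
    """Fill the template with one problem's language, statement, and starter code."""
    return PROMPT_TEMPLATE.format(
        language=language,
        problem_content=problem_content.strip(),
        problem_header=problem_header.strip(),
    )
```

Here `problem_content` would be the scraped problem statement and `problem_header` the starter code that Leetcode shows in its editor for the chosen language.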

Model

It must be noted that this project uses the GPT-3-DaVinci-003 model rather than the actual ChatGPT model. ChatGPT does not offer an API, and scraping ChatGPT results is against the terms of service because it would rate limit actual users trying to access ChatGPT. That being said, GPT-3-DaVinci-003 is reported to be much more powerful than ChatGPT. According to this site, "GPT-3 protocol is much larger than ChatGPT. The former has a whopping 175 billion parameters making it one of the largest and most powerful AI language processing models to date, while the latter has 20 billion parameters. This is because ChatGPT has been specifically designed to generate human-like text in the context of chatbot conversations." If that is accurate, the model being used, GPT-3-DaVinci-003, is likely even better suited to code generation than ChatGPT. Throughout this project, "ChatGPT" is sometimes used as shorthand for GPT-3-DaVinci-003.

General Algorithm

  1. Get all the non-premium problems from Leetcode (to comply with the terms of service, no premium Leetcode problems were scraped).
    1. Scrape all the URLs of the available problems (51 pages with ~25 problems per page).
    2. Iterate over the table of problems at each URL and save the problem details to the database.
  2. Create the prompt for each problem and save it to the database.
  3. Get ChatGPT responses to the prompts
    1. Feed the prompt into the OpenAI model and save the responses to the database.
  4. Submit the responses to Leetcode
    1. Navigate to the problem page.
    2. Select the desired language (MySQL for database problems, Bash for shell problems, and python3 for everything else).
    3. Enter the response into the problem field.
    4. Submit the response on Leetcode.
    5. Record in the database whether the response succeeded or failed, along with any performance metrics (Runtime beats, Memory beats, Error type, Testcases passed).
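
The language choice and result recording in step 4 could be sketched roughly as follows. This is only an illustration under stated assumptions: the category labels (`"database"`, `"shell"`), the `submissions` table layout, and all function and field names are hypothetical, and SQLite stands in for whatever database the project actually uses.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional

# Assumed category labels; the mapping itself follows the rule in step 4.2.
LANGUAGE_BY_CATEGORY = {"database": "MySQL", "shell": "Bash"}
DEFAULT_LANGUAGE = "python3"


def select_language(category: str) -> str:
    """MySQL for database problems, Bash for shell problems, python3 otherwise."""
    return LANGUAGE_BY_CATEGORY.get(category.strip().lower(), DEFAULT_LANGUAGE)


@dataclass
class SubmissionResult:
    """One submission outcome with the metrics listed in step 4.5."""
    problem_id: int
    language: str
    accepted: bool
    runtime_beats: Optional[float] = None    # percentage reported by Leetcode
    memory_beats: Optional[float] = None     # percentage reported by Leetcode
    error_type: Optional[str] = None         # only set for failed submissions
    testcases_passed: Optional[str] = None   # stored as text, format assumed


def create_table(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) results table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS submissions ("
        "problem_id INTEGER, language TEXT, accepted INTEGER, "
        "runtime_beats REAL, memory_beats REAL, "
        "error_type TEXT, testcases_passed TEXT)"
    )


def save_result(conn: sqlite3.Connection, result: SubmissionResult) -> None:
    """Insert one submission outcome as a row."""
    conn.execute(
        "INSERT INTO submissions VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            result.problem_id,
            result.language,
            int(result.accepted),
            result.runtime_beats,
            result.memory_beats,
            result.error_type,
            result.testcases_passed,
        ),
    )
    conn.commit()
```

Keeping the metrics nullable reflects that an accepted submission has no error type, while a failed one has no runtime or memory percentile.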

Manual Clean Ups

After scanning the data, a few problems turned out to have errors that did not seem fair to blame on ChatGPT, so those responses were fixed and rerun.

  • The Python keyword `pass` would be placed before the solution, causing the true solution to be skipped (~10 occurrences)
  • The solution would be preceded or followed by an ill-formed docstring delimiter, a rogue ellipsis, or a code fence [..., ''', """, ```] (~20 occurrences)
  • Due to the way the prompt was formed, the response would sometimes repeat the header, so the class would be defined twice (~100 occurrences)
  • If the problem asked a follow-up question (e.g. "how could this be improved?"), ChatGPT would add a paragraph of English text at the end of the solution, which would result in an error (~10 occurrences)

Please keep in mind that some of these issues may still be present, along with others that were not detected.
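
The clean-ups above were done manually. As a rough sketch of how two of the patterns (stray delimiters and a repeated header) might be handled automatically, assuming hypothetical helper names and the simple heuristics described in the comments:

```python
# Lines that consist only of these tokens are treated as stray delimiters.
STRAY_TOKENS = ("...", "'''", '"""')


def _is_stray(line: str) -> bool:
    """True for blank lines, lone ellipses, lone docstring quotes, and code fences."""
    stripped = line.strip()
    return stripped == "" or stripped in STRAY_TOKENS or stripped.startswith("```")


def strip_stray_delimiters(code: str) -> str:
    """Remove stray delimiter lines from the start and end of a response only."""
    lines = code.splitlines()
    while lines and _is_stray(lines[0]):
        lines.pop(0)
    while lines and _is_stray(lines[-1]):
        lines.pop()
    return "\n".join(lines)


def drop_repeated_header(code: str, header: str) -> str:
    """If the header's first line occurs more than once, keep only the last copy.

    Heuristic assumption: everything before the last occurrence is a duplicate
    of the header, so it is discarded.
    """
    header_lines = header.strip().splitlines()
    if not header_lines:
        return code
    first_line = header_lines[0]
    if code.count(first_line) > 1:
        return code[code.rfind(first_line):]
    return code


def clean_response(code: str, header: str) -> str:
    """Apply both clean-ups in sequence."""
    return drop_repeated_header(strip_stray_delimiters(code), header)
```

The misplaced `pass` and the trailing English paragraphs are left out of this sketch because they are harder to detect reliably without parsing the code.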

Findings

  • The responses were far from perfect, but were, overall, very coherent and impressive.
  • As with human solvers, the responses did better on easier problems than on more difficult ones.
  • Quite often a response would contain a pattern that caused an error even though the solution was otherwise valid (see Manual Clean Ups above).
  • GPT-3-DaVinci-003 actually performed better than Code-DaVinci-002, which is OpenAI's model aimed specifically at generating code (keep in mind this observation is based on manually testing only ~10 problems).
  • GPT performed worse on database and Bash problems than was hypothesized.
  • GPT seemed to perform better with some algorithms than with others, although the reason has not been investigated.
  • When GPT failed, it seemed to favor certain error types over others.
  • GPT responded to the entire set of non-premium Leetcode problems (1984 problems) in under 20 minutes, not including submission time and OpenAI API response time, both of which were constrained by rate limiting.
  • When GPT got an answer correct, it performed particularly well in terms of runtime and memory, although both of these metrics are unreliable as reported by Leetcode (e.g. refresh your submission and you will get different results).