ChatGPT and Data Dive

Generating data manually can be a chore, especially when a large dataset is needed. Our Data Expert Geoff Sanderson looks into whether the famous tool ChatGPT is capable of generating useful data.

Recently I embarked on building a simple Synapse pipeline, delivering blob data to an Azure SQL Database, with a view to proving out other related features of this Azure service. I had limited time and wanted to create a CSV file with random but structured data. I then decided I wanted to find out whether I could automate the process I was about to embark on, in a repeatable fashion.

 

So, I thought, can I utilise an AI-based service, without any setup overhead, to deliver what I am asking for?

Ideally, I want an on-demand AI service to which I can give a brief natural-language description of the data structure and have it deliver a usable file.

I then want to take that file, push it into my pipeline and ultimately deliver it to my Azure SQL database.

 

I asked ChatGPT the following:

 

From that, ChatGPT generated a Python routine to create the random data.

 

For me, this is impressive, but when set against what I am trying to deliver, it highlights the following:

 

Pros:

  • Usable code.
  • Take the code, run it through a Python interpreter, and we have a generated file.

 

Cons:

  • It's not giving me a CSV file, it's giving me a way to create a CSV file.
  • I need a Python interpreter. An assumption has been made that I know how to use Python and that I have a Python environment available to run the script.
  • Although ChatGPT interpreted the structure from a data type point of view (which is impressive in and of itself), the random string values the code generated would not be realistic.
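
To illustrate that last point, a routine of this kind might look like the minimal sketch below. It is not the code ChatGPT returned, and the column names (id, job_title, salary) and the sample job titles are assumptions made purely for illustration. It contrasts purely random strings with values picked from a list of plausible ones, and takes an optional seed so the same file can be regenerated on demand.

```python
import csv
import io
import random
import string

# Hypothetical columns and sample values, chosen purely for illustration.
FIELDNAMES = ["id", "job_title", "salary"]
JOB_TITLES = ["Data Engineer", "Business Analyst", "Project Manager", "Software Developer"]


def random_string(rng, length=8):
    """Return random lowercase letters: the right data type, but not realistic."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def generate_rows(count, seed=None, realistic=True):
    """Build `count` rows; passing the same seed reproduces the same rows."""
    rng = random.Random(seed)
    rows = []
    for row_id in range(1, count + 1):
        title = rng.choice(JOB_TITLES) if realistic else random_string(rng)
        rows.append({"id": row_id, "job_title": title, "salary": rng.randint(20000, 90000)})
    return rows


def to_csv(rows):
    """Serialise the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(generate_rows(5, seed=42)))
```

Calling generate_rows with realistic=False gives the random-string behaviour described in the last bullet, while the default draws job titles from the list.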

 

To be totally fair to ChatGPT, all of the above is my fault, as I was not specific enough in the way I asked for results…

 

Before taking the generated Python code and running it, I ask ChatGPT the following, shifting the emphasis from generating to displaying content:

 

 

With the following returned:

 

This is great: I can now hit the Copy code button, paste it into a text editor, save the file and pass it into my data pipeline. Come to think of it, that may be just as fiddly as running the generated Python code, but at least it's given me my realistic job titles! 😊

 

At this point, I'm thinking I now have a set of repeatable 'search' statements to pass to ChatGPT to generate further CSV test files… or do I?

I open up a separate ChatGPT session and ask the same last question, expecting a further lump of CSV to be displayed for me…

…but alas, it generates another Python script.

My next thought is that ChatGPT needs the whole conversation to be repeated, not just the last statement.

I try this… but unfortunately it still doesn't give the original CSV-based results and instead offers up further Python scripts. The scripts are great and look like they will work, but they are not the repeatable results I was after.

 

Key takeaways for me here are:

  • I was able to get ChatGPT to cover my use case of generating a test data file, but…
  • As with any search mechanism, it's about reducing the data/information being returned by better identifying what is needed.
  • From an automation and integration point of view, its nature means it may not be consistent when re-processing the same search queries. In many use cases this may not be an issue, but in many others it will be.
  • It may also be too inconsistent to use in a real-world live orchestration where specific, consistent, structured results are required. More verbose, rigid statements may well address this, but further investigation would be needed.
  • For my purposes, the ChatGPT approach got me what I needed more quickly than manually writing a routine myself (not necessarily in Python…). It took away the brain cells needed to come up with the logic for the code, but those same brain cells were instead required to nudge it in the desired direction.
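
One way to cope with the inconsistency described above would be to check each response before it goes anywhere near a pipeline. The sketch below is an assumption about how such a guard might look, not something tested in this exercise. The function name looks_like_expected_csv and the expected header are hypothetical. It simply checks that the returned text parses as CSV with the header and row widths that were asked for, which would reject a response that turns out to be a Python script.

```python
import csv
import io

# Hypothetical header; in practice this would match the structure requested.
EXPECTED_HEADER = ["id", "job_title", "salary"]


def looks_like_expected_csv(text, expected_header=EXPECTED_HEADER):
    """Return True if text parses as CSV with the expected header,
    at least one data row, and the same number of fields on every row."""
    rows = [row for row in csv.reader(io.StringIO(text.strip())) if row]
    if len(rows) < 2:
        # Nothing beyond a header (or nothing at all) is not usable test data.
        return False
    header = [cell.strip() for cell in rows[0]]
    if header != list(expected_header):
        return False
    return all(len(row) == len(expected_header) for row in rows[1:])


if __name__ == "__main__":
    csv_reply = "id,job_title,salary\n1,Data Engineer,45000\n"
    script_reply = "import csv\nimport random\n"
    print(looks_like_expected_csv(csv_reply))
    print(looks_like_expected_csv(script_reply))
```

A check like this only confirms the shape of the response. It says nothing about whether the values themselves are sensible.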


About the author

Geoff Sanderson

An experienced Data Specialist with over a decade of experience in data and more than 20 years in the technology industry. 80s music enthusiast and Yamaha QY70 mega fan.