AI Data Factories: Power Your DS Project with Synthetic Data

Illustration of synthetic data generation

If you’ve worked in machine learning for even a minute, you’ve likely run into this problem: good data is hard to find. Whether it’s privacy limits, small samples in niche domains, or simply the tedium of labelling, collecting data can be a chokepoint. Recently I’ve been experimenting with something that changes the process: using LLMs as synthetic data generators. Yes, the very same models that power chatbots and code completion can be used to build the datasets we train on.

Why bother with synthetic data at all?

Here’s the catch. Real-world data rarely comes cheap, and even when you do find it, it could be:

Full of privacy concerns
Distorted or unbalanced
Inadequate to train a deep model
Costly or labour-intensive to annotate

That’s where synthetic data comes in. Rather than sitting around twiddling our thumbs waiting for neater, cleaner, labelled data, what if we could simply create it ourselves?

LLMs Are Not Just About Chatting

Most people view LLMs like GPT-4 or Claude as intelligent assistants or programming tools. Deep down, though, what they do best is understand and mirror patterns in language, from email spam to customer support requests. That is exactly the skill needed to produce realistic text data.

Some Cool Use Cases

I have tried a few examples myself. One suggestion: keep the prompt as detailed as possible. The prompt is always the key; the less the LLM has to guess, the better the synthetic data. To learn more about how to structure your prompt, read our blog.

Email Spam Filtering

You can direct an LLM to generate text data for email spam filtering:

I am working on an E-mail classification system using NLP. So, I need text E-mail data which should be realistic for me to learn with all the real-world problems for data processing. Assume you are a data scientist and give me the best possible data:

1. Generate 50 spam and 50 non-spam email samples for a binary classification problem.
2. Make sure the data contains diverse context and theme, so I can have good experience capturing the meaning of the text.
3. Make sure the E-mail has a minimum length of 20 words.
4. Give me the output in json format of the data as shown below without any other text

Format:
--------
{{'email': email content, 'label': spam}, {'email': email content, 'label': spam}}
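
Since the quality of the data depends on how specific the prompt is, it can help to build the prompt from a few parameters so you can vary the sample counts and minimum length between runs. The sketch below is one possible way to do that; the function name `build_spam_prompt` and its parameters are hypothetical, and the format example uses a JSON array (an assumption on my part, chosen because it is easy to parse later).

```python
def build_spam_prompt(n_spam: int = 50, n_ham: int = 50, min_words: int = 20) -> str:
    """Assemble a detailed data-generation prompt from a few parameters.

    Hypothetical helper: the wording mirrors the prompt in this post,
    but the names and defaults are illustrative assumptions.
    """
    requirements = [
        f"Generate {n_spam} spam and {n_ham} non-spam email samples "
        "for a binary classification problem.",
        "Make sure the data contains diverse contexts and themes.",
        f"Make sure each email has a minimum length of {min_words} words.",
        "Give me the output as a JSON array, without any other text.",
    ]
    # Number the requirements so the model can follow them one by one.
    numbered = "\n".join(f"{i}. {req}" for i, req in enumerate(requirements, start=1))
    # Assumed output format: a JSON array of objects.
    example = (
        '[{"email": "<email content>", "label": "spam"}, '
        '{"email": "<email content>", "label": "non-spam"}]'
    )
    return (
        "I am working on an email classification system using NLP. "
        "Assume you are a data scientist and give me the best possible data:\n\n"
        f"{numbered}\n\nFormat:\n{example}"
    )


prompt = build_spam_prompt(n_spam=10, n_ham=10)
print(prompt)
```

Starting with a small count like this is a cheap way to check the output format before asking for the full dataset.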

Output: Truncated, only a sample is shown. Run the prompt yourself and see the wonder of AI!

[
{"email": "Dear User, Congratulations! You have won a $1,000 Amazon gift card. To claim, click the link below and enter your details immediately. Limited time offer!", "label": "spam"},
{"email": "Subject: Your Appointment Confirmation\n\nDear [Patient Name],\n\nThis email confirms your appointment with Dr. [Doctor's Name] on April 29th at 3:00 PM. Please arrive 15 minutes prior.", "label": "non-spam"},
{"email": "Hi [Landlord's Name],\n\nI am writing to inform you that I will be renewing my lease for another year, starting from May 1st. Please let me know the next steps.", "label": "non-spam"},
{"email": "Hi [Travel Agent's Name],\n\nI'd like to inquire about flight options to Bali for two people in July. Could you please send me some available itineraries and prices?", "label": "non-spam"},
{"email": "[...] will be paid $100 per hour. Apply now and start your exciting new career!", "label": "spam"}
]
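
Before training on a reply like this, it is worth checking that the model actually followed the instructions. Here is a minimal sketch using only the standard library: it parses the JSON and flags rows with an unexpected label or fewer words than requested. The function name `validate_emails` is hypothetical, and apart from the first email (taken from the sample above) the rows in `sample_reply` are made up purely to trigger each check.

```python
import json
from collections import Counter

ALLOWED_LABELS = {"spam", "non-spam"}


def validate_emails(raw: str, min_words: int = 20):
    """Parse an LLM's JSON reply and return (clean_rows, problems)."""
    rows = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    clean, problems = [], []
    for i, row in enumerate(rows):
        if not isinstance(row, dict):
            problems.append((i, "not a JSON object"))
            continue
        text = str(row.get("email", ""))
        label = row.get("label")
        if label not in ALLOWED_LABELS:
            problems.append((i, f"unexpected label: {label!r}"))
        elif len(text.split()) < min_words:
            problems.append((i, "email shorter than requested"))
        else:
            clean.append(row)
    return clean, problems


# Illustrative input: row 0 is from the sample above, rows 1 and 2 are invented.
sample_reply = json.dumps([
    {"email": "Dear User, Congratulations! You have won a $1,000 Amazon gift card. "
              "To claim, click the link below and enter your details immediately. "
              "Limited time offer!", "label": "spam"},
    {"email": "Buy now!", "label": "spam"},
    {"email": "See you at lunch tomorrow?", "label": "ham"},
])

clean, problems = validate_emails(sample_reply)
print("kept:", Counter(row["label"] for row in clean))
for index, reason in problems:
    print(f"row {index}: {reason}")
```

The label counts also tell you quickly whether the classes came back balanced, which was part of the request.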

Sales Dataset

You can use an LLM to generate artificial sales data to try out on projects like pricing analytics, demand forecasting, etc.

I am working on a demand forecasting problem in the retail sector. So, I need you to provide me a CSV dataset with the following columns:

1. Store ID
2. Item ID
3. Date
4. Units sold in a day
5. Sales in Day
6. Price of Item
7. Discount
8. Is holiday?
9. Cost

Assume you are a data analyst, and provide me with at least 100 stores having at least 20 items in each store, with sales varying across seasons to have some seasonality. The data should be at least 2 years, with at least 2 price changes for each item, so I can learn the demand vs price relation. Have cannibalisation also in the sales, when a price change happens to an item, it should impact other items' sales to simulate a realistic scenario.

Generate me a csv file.

Output: It generated a tidy CSV file; find the file here. When I looked at the output, I was amazed to see that the AI had managed to follow realistic patterns in the data. In the image below, we can see that whenever the discount increases or the price decreases, sales increase. This is really wonderful: we can definitely use AI to generate data for practising modelling!

Visualising the AI-generated data to see the intricate patterns the model followed while generating it
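
A plot is one way to spot the pattern; a quick numeric check is another. The sketch below computes the Pearson correlation between discount and units sold using only the standard library. The column names `discount` and `units_sold` are assumptions (use whatever headers your generated file has), and `toy_csv` is a tiny made-up table for illustration, not the file generated above.

```python
import csv
import io
import math


def pearson(xs, ys):
    """Pearson correlation; assumes both columns vary (non-zero variance)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def discount_units_correlation(csv_text, discount_col="discount", units_col="units_sold"):
    """Read CSV text and correlate the discount column with units sold."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    discounts = [float(r[discount_col]) for r in rows]
    units = [float(r[units_col]) for r in rows]
    return pearson(discounts, units)


# Made-up rows for a single store and item, for illustration only.
toy_csv = """store_id,item_id,date,units_sold,price,discount
1,101,2023-01-01,10,5.0,0.00
1,101,2023-01-02,12,5.0,0.05
1,101,2023-01-03,18,5.0,0.20
1,101,2023-01-04,11,5.0,0.00
1,101,2023-01-05,16,5.0,0.15
"""

r = discount_units_correlation(toy_csv)
print(f"discount vs units sold: r = {r:.2f}")
```

A clearly positive value is consistent with the pattern described above. On a real file you would probably want to compute this per store and item, since mixing items with very different sales volumes can hide the effect.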

Why are you still waiting? If you have an idea in mind, generate the data with AI and start implementing it. Once you have the data, you can leverage the power of Gemini AI to kickstart your DS project with a single prompt.

Conclusion

If you’ve never tried using LLMs to generate data, you should. It’s fun, useful, and unsettlingly potent. Begin small: create a small dataset to get used to prompting, then scale up. It is especially useful for people learning through data science projects. Using starter datasets like Titanic from Kaggle holds little value; generate your own dataset and own the project.

Be specific with your prompt; a good prompt is the key to good data. Happy developing!!

AI ML Universe
