Import Kaggle Datasets | CSV, SQLite
In this guide, you will learn how to import a dataset from Kaggle (such as CSV or SQLite files) and connect it to PromptQL so you can query it using natural language.
Check out the GitHub repo at https://github.com/hasura/kaggle-dataset-promptql.
Pre-requisites
You’ll need the Hasura DDN CLI (authenticated via a Hasura account) and Docker installed on your local machine. Links to these steps are below:
- Sign up for a Hasura account (if you haven’t)
- Hasura DDN CLI installed and authenticated
- Docker installed on your local machine
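As a quick, optional sanity check (assuming a recent DDN CLI), you can confirm both tools are available and authenticate the CLI:
docker --version
ddn auth login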
Import your Kaggle Dataset
Step 1: Clone the project
git clone git@github.com:hasura/kaggle-dataset-promptql.git
cd kaggle-dataset-promptql
Step 2: Get the Kaggle Credentials and Dataset Identifier
To get your Kaggle username and API key, go to the ‘Account’ tab of your Kaggle user profile and select ‘Create New Token’. This will trigger the download of kaggle.json, a file containing your API credentials.
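For reference, kaggle.json is a small JSON file containing exactly these two credentials; its contents look like this (key shown here as a placeholder):
{"username": "<your_username>", "key": "xxxxxxxxxxxxxxxx"}
You’ll copy these two values into the connector’s .env file in Step 3.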
To import a dataset, you also need to configure its Kaggle identifier.
For example, for this IMDB dataset - https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset - the Kaggle identifier would be:
rounakbanik/the-movies-dataset
Replace this with the identifier of your choice. This should work for any dataset on Kaggle containing “.csv” or “.sqlite” files.
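Optionally, if you have the official Kaggle CLI installed and your kaggle.json placed at ~/.kaggle/kaggle.json, you can verify an identifier and list the files in the dataset before importing (this check isn’t required by the connector):
pip install kaggle
kaggle datasets files rounakbanik/the-movies-dataset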
Step 3: Configure .env for kaggle
Head to the app/connector/kaggle directory to configure the dataset:
cd app/connector/kaggle
cp .env.sample .env
Modify the value of the KAGGLE_IDENTIFIER environment variable, and add the values for the KAGGLE_USERNAME and KAGGLE_KEY environment variables from the kaggle.json token you downloaded in the previous step.
For example:
KAGGLE_USERNAME="<your_username>"
KAGGLE_KEY="xxxxxxxxxxxxxxxx"
KAGGLE_IDENTIFIER="rounakbanik/the-movies-dataset"
The IMDB example mentioned in the sample env is available as a starting dataset, but feel free to configure whichever dataset you’d like.
Step 4: Introspect the Kaggle Connector
ddn connector introspect kaggle --log-level=DEBUG
Note: Depending on the size of the dataset, it may take some time to fully import the data. The schema is initialized quickly and the data import happens in the background, so you can continue with the steps below in a different terminal window.
The command above runs with DEBUG logging to make it easier to catch errors caused by invalid files.
Step 5: Add Models
Based on the imported dataset, a SQL schema is generated. Let’s track all the models to get started quickly.
ddn model add kaggle "*"
Build your PromptQL app
Now, let’s set up the Hasura DDN project with PromptQL to start exploring the data in natural language!
- Set up the Hasura DDN project already scaffolded in the repo:
In the root directory of the repo, run the following commands:
ddn supergraph build local
ddn project init --with-promptql
- Start the DDN project
Let’s start the DDN project by executing the following command:
ddn run docker-start
- Open the local DDN Console to start exploring:
ddn console --local
This should open your browser (or print a URL) displaying the Hasura Console. It’ll typically be something like: https://console.hasura.io/local?engine=localhost:3280&promptql=localhost:3282
Ask questions about your dataset
The app now has metadata about the dataset you just imported. You should be able to ask domain-specific questions and play around with the data.
Here’s a sample of what you can ask to get started.
- Hi, what can you do?
Depending on the dataset schema, PromptQL will tell you what it can answer and you can go from there.
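For example, if you imported the movies dataset from Step 2, you might follow up with prompts like the ones below (suggestions only; the exact questions that work depend on the columns that were imported):
- What are the 10 highest-rated movies?
- Which genres have the most titles?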
Clean up and restart your app
If you want to reset the data and start from scratch, stop the ddn run docker-start command wherever it is running, then execute the following in the root directory of the repo:
docker compose down -v && ddn run docker-start