Import Kaggle Datasets | CSV, SQLite
In this guide, you will learn how to import a dataset from Kaggle (such as CSV or SQLite files) and connect it to PromptQL so you can query it using natural language.
Check out the GitHub repo at https://github.com/hasura/kaggle-dataset-promptql.
Pre-requisites
You’ll need the Hasura DDN CLI (authenticated via a Hasura account) and Docker installed on your local machine. Links to these steps are below:
- Sign up for a Hasura account (if you haven’t)
- Hasura DDN CLI installed and authenticated
- Docker installed on your local machine
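As a quick, optional sanity check (assuming a recent DDN CLI), you can confirm both tools are available and authenticate the CLI:
docker --version
ddn auth login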
Import your Kaggle Dataset
Step 1: Clone the project
git clone git@github.com:hasura/kaggle-dataset-promptql.git
cd kaggle-dataset-promptql
Step 2: Get the Kaggle Credentials and Dataset Identifier
To get your Kaggle username and API key, go to the ‘Account’ tab of your Kaggle user profile and select ‘Create New Token’. This will trigger the download of kaggle.json, a file containing your API credentials.
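For reference, kaggle.json is a small JSON file containing exactly these two credentials; its contents look like this (key shown here as a placeholder):
{"username": "<your_username>", "key": "xxxxxxxxxxxxxxxx"}
You’ll copy these two values into the connector’s .env file in Step 3.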
To import a dataset, you also need to configure its Kaggle identifier.
For example, for this IMDB dataset - https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset - the Kaggle identifier would be:
rounakbanik/the-movies-dataset
Replace this with the identifier of your choice. This should work for any dataset on Kaggle containing “.csv” or “.sqlite” files.
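Optionally, if you have the official Kaggle CLI installed and your kaggle.json placed at ~/.kaggle/kaggle.json, you can verify an identifier and list the files in the dataset before importing (this check isn’t required by the connector):
pip install kaggle
kaggle datasets files rounakbanik/the-movies-dataset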
Step 3: Configure .env for kaggle
Head to the app/connector/kaggle directory to configure the dataset:
cd app/connector/kaggle
cp .env.sample .env
Modify the value of the KAGGLE_IDENTIFIER environment variable, and add the values for the KAGGLE_USERNAME and KAGGLE_KEY environment variables from the kaggle.json token you downloaded in the previous step.
For example:
KAGGLE_USERNAME="<your_username>"
KAGGLE_KEY="xxxxxxxxxxxxxxxx"
KAGGLE_IDENTIFIER="rounakbanik/the-movies-dataset"
The IMDB example mentioned in the sample env is available as a starting dataset, but feel free to configure whichever dataset you’d like.
Step 4: Introspect the Kaggle Connector
ddn connector introspect kaggle --log-level=DEBUG
Note: Depending on the size of the dataset, it may take some time to fully import the data. The schema is initialized quickly and the data import happens in the background, so you can continue with the steps below in a different terminal window.
The command above runs with DEBUG logging to make it easier to catch errors caused by invalid files.
Step 5: Add Models
Based on the imported dataset, a SQL schema is generated. Let’s track all the models to get started quickly.
ddn model add kaggle "*"
Build your PromptQL app
Now, let’s set up the Hasura DDN project with PromptQL to start exploring the data in natural language!
- Set up the Hasura DDN project already scaffolded in the repo:
In the root directory of the repo, run the following commands:
ddn supergraph build local
ddn project init --with-promptql
- Start the DDN project
Let’s start the DDN project by executing the following command:
ddn run docker-start
- Open the local DDN Console to start exploring:
ddn console --local
This should open your browser (or print a URL) displaying the Hasura Console. It’ll typically be something like: https://console.hasura.io/local?engine=localhost:3280&promptql=localhost:3282
Ask questions about your dataset
The app now has metadata about the dataset you just imported. You should be able to ask domain-specific questions and play around with the data.
Here’s a sample of what you can ask to get started.
- Hi, what can you do?
Depending on the dataset schema, PromptQL will tell you what it can answer and you can go from there.
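For example, if you imported the movies dataset from Step 2, you might follow up with prompts like the ones below (suggestions only; the exact questions that work depend on the columns that were imported):
- What are the 10 highest-rated movies?
- Which genres have the most titles?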
Clean up and restart your app
If you want to reset the data and start from scratch, stop the ddn run docker-start command wherever it is running, then execute the following in the root directory of the repo:
docker compose down -v && ddn run docker-start