Import Huggingface Datasets | CSV, Parquet, SQLite
In this guide, you’ll learn how to import any dataset from Huggingface (such as CSV, Parquet, or SQLite files) and connect it to PromptQL so you can query it using natural language.
Check out the GitHub repo.
Prerequisites
You’ll need the Hasura CLI (authenticated via a Hasura account) and Docker installed on your local machine. Links to these steps are below:
- Sign up for a Hasura account (if you haven’t)
- Hasura CLI installed and authenticated
- Docker installed on your local machine
Import your Huggingface Dataset
Step 1: Clone the project
git clone [email protected]:hasura/huggingface-dataset-promptql.git
cd huggingface-dataset-promptql
Step 2: Get the Huggingface Dataset ID handy
To import the dataset, you need to configure the Dataset ID with the path to the file(s).
For example, here’s a top-1000 IMDB dataset: https://huggingface.co/datasets/drossi/EDA_on_IMDB_Movies_Dataset. The dataset ID for this would be:
drossi/EDA_on_IMDB_Movies_Dataset/*.csv
Notice the use of a glob pattern to select all the CSV files. Replace this with your dataset of choice along with the path to its files. This should work for any .csv, .parquet, or .sqlite file in Huggingface.
Refer to this DuckDB blog for more examples of this format.
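If you have the DuckDB CLI installed, you can optionally sanity-check the dataset path before wiring it into the connector. This is a quick local check, not a step in this guide, and it assumes DuckDB’s hf:// protocol support:

duckdb -c "SELECT COUNT(*) FROM 'hf://datasets/drossi/EDA_on_IMDB_Movies_Dataset/*.csv';"

If the query returns a row count, the glob pattern resolves to real files.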
Step 3: Add Anthropic API key to .env
Set up your .env file with your Anthropic (or OpenAI) API key and GitHub API token.
cp .env.example .env
Get an API key from https://console.anthropic.com/settings/keys
# .env
...
...
ANTHROPIC_API_KEY=<your-anthropic-api-key>
To use an OpenAI key instead, you’ll have to set OPENAI_API_KEY in your .env file and change the LLM environment variable to openai in the compose.yaml file.
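For example, a minimal OpenAI variant of the .env might look like this (variable name as described above; check .env.example in the repo for the exact keys):

# .env
...
OPENAI_API_KEY=<your-openai-api-key>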
Step 4: Configure .env for huggingface
Head to the app/connector/huggingface directory to configure the dataset.
cd app/connector/huggingface
cp .env.sample .env
Modify the value of the HUGGINGFACE_DATASET env var. It is of the format: "user/dataset/file-path".
For example:
HUGGINGFACE_DATASET="drossi/EDA_on_IMDB_Movies_Dataset/*.csv"
The IMDB example from the sample env is provided as a dataset to get started with, but feel free to configure any dataset you’d like.
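For instance, a Parquet dataset follows the same pattern (the user and dataset names below are placeholders for illustration):

HUGGINGFACE_DATASET="some-user/some-dataset/*.parquet"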
Step 5: Introspect the Huggingface Connector
ddn connector introspect huggingface --log-level=DEBUG
Note: Depending on how big the dataset is, it may take some time to fully import the data. The schema is initialized quickly and the data import happens in the background, so you can proceed with the steps below.
The above command runs in DEBUG mode to make it easier to catch errors caused by invalid files.
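Once introspection completes, you can list what was discovered before tracking anything. Assuming a recent DDN CLI that supports the show-resources subcommand:

ddn connector show-resources huggingface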
Step 6: Add Models
Based on the imported dataset, a SQL schema is generated. Let’s track all the models to get started quickly.
ddn model add huggingface "*"
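If you’d rather track tables selectively, you can add them one at a time instead of using the glob. The model name below is hypothetical and depends on the schema generated from your dataset:

ddn model add huggingface imdb_movies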
Build your PromptQL app
Now, let’s set up the Hasura DDN project with PromptQL to start exploring the data in natural language!
- Set up the Hasura DDN project already scaffolded in the repo:
In the root directory of the repo, run the following commands:
ddn supergraph build local
ddn project init
- Start the DDN project by executing the following command:
ddn run docker-start
- Open the local DDN Console to start exploring:
ddn console --local
This should open your browser (or print a browser URL) to display the Hasura Console. It’ll typically be something like: https://console.hasura.io/local?engine=localhost:3280&promptql=localhost:3282
Ask questions about your dataset
The app will have metadata about the dataset you just imported. You should be able to ask domain-specific questions and play around with the data.
Here’s a sample of what you can ask to get started.
- Hi, what can you do?
Depending on the dataset schema, PromptQL will tell you what it can answer and you can go from there.
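For the IMDB sample, prompts along these lines are a reasonable starting point (illustrative only; what’s answerable depends on the columns in the files you imported):
- What are the top 10 highest-rated movies?
- Which director appears most often in this dataset?
- What’s the average rating by genre?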
Clean up and restart your app
If you want to reset the data and start from scratch, stop the ddn run docker-start command wherever it is running, then execute the following in the root directory of the repo (the -v flag removes the Docker volumes, so the dataset will be re-imported on the next start):
docker compose down -v && ddn run docker-start