[Home] Edit Article
Author Information
File Uploads
Edit Article XML Content
<document> <header> <issuecode /> <articlecode /> <zone /> <title>Using DuckDB for Data Analytics</title> <authors /> <copyright>CODE Magazine</copyright> <owner>CODE Magazine</owner> </header> <body> <p id="0">Very often, when people talk about data analytics, Pandas is the first library that comes to mind. And, of course, in more recent times, Polars is fast gaining traction as a much faster and more efficient DataFrame library. Despite the popularity of these libraries, SQL (Structured Query Language) remains the language that most developers are familiar with. If your data is stored in SQL-supported databases, SQL is one of the easiest and most natural ways for you to retrieve your data.</p> <p id="1">Recently, with Python becoming the lingua franca of data science, most attention has shifted to techniques on how to manipulate data in tabular format (most notably stored as a DataFrame object). However, the real lingua franca of data is actually SQL. And because most developers are familiar with SQL, isn’t it more convenient to manipulate data using SQL? This is where DuckDB comes in. </p> <p id="2">In this article, I’ll explain what DuckDB is, why it’s useful, and, more importantly, I want to walk you through examples to demonstrate how you can use DuckDB for your data analytics tasks.</p> <h2>What Is DuckDB?</h2> <p id="3">DuckDB is a Relational Database Management System (RDBMS) that supports the Structured Query Language (SQL). It’s designed to support Online Analytical Processing (OLAP), and is well suited for performing data analytics. DuckDB was created by Hannes Mühleisen and Mark Raasveldt, and the first version released in 2019.</p> <p id="4">Unlike traditional database systems where you need to install them, DuckDB requires no installation and works in-process. Because of this, DuckDB can run queries directly on Pandas data without needing to import or copy any data. Moreover, DuckDB uses vectorized data processing, which makes it very efficient—internally, the data is stored in columnar format rather than row-format (which is commonly used by databases systems such as MySQL and SQLite).</p> <p id="5">Think of DuckDB as the analytical execution engine that allows you to run SQL queries directly on existing datasets such as Pandas DataFrames, CSV files, and traditional databases such as MySQL and Postgres. You can focus on using SQL queries to extract the data you want. </p> <h3>Why DuckDB?</h3> <p id="6">Today, your dataset probably comes from one or more of the following sources:</p> <list type="bulleted"> <bulletedlist>CSV files</bulletedlist> <bulletedlist>Excel spreadsheets</bulletedlist> <bulletedlist>XML files</bulletedlist> <bulletedlist>JSON files</bulletedlist> <bulletedlist>Databases</bulletedlist> </list> <p id="7">If you want to use SQL to manipulate your data, the typical scenario is to first load the dataset (such as a CSV file) into a database server. You then load the data into a Pandas DataFrame through an application (such as Python) using SQL (see <b>Figure 1</b>).</p> <figure id="1" src="image1.png"> <b>Figure </b> <b>1</b> <b>:</b> Import your data into a database server before you can use SQL to query your data.</figure> <p id="8">DuckDB eliminates the need to load the dataset into a database server and allows you to directly load the dataset using SQL (see <b>Figure 2</b>).</p> <figure id="2" src="image2.png"> <b>Figure </b> <b>2</b> <b>:</b> You can use DuckDB to directly query your dataset using SQL.</figure> <p id="9">Once the DataFrame is loaded, you can use DuckDB and SQL to further slice and dice the DataFrame (see <b>Figure 3</b>).</p> <figure id="3" src="image3.png"> <b>Figure </b> <b>3</b> <b>:</b> You can also use DuckDB to query Pandas’ DataFrames using SQL.</figure> <h2>Data Analytics Using the Insurance Dataset</h2> <p id="10">The best way to understand DuckDB is through examples. For this, I’ll use the insurance dataset located at https://www.kaggle.com/datasets/teertha/ushealthinsurancedataset?resource=download. The insurance dataset contains 1338 rows of insured data, where the insurance charges are given against the following attributes of the insured: age, sex, BMI, number of children, smoker, and region. The attributes are a mix of numeric and categorical variables.</p> <h3>Loading the CSV File into Pandas DataFrames</h3> <p id="11">Let’s examine the insurance dataset by loading the insurance.csv file into a Pandas DataFrame:</p> <codesnippet> <font color="Blue">import</font>»pandas»<font color="Blue">as</font>»pd</codesnippet> <codesnippet>df_insurance»=»pd.read_csv(<font color="#A31515">"insurance.csv"</font>)</codesnippet> <codesnippet>display(df_insurance)</codesnippet> <p id="12"> <b>Figure </b> <b>4</b> shows how the DataFrame looks.</p> <figure id="4" src="image4.png"> <b>Figure </b> <b>4</b> <b>:</b> The insurance dataset loaded as a Pandas DataFrame</figure> <p id="13">The various columns in the DataFrame contain the various attributes of the insurance customer. In particular, the <b>charges</b> column indicates the individual medical costs billed by health insurance (payable by the insured).</p> <h3>Creating a DuckDB Database</h3> <p id="14">Before you can create a DuckDB database, you need to install the <b>duckdb</b> package using the following command:</p> <codesnippet>!pip»install»duckdb</codesnippet> <p id="15">To create a DuckDB database, use the <b>connect()</b> function from the <b>duckdb</b> package to create a connection (a <b>duckdb.DuckDBPyConnection</b> object) to a DuckDB database:</p> <codesnippet> <font color="Blue">import</font>»duckdb</codesnippet> <codesnippet>conn»=»duckdb.connect()</codesnippet> <p id="16">You can then register the DataFrame that you loaded earlier with the DuckDB database:</p> <codesnippet>conn.register(<font color="#A31515">"insurance"</font>,»df_insurance)</codesnippet> <p id="17">The <b>register()</b> function registers the specified DataFrame object (<b>df_insurance</b>) as a virtual table (<b>insurance</b>) within the DuckDB database.</p> <h3>Directly Loading the CSV into a Pandas DataFrame Using DuckDB</h3> <p id="18">Instead of loading a DataFrame manually and then registering with the DuckDB database, you can also use the connection object to read a CSV file directly, like this:</p> <codesnippet>conn»=»duckdb.connect()</codesnippet> <codesnippet>df»=»conn.execute(<font color="#A31515">'''</font></codesnippet> <codesnippet> <font color="#A31515">»»»»SELECT»</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»*»</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»FROM»read_csv_auto('insurance.csv')»»</font> </codesnippet> <codesnippet> <font color="#A31515">'''</font>).df()</codesnippet> <codesnippet>conn.register(<font color="#A31515">"insurance"</font>,»df)</codesnippet> <p id="19">In the above code snippet:</p> <list type="bulleted"> <bulletedlist>I used a SQL statement with the <b>read_csv_auto()</b> function to read a CSV file. The <b>execute()</b> function takes in this SQL statement and executes it.</bulletedlist> <bulletedlist>The <b>df()</b> function converts the result of the <b>execute()</b> function into a Pandas DataFrame object.</bulletedlist> <bulletedlist>Once the DataFrame is obtained, use the <b>register()</b> function to register it with the DuckDB database.</bulletedlist> </list> <p id="20">You can confirm the number of tables in the DuckDB by using the <b>SHOW TABLES </b>SQL statement:</p> <codesnippet>display(conn.execute(<font color="#A31515">'SHOW»TABLES'</font>).df())</codesnippet> <p id="21">The result is shown in <b>Figure 5.</b></p> <figure id="5" src="image5.png"> <b>Figure </b> <b>5:</b> The DuckDB has one associated table named insurance.</figure> <p id="22">To fetch rows from the insurance table, you can directly use a SQL statement using the <b>execute()</b> function:</p> <codesnippet>df»=»conn.execute(</codesnippet> <codesnippet>»»»»»»»»»<font color="#A31515">"SELECT»*»FROM»insurance"</font>).df()</codesnippet> <codesnippet>df</codesnippet> <p id="23">The output will be the same as shown earlier in <b>Figure 4</b>. </p> <h3>Performing Analytics using DuckDB</h3> <p id="24">Let’s now perform some useful data analytics using the <b>insurance</b> table in the DuckDB. First, I want to visualize the distribution of charges based on sex:</p> <codesnippet> <font color="Blue">import</font>»seaborn»<font color="Blue">as</font>»sns</codesnippet> <codesnippet> <font color="Blue">import</font>»matplotlib.pyplot»<font color="Blue">as</font>»plt</codesnippet> <codesnippet>f,»ax»=»plt.subplots(1,»1,»figsize=(5,»3))</codesnippet> <codesnippet>df»=»conn.execute(<font color="#A31515">'''</font></codesnippet> <codesnippet> <font color="#A31515">»»»»SELECT»</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»*»»»»»»»»</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»FROM»insurance</font> </codesnippet> <codesnippet> <font color="#A31515">'''</font>).df()</codesnippet> <codesnippet>ax»=»sns.barplot(x»=»<font color="#A31515">'region'</font>,»</codesnippet> <codesnippet>»»»»»»»»»»»»»»»»»y»=»<font color="#A31515">'charges'</font>,»</codesnippet> <codesnippet>»»»»»»»»»»»»»»»»»hue»=»<font color="#A31515">'sex'</font>,»</codesnippet> <codesnippet>»»»»»»»»»»»»»»»»»data»=»df,»</codesnippet> <codesnippet>»»»»»»»»»»»»»»»»»palette»=»<font color="#A31515">'cool'</font>,»</codesnippet> <codesnippet>»»»»»»»»»»»»»»»»»errorbar»=»<font color="#A31515">None</font>)</codesnippet> <p id="25">The above code snippet uses the Seaborn package to plot a bar plot that shows the various insurance charges for each gender in each of the four regions (see <b>Figure 6</b> for the output).</p> <figure id="6" src="image6.png"> <b>Figure </b> <b>6</b> <b>:</b> The distribution of charges for customers in each region based on sex</figure> <p id="26">Overall, men tend to have higher medical insurance cost for all regions, except the northwest region. </p> <p id="27">I’m also interested in visualizing the distribution of insurance charges for people based in the southwest region, based on the number of children a person has. I can do the following: </p> <codesnippet> <font color="Blue">import</font>»seaborn»<font color="Blue">as</font>»sns</codesnippet> <codesnippet> <font color="Blue">import</font>»matplotlib.pyplot»<font color="Blue">as</font>»plt</codesnippet> <codesnippet>f,»ax»=»plt.subplots(1,»1,»figsize=(7,»5))</codesnippet> <codesnippet>df»=»conn.execute(<font color="#A31515">'''</font></codesnippet> <codesnippet> <font color="#A31515">»»»»SELECT»</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»*»»»»»»»»</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»FROM»insurance</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»WHERE»region»=»'southwest'</font> </codesnippet> <codesnippet> <font color="#A31515">'''</font>).df()</codesnippet> <codesnippet>ax»=»sns.barplot(x»=»<font color="#A31515">'region'</font>,»</codesnippet> <codesnippet>»»»»»»»»»»»»»»»»»y»=»<font color="#A31515">'charges'</font>,</codesnippet> <codesnippet>»»»»»»»»»»»»»»»»»hue»=»<font color="#A31515">'children'</font>,»</codesnippet> <codesnippet>»»»»»»»»»»»»»»»»»data»=»df,</codesnippet> <codesnippet>»»»»»»»»»»»»»»»»»palette»=»<font color="#A31515">'Set1'</font>,»</codesnippet> <codesnippet>»»»»»»»»»»»»»»»»»errorbar»=»<font color="#A31515">None</font>)</codesnippet> <p id="28">As you can see from <b>Figure 7</b>, for the southwest region, the mean insurance charges for a person with two children is close to $17,500, which is the highest. The lowest mean insurance charges are for people with five children. </p> <figure id="7" src="image7.png"> <b>Figure </b> <b>7</b> <b>:</b> Charges for customers in the southwest region based on number of children</figure> <p id="29">Based on the value in the DataFrame <b>df</b> (which contains all the people in the southwest), you can plot linear models to examine the relationships between the various attributes (such as age, bmi, smoker, children) against the insurance charges:</p> <codesnippet>ax»=»sns.lmplot(x»=»<font color="#A31515">'age'</font>,»</codesnippet> <codesnippet>»»»»»»»»»»»»»»»»y»=»<font color="#A31515">'charges'</font>,»</codesnippet> <codesnippet>»»»»»»»»»»»»»»»»data»=»df,»</codesnippet> <codesnippet>»»»»»»»»»»»»»»»»hue»=»<font color="#A31515">'smoker'</font>,»</codesnippet> <codesnippet>»»»»»»»»»»»»»»»»palette»=»<font color="#A31515">'Set1'</font>)</codesnippet> <codesnippet>ax»=»sns.lmplot(x»=»<font color="#A31515">'bmi'</font>,»</codesnippet> <codesnippet>»»»»»»»»»»»»»»»»y»=»<font color="#A31515">'charges'</font>,»</codesnippet> <codesnippet>»»»»»»»»»»»»»»»»data»=»df,»</codesnippet> <codesnippet>»»»»»»»»»»»»»»»»hue»=»<font color="#A31515">'smoker'</font>,»</codesnippet> <codesnippet>»»»»»»»»»»»»»»»»palette»=»<font color="#A31515">'Set2'</font>)</codesnippet> <codesnippet>ax»=»sns.lmplot(x»=»<font color="#A31515">'children'</font>,»</codesnippet> <codesnippet>»»»»»»»»»»»»»»»»y»=»<font color="#A31515">'charges'</font>,»</codesnippet> <codesnippet>»»»»»»»»»»»»»»»»data»=»df,»</codesnippet> <codesnippet>»»»»»»»»»»»»»»»»hue»=»<font color="#A31515">'smoker'</font>,»</codesnippet> <codesnippet>»»»»»»»»»»»»»»»»palette»=»<font color="#A31515">'Set3'</font>)»</codesnippet> <p id="30"> <b>Figure </b> <b>8</b> shows how the insurance charges relates to age, bmi, and number of children, and whether they are smokers or not.</p> <figure id="8" src="image8.png"> <b>Figure </b> <b>8</b> <b>:</b> The relationships between the various attributes with respect to charges</figure> <p id="31">Generally, smokers have to pay much higher insurance charges compared to non-smokers. In the case of smokers, as the age or BMI increases, the amount of insurance charges increases proportionally. The number of children a customer has does not really affect the insurance charges. </p> <p id="32">Next, I want to visualize the mean insurance charges for all the people in the southeast and southwest regions, so I modify my SQL statement and plot a bar plot:</p> <codesnippet>df»=»conn.execute(<font color="#A31515">'''</font></codesnippet> <codesnippet> <font color="#A31515">»»»»SELECT»</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»region,</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»mean(charges)»as»charges»»»»»»»</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»FROM»insurance</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»WHERE»region»=»'southwest'»or»</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»»»region»=»'southeast'</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»GROUP»BY»region</font> </codesnippet> <codesnippet> <font color="#A31515">'''</font>).df()</codesnippet> <codesnippet>f,»ax»=»plt.subplots(<font>1</font>,»<font>1</font>,»figsize=(<font>5</font>,»<font>3</font>))</codesnippet> <codesnippet>ax»=»sns.barplot(x»=»<font color="#A31515">'region'</font>,»</codesnippet> <codesnippet>»»»»»»»»»»»»»»»»»y»=»<font color="#A31515">'charges'</font>,</codesnippet> <codesnippet>»»»»»»»»»»»»»»»»»data»=»df,</codesnippet> <codesnippet>»»»»»»»»»»»»»»»»»palette»=»<font color="#A31515">'Reds'</font>)</codesnippet> <p id="33"> <b>Figure </b> <b>9</b> shows the plot.</p> <figure id="9" src="image9.png"> <b>Figure </b> <b>9</b> <b>:</b> The charges for customers in the southwest and southeast regions</figure> <p id="34">Overall, the insurance charges are higher for people in the southeast region than those in the southwest region.</p> <p id="35">Next, I want to see the proportion of smokers for the entire dataset:</p> <codesnippet>palette_color»=»\</codesnippet> <codesnippet>»»»»seaborn.color_palette(<font color="#A31515">'pastel'</font>)</codesnippet> <codesnippet>plt.figure(figsize»=»(5,»5))</codesnippet> <codesnippet>df»=»conn.execute(<font color="#A31515">'''</font></codesnippet> <codesnippet> <font color="#A31515">»»»»SELECT»</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»count(*)»as»Count,</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»smoker</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»FROM»insurance</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»GROUP»BY»smoker</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»ORDER»BY»Count»DESC</font> </codesnippet> <codesnippet> <font color="#A31515">'''</font>).df()</codesnippet> <codesnippet>plt.pie(<font color="#A31515">'Count'</font>,»</codesnippet> <codesnippet>»»»»»»»»labels»=»<font color="#A31515">'smoker'</font>,»</codesnippet> <codesnippet>»»»»»»»»colors»=»palette_color,</codesnippet> <codesnippet>»»»»»»»»data»=»df,</codesnippet> <codesnippet>»»»»»»»»autopct=<font color="#A31515">'%.0f%%'</font>,)</codesnippet> <codesnippet>plt.legend(df[<font color="#A31515">'smoker'</font>],»loc=<font color="#A31515">"best"</font>)</codesnippet> <p id="36">The pie chart in <b>Figure 10</b> shows the proportion of smokers (20%) vs. non-smokers (80%).</p> <figure id="10" src="image10.png"> <b>Figure </b> <b>10</b> <b>:</b> The proportion of smokers vs. non-smokers</figure> <p id="37">It would be more useful to be able to display the numbers of smokers and non-smokers alongside their percentages. So let’s add a function named <b>fmt()</b> to customize the labels displayed on each pie on the pie chart:</p> <codesnippet> <font color="Green">#»</font> <font color="Green">sum»up»total»number»of»people</font> </codesnippet> <codesnippet> <font>total»=»df[</font> <font color="#A31515">'Count'</font> <font>].sum()</font> </codesnippet> <codesnippet> <font color="Blue">def</font> <font>»</font> <font color="#A31515">fmt</font> <font>(x):»»»»</font> </codesnippet> <codesnippet> <font color="Green">»»»»#»display»percentage»followed»by»number</font> </codesnippet> <codesnippet> <font>»»»»</font> <font color="Blue">return</font> <font>»</font> <font color="#A31515">'{:.2f}%\n({:.0f})'</font> <font>.format(</font> </codesnippet> <codesnippet> <font>»»»»»»»»x,»total»*»x»/»100)</font> </codesnippet> <codesnippet>palette_color»=»\</codesnippet> <codesnippet>»»»»seaborn.color_palette(<font color="#A31515">'pastel'</font>)</codesnippet> <codesnippet>plt.figure(figsize»=»(5,»5))</codesnippet> <codesnippet>plt.pie(<font color="#A31515">'Count'</font>,»</codesnippet> <codesnippet>»»»»»»»»labels»=»<font color="#A31515">'smoker'</font>,»</codesnippet> <codesnippet>»»»»»»»»colors»=»palette_color,</codesnippet> <codesnippet>»»»»»»»»data»=»df,</codesnippet> <codesnippet>»»»»»»»»<font>autopct»=»fmt</font>)»»<font color="Green">#»cal</font><font color="Green">l</font><font color="Green">»fmt()</font></codesnippet> <codesnippet>plt.legend(df[<font color="#A31515">'smoker'</font>],»loc=<font color="#A31515">"best"</font>)</codesnippet> <p id="38"> <b>Figure </b> <b>11</b> shows the updated pie chart.</p> <figure id="11" src="image11.png"> <b>Figure </b> <b>11</b> <b>:</b> The updated pie chart with the numbers on each pie</figure> <h2>JSON Ingestion</h2> <p id="39">One of the new features announced in the recent DuckDB release (version 0.7) is the support for JSON Ingestion. Basically, this means that you can now directly load JSON files into your DuckDB databases. In this section, I’ll show you how to use this new feature by showing you some examples.</p> <p id="40">For the first example, suppose you have a JSON file named <b>json0.json</b> with the following content:</p> <codesnippet>[</codesnippet> <codesnippet>»»{</codesnippet> <codesnippet>»»»»<font color="Red">"id"</font>:»1,</codesnippet> <codesnippet>»»»»<font color="Red">"name"</font>:»"Abigail",</codesnippet> <codesnippet>»»»»<font color="Red">"address"</font>:»"711-2880»Nulla»St.»</codesnippet> <codesnippet>»»»»»»»»»»»»»»»Mankato»Mississippi»96522"»</codesnippet> <codesnippet>»»},</codesnippet> <codesnippet>»»{</codesnippet> <codesnippet>»»»»<font color="Red">"id"</font>:»2,</codesnippet> <codesnippet>»»»»<font color="Red">"name"</font>:»"Diana",</codesnippet> <codesnippet>»»»»<font color="Red">"address"</font>:»"P.O.»Box»283»8562»</codesnippet> <codesnippet>»»»»»»Fusce»Rd.»Frederick»Nebraska»20620"»</codesnippet> <codesnippet>»»},</codesnippet> <codesnippet>»»{</codesnippet> <codesnippet>»»»»<font color="Red">"id"</font>:»3,</codesnippet> <codesnippet>»»»»<font color="Red">"name"</font>:»"Jason",</codesnippet> <codesnippet>»»»»<font color="Red">"address"</font>:»"606-3727»Ullamcorper.</codesnippet> <codesnippet>»»»»»»Street»Roseville»NH»11523"»</codesnippet> <codesnippet>»»}</codesnippet> <codesnippet>]</codesnippet> <p id="41">This JSON file contains an array of objects, with each object containing three key/value pairs. This is a very simple JSON file and the easiest way to read it into DuckDB is to use the <b>read_json_auto()</b> function:</p> <codesnippet> <font color="Blue">import</font>»duckdb</codesnippet> <codesnippet>conn»=»duckdb.connect()</codesnippet> <codesnippet>conn.execute(<font color="#A31515">'''</font></codesnippet> <codesnippet> <font color="#A31515">»»»»SELECT»</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»*»</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»FROM»read_json_auto('json0.json')</font> </codesnippet> <codesnippet> <font color="#A31515">'''</font>).df()</codesnippet> <p id="42">The output of the above code snippet is as shown in <b>Figure 12</b>.</p> <figure id="12" src="image12.png"> <b>Figure </b> <b>12</b> <b>:</b> The JSON file reads as a Pandas DataFrame</figure> <p id="43">If you only want the <b>id</b> and <b>name</b> fields, modify the SQL statement as follows:</p> <codesnippet> <font color="Blue">import</font>»duckdb</codesnippet> <codesnippet>conn»=»duckdb.connect()</codesnippet> <codesnippet>conn.execute(<font color="#A31515">'''</font></codesnippet> <codesnippet> <font color="#A31515">»»»»SELECT»</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»id,»name»</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»FROM»read_json_auto('json0.json')</font> </codesnippet> <codesnippet> <font color="#A31515">'''</font>).df()</codesnippet> <p id="44">You’ll now only get the <b>id</b> and <b>name</b> columns (see <b>Figure 13</b>)</p> <figure id="13" src="image13.png"> <b>Figure </b> <b>13</b> <b>:</b> The DataFrame now only contains the ID and name columns.</figure> <p id="45">The <b>read_json_auto()</b> function automatically reads the entire JSON file into DuckDB. If you only want to selectively read specific keys in the JSON file, use the <b>read_json()</b> function and specify the <b>json_format</b> and <b>columns</b> attributes as follows:</p> <codesnippet> <font color="Blue">import</font>»duckdb</codesnippet> <codesnippet>conn»=»duckdb.connect()</codesnippet> <codesnippet>conn.execute(<font color="#A31515">'''</font></codesnippet> <codesnippet> <font color="#A31515">»»»»SELECT»</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»*»</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»FROM»read_json('json0.json',</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»json_format»=»'array_of_records',</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»columns»=»{</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»id:'INTEGER',»</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»name:'STRING'</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»})</font> </codesnippet> <codesnippet> <font color="#A31515">'''</font>).df()</codesnippet> <p id="46">The output is the same as shown in <b>Figure 13</b>.</p> <p id="47">Suppose now the content of <b>json0.json</b> is slightly changed and now saved in another file named <b>json1.json</b>:</p> <codesnippet>[</codesnippet> <codesnippet>»»{</codesnippet> <codesnippet>»»»»<font color="Red">"id"</font>:»1,</codesnippet> <codesnippet>»»»»<font color="Red">"name"</font>:»<font color="#A31515">"Abigail"</font>,</codesnippet> <codesnippet>»»»»<font color="Red">"address"</font>:»{</codesnippet> <codesnippet>»»»»»»<font color="Red">"line1"</font>:<font color="#A31515">"711-2880»Nulla»St.»</font></codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»»»»»»»»</font> <font color="#A31515">Mankato"</font>,»»»»»»»»»»»»</codesnippet> <codesnippet>»»»»»»<font color="Red">"state"</font>:<font color="#A31515">"Mississippi"</font>,</codesnippet> <codesnippet>»»»»»»<font color="Red">"zip"</font>:96522</codesnippet> <codesnippet>»»»»}</codesnippet> <codesnippet>»»},</codesnippet> <codesnippet>»»{</codesnippet> <codesnippet>»»»»<font color="Red">"id"</font>:»2,</codesnippet> <codesnippet>»»»»<font color="Red">"name"</font>:»<font color="#A31515">"Diana"</font>,</codesnippet> <codesnippet>»»»»<font color="Red">"address"</font>:»{»»»»»»»»»»»»</codesnippet> <codesnippet>»»»»»»<font color="Red">"line1"</font>:<font color="#A31515">"P.O.»Box»283»8562"</font>,»</codesnippet> <codesnippet>»»»»»»<font color="Red">"line2"</font>:<font color="#A31515">"Fusce»Rd.»Frederick"</font>,</codesnippet> <codesnippet>»»»»»»<font color="Red">"state"</font>:<font color="#A31515">"Nebraska"</font>,</codesnippet> <codesnippet>»»»»»»<font color="Red">"zip"</font>:20620</codesnippet> <codesnippet>»»»»}</codesnippet> <codesnippet>»»},</codesnippet> <codesnippet>»»{</codesnippet> <codesnippet>»»»»<font color="Red">"id"</font>:»3,</codesnippet> <codesnippet>»»»»<font color="Red">"name"</font>:»<font color="#A31515">"Jason"</font>,</codesnippet> <codesnippet>»»»»<font color="Red">"address"</font>:»{</codesnippet> <codesnippet>»»»»»»<font color="Red">"line1"</font>:»<font color="#A31515">"606-3727»Ullamcorper"</font>,</codesnippet> <codesnippet>»»»»»»<font color="Red">"line2"</font>:<font color="#A31515">"Street»Roseville"</font>,</codesnippet> <codesnippet>»»»»»»<font color="Red">"state"</font>:<font color="#A31515">"NH"</font>,</codesnippet> <codesnippet>»»»»»»<font color="Red">"zip"</font>:»11523»»»</codesnippet> <codesnippet>»»»»}</codesnippet> <codesnippet>»»}</codesnippet> <codesnippet>]</codesnippet> <p id="48">As you can see, the address of each person is now split into four different key/value pairs: <b>line1</b>, <b>line2</b>, <b>state</b>, and <b>zip</b>. Notice that the first person does not have line2 attribute in its address.</p> <p id="49">As usual, you can use the <b>read_json_auto()</b> function to read the JSON file:</p> <codesnippet> <font color="Blue">import</font>»duckdb</codesnippet> <codesnippet>conn»=»duckdb.connect()</codesnippet> <codesnippet>conn.execute(<font color="#A31515">'''</font></codesnippet> <codesnippet> <font color="#A31515">»»»»SELECT»</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»*</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»FROM»read_json_auto('json1.json')</font> </codesnippet> <codesnippet> <font color="#A31515">'''</font>).df()</codesnippet> <p id="50">Observe that the address field is now a string containing all the four key/value pairs (see <b>Figure 14</b>).</p> <figure id="14" src="image14.png"> <b>Figure </b> <b>14</b> <b>:</b> The various parts of an address are all squeezed into a single column.</figure> <p id="51">So how do you extract the key value pairs of the addresses as individual columns? Fortunately you can do so via the SQL statement:</p> <codesnippet> <font color="Blue">import</font>»duckdb</codesnippet> <codesnippet>conn»=»duckdb.connect()</codesnippet> <codesnippet>conn.execute(<font color="#A31515">'''</font></codesnippet> <codesnippet> <font color="#A31515">»»»»SELECT»</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»id,</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»name,</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»address['line1']»as»line1,</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»address['line2']»as»line2,</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»address['state']»as»state,</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»address['zip']»as»zip,»»»»»»»»</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»FROM»read_json_auto('json1.json')</font> </codesnippet> <codesnippet> <font color="#A31515">'''</font>).df()</codesnippet> <p id="52"> <b>Figure </b> <b>15</b> now shows the key/value pairs for the addresses extracted as individual columns.</p> <figure id="15" src="image15.png"> <b>Figure </b> <b>15</b> <b>:</b> The addresses are now split into multiple columns.</figure> <p id="53">As the first person doesn’t have the <b>line2</b> attribute, it has a <b>NaN</b> value in the DataFrame.</p> <p id="54">Consider another example (json2.json) where all the information you’ve seen in the previous examples are encapsulated within the <b>people</b> key:</p> <codesnippet>{</codesnippet> <codesnippet>»»<font color="Red">"people"</font>:»[</codesnippet> <codesnippet>»»»»{</codesnippet> <codesnippet>»»»»»»<font color="Red">"id"</font>:»1,</codesnippet> <codesnippet>»»»»»»<font color="Red">"name"</font>:»<font color="#A31515">"Abigail"</font>,</codesnippet> <codesnippet>»»»»»»<font color="Red">"address"</font>:»{</codesnippet> <codesnippet>»»»»»»»»<font color="Red">"line1"</font>:<font color="#A31515">"711-2880»Nulla»St.»Mankato"</font>,</codesnippet> <codesnippet>»»»»»»»»<font color="Red">"state"</font>:<font color="#A31515">"Mississippi"</font>,</codesnippet> <codesnippet>»»»»»»»»<font color="Red">"zip"</font>:96522</codesnippet> <codesnippet>»»»»»»}</codesnippet> <codesnippet>»»»»},</codesnippet> <codesnippet>»»»»{</codesnippet> <codesnippet>»»»»»»<font color="Red">"id"</font>:»2,</codesnippet> <codesnippet>»»»»»»<font color="Red">"name"</font>:»<font color="#A31515">"Diana"</font>,</codesnippet> <codesnippet>»»»»»»<font color="Red">"address"</font>:»{»»»»»»»»»»»»</codesnippet> <codesnippet>»»»»»»»»<font color="Red">"line1"</font>:<font color="#A31515">"P.O.»Box»283»8562»"</font>,»</codesnippet> <codesnippet>»»»»»»»»<font color="Red">"line2"</font>:<font color="#A31515">"Fusce»Rd.»Frederick"</font>,</codesnippet> <codesnippet>»»»»»»»»<font color="Red">"state"</font>:<font color="#A31515">"Nebraska"</font>,</codesnippet> <codesnippet>»»»»»»»»<font color="Red">"zip"</font>:20620</codesnippet> <codesnippet>»»»»»»}</codesnippet> <codesnippet>»»»»},</codesnippet> <codesnippet>»»»»{</codesnippet> <codesnippet>»»»»»»<font color="Red">"id"</font>:»3,</codesnippet> <codesnippet>»»»»»»<font color="Red">"name"</font>:»<font color="#A31515">"Jason"</font>,</codesnippet> <codesnippet>»»»»»»<font color="Red">"address"</font>:»{</codesnippet> <codesnippet>»»»»»»»»<font color="Red">"line1"</font>:»<font color="#A31515">"606-3727»Ullamcorper"</font>,</codesnippet> <codesnippet>»»»»»»»»<font color="Red">"line2"</font>:<font color="#A31515">"Street»Roseville"</font>,</codesnippet> <codesnippet>»»»»»»»»<font color="Red">"state"</font>:<font color="#A31515">"NH"</font>,</codesnippet> <codesnippet>»»»»»»»»<font color="Red">"zip"</font>:»11523»»»</codesnippet> <codesnippet>»»»»»»}</codesnippet> <codesnippet>»»»»}</codesnippet> <codesnippet>»»]</codesnippet> <codesnippet>}</codesnippet> <p id="55">If you try to load it using <b>read_json_auto()</b>, the people key will be read as a single column (see <b>Figure 16</b>).</p> <codesnippet> <font color="Blue">import</font>»duckdb</codesnippet> <codesnippet>conn»=»duckdb.connect()</codesnippet> <codesnippet> <font color="Blue">import</font>»json</codesnippet> <codesnippet>conn.execute(<font color="#A31515">'''</font></codesnippet> <codesnippet> <font color="#A31515">»»»»SELECT»</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»*»»»»»»»»</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»FROM»read_json_auto('json2.json')</font> </codesnippet> <codesnippet> <font color="#A31515">'''</font>).df()</codesnippet> <figure id="16" src="image16.png"> <b>Figure </b> <b>16</b> <b>:</b> All of the details are stored in the people column.</figure> <p id="56">To properly read it into a Pandas DataFrame, you can use the <b>unnest()</b> function in SQL:</p> <codesnippet> <font color="Blue">import</font>»duckdb</codesnippet> <codesnippet>conn»=»duckdb.connect()</codesnippet> <codesnippet> <font color="Blue">import</font>»json</codesnippet> <codesnippet>conn.execute(<font color="#A31515">'''</font></codesnippet> <codesnippet> <font color="#A31515">»»»»SELECT»</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»p.id,</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»p.name,</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»p.address['line1']»as»line1,</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»p.address['line2']»as»line2,</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»p.address['state']»as»state,</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»p.address['zip']»as»zip,»»»»»</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»FROM»(</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»</font> <font color="#A31515">SELECT»unnest(people)»p»</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»»»»»</font> <font color="#A31515">FROM»read_json_auto('json2.json')</font> </codesnippet> <codesnippet> <font color="#A31515">»»»»</font> <font color="#A31515">)</font> </codesnippet> <codesnippet> <font color="#A31515">'''</font>).df()</codesnippet> <p id="57">The JSON file is now loaded correctly, just like <b>Figure 15</b>. </p> <h2>Summary</h2> <p id="58">If you’re a hardcore SQL developer, using DuckDB is a godsend when performing data analytics. Instead of manipulating DataFrames, you could now use your SQL knowledge to manipulate the data in whatever form you want. Of course, you still need a good working knowledge of DataFrames, but the bulk of the analytics part can be performed using SQL. In addition, JSON support in the latest version of DuckDB is definitely useful. Even so, loading the JSON properly into the required shape definitely takes some getting used to. Things are still rapidly evolving and hopefully, the next version will make things even easier!</p> </body> <sidebars /> <tables /> <codelistings /> </document>