TL;DR
Making use of SQLAlchemy and SQL queries, I extract EU industry production data for further analysis from the PostgreSQL database where I previously stored it.
Long Description
Building on the previous projects, I use SQLAlchemy to connect to a local PostgreSQL database that contains as a table an EU industry production dataset from the EU Open Data Portal. I study its size (rows and columns), identify the index columns and decipher their meaning using the online documentation of the dataset and specific SQL queries. Due to the fragmented structure of the documentation, finding the right information is rather tedious. This demonstrates some common pitfalls of working with third-party data. With the new-found knowledge, I extract a subset of the dataset (manufacturing index for France, adjusted for calendar and season) and load it into a Pandas dataframe.
Table of contents
Project Background
The data used for data science projects are often stored in SQL databases. However, the datasets can be very large and might not be needed in their entirety for a specific task. Therefore, it is useful to only extract the desired portion of the dataset for further work in Python.
Extracting a specific part of a dataset requires sufficient knowledge about the structure and meaning of the tables in the database. Here I will show how data can be extracted from a rather unclean table with a documentation that is all over the place. The dataset is the industry production data from the EU Open Data Portal that I previously ingested into a PostgreSQL database.
Connecting to the database
As in the previous project where I stored the data in the PostgreSQL database, I connect to it using the create_engine()
function and the connect()
method from the SQLAlchemy package:
import sqlalchemy
# Connect to database (Note: The package psychopg2 is required for Postgres to work with SQLAlchemy)
engine = sqlalchemy.create_engine("postgresql://postgres:xfkLVeMj@localhost/production")
con = engine.connect()
There is just one table in the database:
print(engine.table_names())
['industry_production']
Getting an overview of the table
Table reflection
Let’s see how the table is structured. This can be done using table reflections (see SQLAlchemy documentation). First, a MetaData
object has to be created. The instance has then to be passed to the Table
class, along with the table name and some autoload arguments to associate it with engine
:
metadata = sqlalchemy.MetaData()
industry_production = sqlalchemy.Table('industry_production', metadata, autoload=True, autoload_with=engine)
The columns of the table can now be displayed with the columns
attribute:
print(industry_production.columns)
['industry_production.indic_bt', 'industry_production.nace_r2', 'industry_production.s_adj', 'industry_production.unit', 'industry_production.geo\\time', 'industry_production.2017M08 ', 'industry_production.2017M07 ', 'industry_production.2017M06 ', 'industry_production.2017M05 ', 'industry_production.2017M04 ', 'industry_production.2017M03 ', 'industry_production.2017M02 ', 'industry_production.2017M01 ', 'industry_production.2016M12 ', 'industry_production.2016M11 ', 'industry_production.2016M10 ', 'industry_production.2016M09 ', 'industry_production.2016M08 ', 'industry_production.2016M07 ', 'industry_production.2016M06 ', 'industry_production.2016M05 ', 'industry_production.2016M04 ', 'industry_production.2016M03 ', 'industry_production.2016M02 ', 'industry_production.2016M01 ', 'industry_production.2015M12 ', 'industry_production.2015M11 ', 'industry_production.2015M10 ', 'industry_production.2015M09 ', 'industry_production.2015M08 ', 'industry_production.2015M07 ', 'industry_production.2015M06 ', 'industry_production.2015M05 ', 'industry_production.2015M04 ', 'industry_production.2015M03 ', 'industry_production.2015M02 ', 'industry_production.2015M01 ', 'industry_production.2014M12 ', 'industry_production.2014M11 ', 'industry_production.2014M10 ', 'industry_production.2014M09 ', 'industry_production.2014M08 ', 'industry_production.2014M07 ', 'industry_production.2014M06 ', 'industry_production.2014M05 ', 'industry_production.2014M04 ', 'industry_production.2014M03 ', 'industry_production.2014M02 ', 'industry_production.2014M01 ', 'industry_production.2013M12 ', 'industry_production.2013M11 ', 'industry_production.2013M10 ', 'industry_production.2013M09 ', 'industry_production.2013M08 ', 'industry_production.2013M07 ', 'industry_production.2013M06 ', 'industry_production.2013M05 ', 'industry_production.2013M04 ', 'industry_production.2013M03 ', 'industry_production.2013M02 ', 'industry_production.2013M01 ', 'industry_production.2012M12 ', 'industry_production.2012M11 ', 'industry_production.2012M10 ', 'industry_production.2012M09 ', 'industry_production.2012M08 ', 'industry_production.2012M07 ', 'industry_production.2012M06 ', 'industry_production.2012M05 ', 'industry_production.2012M04 ', 'industry_production.2012M03 ', 'industry_production.2012M02 ', 'industry_production.2012M01 ', 'industry_production.2011M12 ', 'industry_production.2011M11 ', 'industry_production.2011M10 ', 'industry_production.2011M09 ', 'industry_production.2011M08 ', 'industry_production.2011M07 ', 'industry_production.2011M06 ', 'industry_production.2011M05 ', 'industry_production.2011M04 ', 'industry_production.2011M03 ', 'industry_production.2011M02 ', 'industry_production.2011M01 ', 'industry_production.2010M12 ', 'industry_production.2010M11 ', 'industry_production.2010M10 ', 'industry_production.2010M09 ', 'industry_production.2010M08 ', 'industry_production.2010M07 ', 'industry_production.2010M06 ', 'industry_production.2010M05 ', 'industry_production.2010M04 ', 'industry_production.2010M03 ', 'industry_production.2010M02 ', 'industry_production.2010M01 ', 'industry_production.2009M12 ', 'industry_production.2009M11 ', 'industry_production.2009M10 ', 'industry_production.2009M09 ', 'industry_production.2009M08 ', 'industry_production.2009M07 ', 'industry_production.2009M06 ', 'industry_production.2009M05 ', 'industry_production.2009M04 ', 'industry_production.2009M03 ', 'industry_production.2009M02 ', 'industry_production.2009M01 ', 'industry_production.2008M12 ', 'industry_production.2008M11 ', 'industry_production.2008M10 ', 'industry_production.2008M09 ', 'industry_production.2008M08 ', 'industry_production.2008M07 ', 'industry_production.2008M06 ', 'industry_production.2008M05 ', 'industry_production.2008M04 ', 'industry_production.2008M03 ', 'industry_production.2008M02 ', 'industry_production.2008M01 ', 'industry_production.2007M12 ', 'industry_production.2007M11 ', 'industry_production.2007M10 ', 'industry_production.2007M09 ', 'industry_production.2007M08 ', 'industry_production.2007M07 ', 'industry_production.2007M06 ', 'industry_production.2007M05 ', 'industry_production.2007M04 ', 'industry_production.2007M03 ', 'industry_production.2007M02 ', 'industry_production.2007M01 ', 'industry_production.2006M12 ', 'industry_production.2006M11 ', 'industry_production.2006M10 ', 'industry_production.2006M09 ', 'industry_production.2006M08 ', 'industry_production.2006M07 ', 'industry_production.2006M06 ', 'industry_production.2006M05 ', 'industry_production.2006M04 ', 'industry_production.2006M03 ', 'industry_production.2006M02 ', 'industry_production.2006M01 ', 'industry_production.2005M12 ', 'industry_production.2005M11 ', 'industry_production.2005M10 ', 'industry_production.2005M09 ', 'industry_production.2005M08 ', 'industry_production.2005M07 ', 'industry_production.2005M06 ', 'industry_production.2005M05 ', 'industry_production.2005M04 ', 'industry_production.2005M03 ', 'industry_production.2005M02 ', 'industry_production.2005M01 ', 'industry_production.2004M12 ', 'industry_production.2004M11 ', 'industry_production.2004M10 ', 'industry_production.2004M09 ', 'industry_production.2004M08 ', 'industry_production.2004M07 ', 'industry_production.2004M06 ', 'industry_production.2004M05 ', 'industry_production.2004M04 ', 'industry_production.2004M03 ', 'industry_production.2004M02 ', 'industry_production.2004M01 ', 'industry_production.2003M12 ', 'industry_production.2003M11 ', 'industry_production.2003M10 ', 'industry_production.2003M09 ', 'industry_production.2003M08 ', 'industry_production.2003M07 ', 'industry_production.2003M06 ', 'industry_production.2003M05 ', 'industry_production.2003M04 ', 'industry_production.2003M03 ', 'industry_production.2003M02 ', 'industry_production.2003M01 ', 'industry_production.2002M12 ', 'industry_production.2002M11 ', 'industry_production.2002M10 ', 'industry_production.2002M09 ', 'industry_production.2002M08 ', 'industry_production.2002M07 ', 'industry_production.2002M06 ', 'industry_production.2002M05 ', 'industry_production.2002M04 ', 'industry_production.2002M03 ', 'industry_production.2002M02 ', 'industry_production.2002M01 ', 'industry_production.2001M12 ', 'industry_production.2001M11 ', 'industry_production.2001M10 ', 'industry_production.2001M09 ', 'industry_production.2001M08 ', 'industry_production.2001M07 ', 'industry_production.2001M06 ', 'industry_production.2001M05 ', 'industry_production.2001M04 ', 'industry_production.2001M03 ', 'industry_production.2001M02 ', 'industry_production.2001M01 ', 'industry_production.2000M12 ', 'industry_production.2000M11 ', 'industry_production.2000M10 ', 'industry_production.2000M09 ', 'industry_production.2000M08 ', 'industry_production.2000M07 ', 'industry_production.2000M06 ', 'industry_production.2000M05 ', 'industry_production.2000M04 ', 'industry_production.2000M03 ', 'industry_production.2000M02 ', 'industry_production.2000M01 ', 'industry_production.1999M12 ', 'industry_production.1999M11 ', 'industry_production.1999M10 ', 'industry_production.1999M09 ', 'industry_production.1999M08 ', 'industry_production.1999M07 ', 'industry_production.1999M06 ', 'industry_production.1999M05 ', 'industry_production.1999M04 ', 'industry_production.1999M03 ', 'industry_production.1999M02 ', 'industry_production.1999M01 ', 'industry_production.1998M12 ', 'industry_production.1998M11 ', 'industry_production.1998M10 ', 'industry_production.1998M09 ', 'industry_production.1998M08 ', 'industry_production.1998M07 ', 'industry_production.1998M06 ', 'industry_production.1998M05 ', 'industry_production.1998M04 ', 'industry_production.1998M03 ', 'industry_production.1998M02 ', 'industry_production.1998M01 ', 'industry_production.1997M12 ', 'industry_production.1997M11 ', 'industry_production.1997M10 ', 'industry_production.1997M09 ', 'industry_production.1997M08 ', 'industry_production.1997M07 ', 'industry_production.1997M06 ', 'industry_production.1997M05 ', 'industry_production.1997M04 ', 'industry_production.1997M03 ', 'industry_production.1997M02 ', 'industry_production.1997M01 ', 'industry_production.1996M12 ', 'industry_production.1996M11 ', 'industry_production.1996M10 ', 'industry_production.1996M09 ', 'industry_production.1996M08 ', 'industry_production.1996M07 ', 'industry_production.1996M06 ', 'industry_production.1996M05 ', 'industry_production.1996M04 ', 'industry_production.1996M03 ', 'industry_production.1996M02 ', 'industry_production.1996M01 ', 'industry_production.1995M12 ', 'industry_production.1995M11 ', 'industry_production.1995M10 ', 'industry_production.1995M09 ', 'industry_production.1995M08 ', 'industry_production.1995M07 ', 'industry_production.1995M06 ', 'industry_production.1995M05 ', 'industry_production.1995M04 ', 'industry_production.1995M03 ', 'industry_production.1995M02 ', 'industry_production.1995M01 ', 'industry_production.1994M12 ', 'industry_production.1994M11 ', 'industry_production.1994M10 ', 'industry_production.1994M09 ', 'industry_production.1994M08 ', 'industry_production.1994M07 ', 'industry_production.1994M06 ', 'industry_production.1994M05 ', 'industry_production.1994M04 ', 'industry_production.1994M03 ', 'industry_production.1994M02 ', 'industry_production.1994M01 ', 'industry_production.1993M12 ', 'industry_production.1993M11 ', 'industry_production.1993M10 ', 'industry_production.1993M09 ', 'industry_production.1993M08 ', 'industry_production.1993M07 ', 'industry_production.1993M06 ', 'industry_production.1993M05 ', 'industry_production.1993M04 ', 'industry_production.1993M03 ', 'industry_production.1993M02 ', 'industry_production.1993M01 ', 'industry_production.1992M12 ', 'industry_production.1992M11 ', 'industry_production.1992M10 ', 'industry_production.1992M09 ', 'industry_production.1992M08 ', 'industry_production.1992M07 ', 'industry_production.1992M06 ', 'industry_production.1992M05 ', 'industry_production.1992M04 ', 'industry_production.1992M03 ', 'industry_production.1992M02 ', 'industry_production.1992M01 ', 'industry_production.1991M12 ', 'industry_production.1991M11 ', 'industry_production.1991M10 ', 'industry_production.1991M09 ', 'industry_production.1991M08 ', 'industry_production.1991M07 ', 'industry_production.1991M06 ', 'industry_production.1991M05 ', 'industry_production.1991M04 ', 'industry_production.1991M03 ', 'industry_production.1991M02 ', 'industry_production.1991M01 ', 'industry_production.1990M12 ', 'industry_production.1990M11 ', 'industry_production.1990M10 ', 'industry_production.1990M09 ', 'industry_production.1990M08 ', 'industry_production.1990M07 ', 'industry_production.1990M06 ', 'industry_production.1990M05 ', 'industry_production.1990M04 ', 'industry_production.1990M03 ', 'industry_production.1990M02 ', 'industry_production.1990M01 ', 'industry_production.1989M12 ', 'industry_production.1989M11 ', 'industry_production.1989M10 ', 'industry_production.1989M09 ', 'industry_production.1989M08 ', 'industry_production.1989M07 ', 'industry_production.1989M06 ', 'industry_production.1989M05 ', 'industry_production.1989M04 ', 'industry_production.1989M03 ', 'industry_production.1989M02 ', 'industry_production.1989M01 ', 'industry_production.1988M12 ', 'industry_production.1988M11 ', 'industry_production.1988M10 ', 'industry_production.1988M09 ', 'industry_production.1988M08 ', 'industry_production.1988M07 ', 'industry_production.1988M06 ', 'industry_production.1988M05 ', 'industry_production.1988M04 ', 'industry_production.1988M03 ', 'industry_production.1988M02 ', 'industry_production.1988M01 ', 'industry_production.1987M12 ', 'industry_production.1987M11 ', 'industry_production.1987M10 ', 'industry_production.1987M09 ', 'industry_production.1987M08 ', 'industry_production.1987M07 ', 'industry_production.1987M06 ', 'industry_production.1987M05 ', 'industry_production.1987M04 ', 'industry_production.1987M03 ', 'industry_production.1987M02 ', 'industry_production.1987M01 ', 'industry_production.1986M12 ', 'industry_production.1986M11 ', 'industry_production.1986M10 ', 'industry_production.1986M09 ', 'industry_production.1986M08 ', 'industry_production.1986M07 ', 'industry_production.1986M06 ', 'industry_production.1986M05 ', 'industry_production.1986M04 ', 'industry_production.1986M03 ', 'industry_production.1986M02 ', 'industry_production.1986M01 ', 'industry_production.1985M12 ', 'industry_production.1985M11 ', 'industry_production.1985M10 ', 'industry_production.1985M09 ', 'industry_production.1985M08 ', 'industry_production.1985M07 ', 'industry_production.1985M06 ', 'industry_production.1985M05 ', 'industry_production.1985M04 ', 'industry_production.1985M03 ', 'industry_production.1985M02 ', 'industry_production.1985M01 ', 'industry_production.1984M12 ', 'industry_production.1984M11 ', 'industry_production.1984M10 ', 'industry_production.1984M09 ', 'industry_production.1984M08 ', 'industry_production.1984M07 ', 'industry_production.1984M06 ', 'industry_production.1984M05 ', 'industry_production.1984M04 ', 'industry_production.1984M03 ', 'industry_production.1984M02 ', 'industry_production.1984M01 ', 'industry_production.1983M12 ', 'industry_production.1983M11 ', 'industry_production.1983M10 ', 'industry_production.1983M09 ', 'industry_production.1983M08 ', 'industry_production.1983M07 ', 'industry_production.1983M06 ', 'industry_production.1983M05 ', 'industry_production.1983M04 ', 'industry_production.1983M03 ', 'industry_production.1983M02 ', 'industry_production.1983M01 ', 'industry_production.1982M12 ', 'industry_production.1982M11 ', 'industry_production.1982M10 ', 'industry_production.1982M09 ', 'industry_production.1982M08 ', 'industry_production.1982M07 ', 'industry_production.1982M06 ', 'industry_production.1982M05 ', 'industry_production.1982M04 ', 'industry_production.1982M03 ', 'industry_production.1982M02 ', 'industry_production.1982M01 ', 'industry_production.1981M12 ', 'industry_production.1981M11 ', 'industry_production.1981M10 ', 'industry_production.1981M09 ', 'industry_production.1981M08 ', 'industry_production.1981M07 ', 'industry_production.1981M06 ', 'industry_production.1981M05 ', 'industry_production.1981M04 ', 'industry_production.1981M03 ', 'industry_production.1981M02 ', 'industry_production.1981M01 ', 'industry_production.1980M12 ', 'industry_production.1980M11 ', 'industry_production.1980M10 ', 'industry_production.1980M09 ', 'industry_production.1980M08 ', 'industry_production.1980M07 ', 'industry_production.1980M06 ', 'industry_production.1980M05 ', 'industry_production.1980M04 ', 'industry_production.1980M03 ', 'industry_production.1980M02 ', 'industry_production.1980M01 ', 'industry_production.1979M12 ', 'industry_production.1979M11 ', 'industry_production.1979M10 ', 'industry_production.1979M09 ', 'industry_production.1979M08 ', 'industry_production.1979M07 ', 'industry_production.1979M06 ', 'industry_production.1979M05 ', 'industry_production.1979M04 ', 'industry_production.1979M03 ', 'industry_production.1979M02 ', 'industry_production.1979M01 ', 'industry_production.1978M12 ', 'industry_production.1978M11 ', 'industry_production.1978M10 ', 'industry_production.1978M09 ', 'industry_production.1978M08 ', 'industry_production.1978M07 ', 'industry_production.1978M06 ', 'industry_production.1978M05 ', 'industry_production.1978M04 ', 'industry_production.1978M03 ', 'industry_production.1978M02 ', 'industry_production.1978M01 ', 'industry_production.1977M12 ', 'industry_production.1977M11 ', 'industry_production.1977M10 ', 'industry_production.1977M09 ', 'industry_production.1977M08 ', 'industry_production.1977M07 ', 'industry_production.1977M06 ', 'industry_production.1977M05 ', 'industry_production.1977M04 ', 'industry_production.1977M03 ', 'industry_production.1977M02 ', 'industry_production.1977M01 ', 'industry_production.1976M12 ', 'industry_production.1976M11 ', 'industry_production.1976M10 ', 'industry_production.1976M09 ', 'industry_production.1976M08 ', 'industry_production.1976M07 ', 'industry_production.1976M06 ', 'industry_production.1976M05 ', 'industry_production.1976M04 ', 'industry_production.1976M03 ', 'industry_production.1976M02 ', 'industry_production.1976M01 ', 'industry_production.1975M12 ', 'industry_production.1975M11 ', 'industry_production.1975M10 ', 'industry_production.1975M09 ', 'industry_production.1975M08 ', 'industry_production.1975M07 ', 'industry_production.1975M06 ', 'industry_production.1975M05 ', 'industry_production.1975M04 ', 'industry_production.1975M03 ', 'industry_production.1975M02 ', 'industry_production.1975M01 ', 'industry_production.1974M12 ', 'industry_production.1974M11 ', 'industry_production.1974M10 ', 'industry_production.1974M09 ', 'industry_production.1974M08 ', 'industry_production.1974M07 ', 'industry_production.1974M06 ', 'industry_production.1974M05 ', 'industry_production.1974M04 ', 'industry_production.1974M03 ', 'industry_production.1974M02 ', 'industry_production.1974M01 ', 'industry_production.1973M12 ', 'industry_production.1973M11 ', 'industry_production.1973M10 ', 'industry_production.1973M09 ', 'industry_production.1973M08 ', 'industry_production.1973M07 ', 'industry_production.1973M06 ', 'industry_production.1973M05 ', 'industry_production.1973M04 ', 'industry_production.1973M03 ', 'industry_production.1973M02 ', 'industry_production.1973M01 ', 'industry_production.1972M12 ', 'industry_production.1972M11 ', 'industry_production.1972M10 ', 'industry_production.1972M09 ', 'industry_production.1972M08 ', 'industry_production.1972M07 ', 'industry_production.1972M06 ', 'industry_production.1972M05 ', 'industry_production.1972M04 ', 'industry_production.1972M03 ', 'industry_production.1972M02 ', 'industry_production.1972M01 ', 'industry_production.1971M12 ', 'industry_production.1971M11 ', 'industry_production.1971M10 ', 'industry_production.1971M09 ', 'industry_production.1971M08 ', 'industry_production.1971M07 ', 'industry_production.1971M06 ', 'industry_production.1971M05 ', 'industry_production.1971M04 ', 'industry_production.1971M03 ', 'industry_production.1971M02 ', 'industry_production.1971M01 ', 'industry_production.1970M12 ', 'industry_production.1970M11 ', 'industry_production.1970M10 ', 'industry_production.1970M09 ', 'industry_production.1970M08 ', 'industry_production.1970M07 ', 'industry_production.1970M06 ', 'industry_production.1970M05 ', 'industry_production.1970M04 ', 'industry_production.1970M03 ', 'industry_production.1970M02 ', 'industry_production.1970M01 ', 'industry_production.1969M12 ', 'industry_production.1969M11 ', 'industry_production.1969M10 ', 'industry_production.1969M09 ', 'industry_production.1969M08 ', 'industry_production.1969M07 ', 'industry_production.1969M06 ', 'industry_production.1969M05 ', 'industry_production.1969M04 ', 'industry_production.1969M03 ', 'industry_production.1969M02 ', 'industry_production.1969M01 ', 'industry_production.1968M12 ', 'industry_production.1968M11 ', 'industry_production.1968M10 ', 'industry_production.1968M09 ', 'industry_production.1968M08 ', 'industry_production.1968M07 ', 'industry_production.1968M06 ', 'industry_production.1968M05 ', 'industry_production.1968M04 ', 'industry_production.1968M03 ', 'industry_production.1968M02 ', 'industry_production.1968M01 ', 'industry_production.1967M12 ', 'industry_production.1967M11 ', 'industry_production.1967M10 ', 'industry_production.1967M09 ', 'industry_production.1967M08 ', 'industry_production.1967M07 ', 'industry_production.1967M06 ', 'industry_production.1967M05 ', 'industry_production.1967M04 ', 'industry_production.1967M03 ', 'industry_production.1967M02 ', 'industry_production.1967M01 ', 'industry_production.1966M12 ', 'industry_production.1966M11 ', 'industry_production.1966M10 ', 'industry_production.1966M09 ', 'industry_production.1966M08 ', 'industry_production.1966M07 ', 'industry_production.1966M06 ', 'industry_production.1966M05 ', 'industry_production.1966M04 ', 'industry_production.1966M03 ', 'industry_production.1966M02 ', 'industry_production.1966M01 ', 'industry_production.1965M12 ', 'industry_production.1965M11 ', 'industry_production.1965M10 ', 'industry_production.1965M09 ', 'industry_production.1965M08 ', 'industry_production.1965M07 ', 'industry_production.1965M06 ', 'industry_production.1965M05 ', 'industry_production.1965M04 ', 'industry_production.1965M03 ', 'industry_production.1965M02 ', 'industry_production.1965M01 ', 'industry_production.1964M12 ', 'industry_production.1964M11 ', 'industry_production.1964M10 ', 'industry_production.1964M09 ', 'industry_production.1964M08 ', 'industry_production.1964M07 ', 'industry_production.1964M06 ', 'industry_production.1964M05 ', 'industry_production.1964M04 ', 'industry_production.1964M03 ', 'industry_production.1964M02 ', 'industry_production.1964M01 ', 'industry_production.1963M12 ', 'industry_production.1963M11 ', 'industry_production.1963M10 ', 'industry_production.1963M09 ', 'industry_production.1963M08 ', 'industry_production.1963M07 ', 'industry_production.1963M06 ', 'industry_production.1963M05 ', 'industry_production.1963M04 ', 'industry_production.1963M03 ', 'industry_production.1963M02 ', 'industry_production.1963M01 ', 'industry_production.1962M12 ', 'industry_production.1962M11 ', 'industry_production.1962M10 ', 'industry_production.1962M09 ', 'industry_production.1962M08 ', 'industry_production.1962M07 ', 'industry_production.1962M06 ', 'industry_production.1962M05 ', 'industry_production.1962M04 ', 'industry_production.1962M03 ', 'industry_production.1962M02 ', 'industry_production.1962M01 ', 'industry_production.1961M12 ', 'industry_production.1961M11 ', 'industry_production.1961M10 ', 'industry_production.1961M09 ', 'industry_production.1961M08 ', 'industry_production.1961M07 ', 'industry_production.1961M06 ', 'industry_production.1961M05 ', 'industry_production.1961M04 ', 'industry_production.1961M03 ', 'industry_production.1961M02 ', 'industry_production.1961M01 ', 'industry_production.1960M12 ', 'industry_production.1960M11 ', 'industry_production.1960M10 ', 'industry_production.1960M09 ', 'industry_production.1960M08 ', 'industry_production.1960M07 ', 'industry_production.1960M06 ', 'industry_production.1960M05 ', 'industry_production.1960M04 ', 'industry_production.1960M03 ', 'industry_production.1960M02 ', 'industry_production.1960M01 ', 'industry_production.1959M12 ', 'industry_production.1959M11 ', 'industry_production.1959M10 ', 'industry_production.1959M09 ', 'industry_production.1959M08 ', 'industry_production.1959M07 ', 'industry_production.1959M06 ', 'industry_production.1959M05 ', 'industry_production.1959M04 ', 'industry_production.1959M03 ', 'industry_production.1959M02 ', 'industry_production.1959M01 ', 'industry_production.1958M12 ', 'industry_production.1958M11 ', 'industry_production.1958M10 ', 'industry_production.1958M09 ', 'industry_production.1958M08 ', 'industry_production.1958M07 ', 'industry_production.1958M06 ', 'industry_production.1958M05 ', 'industry_production.1958M04 ', 'industry_production.1958M03 ', 'industry_production.1958M02 ', 'industry_production.1958M01 ', 'industry_production.1957M12 ', 'industry_production.1957M11 ', 'industry_production.1957M10 ', 'industry_production.1957M09 ', 'industry_production.1957M08 ', 'industry_production.1957M07 ', 'industry_production.1957M06 ', 'industry_production.1957M05 ', 'industry_production.1957M04 ', 'industry_production.1957M03 ', 'industry_production.1957M02 ', 'industry_production.1957M01 ', 'industry_production.1956M12 ', 'industry_production.1956M11 ', 'industry_production.1956M10 ', 'industry_production.1956M09 ', 'industry_production.1956M08 ', 'industry_production.1956M07 ', 'industry_production.1956M06 ', 'industry_production.1956M05 ', 'industry_production.1956M04 ', 'industry_production.1956M03 ', 'industry_production.1956M02 ', 'industry_production.1956M01 ', 'industry_production.1955M12 ', 'industry_production.1955M11 ', 'industry_production.1955M10 ', 'industry_production.1955M09 ', 'industry_production.1955M08 ', 'industry_production.1955M07 ', 'industry_production.1955M06 ', 'industry_production.1955M05 ', 'industry_production.1955M04 ', 'industry_production.1955M03 ', 'industry_production.1955M02 ', 'industry_production.1955M01 ', 'industry_production.1954M12 ', 'industry_production.1954M11 ', 'industry_production.1954M10 ', 'industry_production.1954M09 ', 'industry_production.1954M08 ', 'industry_production.1954M07 ', 'industry_production.1954M06 ', 'industry_production.1954M05 ', 'industry_production.1954M04 ', 'industry_production.1954M03 ', 'industry_production.1954M02 ', 'industry_production.1954M01 ', 'industry_production.1953M12 ', 'industry_production.1953M11 ', 'industry_production.1953M10 ', 'industry_production.1953M09 ', 'industry_production.1953M08 ', 'industry_production.1953M07 ', 'industry_production.1953M06 ', 'industry_production.1953M05 ', 'industry_production.1953M04 ', 'industry_production.1953M03 ', 'industry_production.1953M02 ', 'industry_production.1953M01']
The column names contain the table name and the actual column name, separated by a dot.
Counting the columns
How many columns are there? Let’s find out by getting the length of the column name list:
print(len(industry_production.columns))
781
Okay, how can I make sense of the column names? Let’s first look at columns 6 to 781: They contain the year (four digits) and the month (two digits), separated by an “M”. So each month between August 2017 and January 1953 has its own column, and we go back in time from left to right. Annoyingly, there is a trailing space in the column names, which is an artefact from ingesting the dataset into the database.
The first five columns are more cryptic. Unfortunately, the EU Open Data Portal seems to hide information about the data format well. Before trying to find out more about the column names, let’s see what kind of values are stored in them.
Basic SQL queries
Retrieving the whole table
To access the data in the table, I use queries in the SQL language. The basic syntax starts with the SELECT
keyword, followed by the column names that are being requested, then the FROM
keyword and the table name (because there are usually more than one table in the database). To select all columns, I use the *
operator. Without specifying any filter, all the rows in the table are selected. This means that to retrieve the whole table, the query would be SELECT * FROM industry_production
.
Counting the rows
However, the table is rather clunky, so let me use a slightly modified query instead, for counting the number of rows in the table. This is done by applying the count()
function, like this:
result = con.execute('SELECT count(*) FROM industry_production')
print(result.fetchall())
[(19197,)]
So there are 19197 rows in the table.
Retrieving only some columns and rows
How do the values in the table look like? To get some idea, let me select the values for the first seven columns (i.e., the five “cryptic” columns, presumably some index) plus the first two data columns. I select the columns explicitly by name. Due to the use of the backslash character, “geo\time” has to be enclosed in quotation marks. The same is the case for the data columns as they start with a numeric character (which actually is bad SQL practice — they should start with a letter) and end with a space, which has to be explicitly enclosed. I also limit the output to the first five rows, using the LIMIT
keyword:
result = con.execute('SELECT indic_bt, nace_r2, s_adj, unit, "geo\\time", "2017M08 ", "2017M07 " FROM industry_production LIMIT 5')
print(result.fetchall())
[('PROD', 'B', 'CA', 'I10', 'AT', ': ', '108.8 p'), ('PROD', 'B', 'CA', 'I10', 'BA', '122.1 ', '115.3 '), ('PROD', 'B', 'CA', 'I10', 'BE', ': ', '102.0 p'), ('PROD', 'B', 'CA', 'I10', 'BG', ': ', '98.0 p'), ('PROD', 'B', 'CA', 'I10', 'CY', ': ', '60.8 p')]
The first five columns apparently have string values, where only the fifth column varies over the first five rows. This lets me suspect that the values are categorical, i.e. are chosen from a finite set of distinct elements.
Understanding the index columns
How many different elements are there for each column? Let’s find out with queries using the DISTINCT
keyword before the column name, and use one query per column.
The “indic_bt” column
I start with the “indic_bt” column:
print(con.execute('SELECT DISTINCT indic_bt FROM industry_production').fetchall())
[('PROD',)]
So the first column has only one value! Looking at the rather confusing metadata documentation on the EU Open Data Portal website, the “indic_bt” column can be related to Section 3.4 therein. “PROD” seems to stand for “production” and might be used to distinguish this dataset from other data products (turnover, volume of sales, etc.). But this info is already contained in the table name and thus the column is redundant.
The “nace_r2” column
What about the “nace_r2” column? Let me look at the alphabetically sorted list:
categories = con.execute('SELECT DISTINCT nace_r2 FROM industry_production').fetchall()
print(sorted(categories))
[('B',), ('B-D',), ('B-D_F',), ('B-D_X_FOOD',), ('B05',), ('B051',), ('B052',), ('B06',), ('B061',), ('B062',), ('B07',), ('B071',), ('B072',), ('B08',), ('B081',), ('B089',), ('B0891',), ('B0892',), ('B0893',), ('B0899',), ('B09',), ('B_C',), ('B_C_X_FD_MIG_NRG',), ('B_C_X_MIG_NRG',), ('C',), ('C10',), ('C10-C12',), ('C101',), ('C1011',), ('C1012',), ('C1013',), ('C102',), ('C103',), ('C1031',), ('C1032',), ('C1039',), ('C104',), ('C1041',), ('C1042',), ('C105',), ('C1051',), ('C1052',), ('C106',), ('C1061',), ('C1062',), ('C107',), ('C1071',), ('C1072',), ('C1073',), ('C108',), ('C1081',), ('C1082',), ('C1083',), ('C1084',), ('C1085',), ('C1086',), ('C1089',), ('C109',), ('C1091',), ('C1092',), ('C10_C11',), ('C11',), ('C1101',), ('C1102',), ('C1103',), ('C1104',), ('C1105',), ('C1106',), ('C1107',), ('C12',), ('C13',), ('C13-C15',), ('C131',), ('C132',), ('C133',), ('C139',), ('C1391',), ('C1392',), ('C1393',), ('C1394',), ('C1395',), ('C1396',), ('C1399',), ('C13_C14',), ('C14',), ('C141',), ('C1411',), ('C1412',), ('C1413',), ('C1414',), ('C1419',), ('C142',), ('C143',), ('C1431',), ('C1439',), ('C15',), ('C151',), ('C1511',), ('C1512',), ('C152',), ('C16',), ('C16-C18',), ('C161',), ('C162',), ('C1621',), ('C1622',), ('C1623',), ('C1624',), ('C1629',), ('C17',), ('C171',), ('C1711',), ('C1712',), ('C172',), ('C1721',), ('C1722',), ('C1723',), ('C1724',), ('C1729',), ('C17_C18',), ('C18',), ('C181',), ('C1811',), ('C1812',), ('C1813',), ('C1814',), ('C182',), ('C19',), ('C191',), ('C192',), ('C20',), ('C201',), ('C2011',), ('C2012',), ('C2013',), ('C2014',), ('C2015',), ('C2016',), ('C2017',), ('C202',), ('C203',), ('C204',), ('C2041',), ('C2042',), ('C205',), ('C2051',), ('C2052',), ('C2053',), ('C2059',), ('C206',), ('C20_C21',), ('C21',), ('C211',), ('C212',), ('C22',), ('C221',), ('C2211',), ('C2219',), ('C222',), ('C2221',), ('C2222',), ('C2223',), ('C2229',), ('C22_C23',), ('C23',), ('C231',), ('C2311',), ('C2312',), ('C2313',), ('C2314',), ('C2319',), ('C232',), ('C233',), ('C2331',), ('C2332',), ('C234',), ('C2341',), ('C2342',), ('C2343',), ('C2344',), ('C2349',), ('C235',), ('C2351',), ('C2352',), ('C236',), ('C2361',), ('C2362',), ('C2363',), ('C2364',), ('C2365',), ('C2369',), ('C237',), ('C239',), ('C2391',), ('C2399',), ('C24',), ('C241',), ('C242',), ('C243',), ('C2431',), ('C2432',), ('C2433',), ('C2434',), ('C244',), ('C2441',), ('C2442',), ('C2443',), ('C2444',), ('C2445',), ('C245',), ('C2451',), ('C2452',), ('C2453',), ('C2454',), ('C24_C25',), ('C25',), ('C251',), ('C2511',), ('C2512',), ('C252',), ('C2521',), ('C2529',), ('C253',), ('C254',), ('C255',), ('C256',), ('C2561',), ('C2562',), ('C257',), ('C2571',), ('C2572',), ('C2573',), ('C259',), ('C2591',), ('C2592',), ('C2593',), ('C2594',), ('C2599',), ('C26',), ('C261',), ('C2611',), ('C2612',), ('C262',), ('C263',), ('C264',), ('C265',), ('C2651',), ('C2652',), ('C266',), ('C267',), ('C268',), ('C26_C27',), ('C27',), ('C271',), ('C2711',), ('C2712',), ('C272',), ('C273',), ('C2731',), ('C2732',), ('C2733',), ('C274',), ('C275',), ('C2751',), ('C2752',), ('C279',), ('C28',), ('C281',), ('C2811',), ('C2812',), ('C2813',), ('C2814',), ('C2815',), ('C282',), ('C2821',), ('C2822',), ('C2823',), ('C2824',), ('C2825',), ('C2829',), ('C283',), ('C284',), ('C2841',), ('C2849',), ('C289',), ('C2891',), ('C2892',), ('C2893',), ('C2894',), ('C2895',), ('C2896',), ('C2899',), ('C29',), ('C291',), ('C292',), ('C293',), ('C2931',), ('C2932',), ('C29_C30',), ('C30',), ('C301',), ('C3011',), ('C3012',), ('C302',), ('C303',), ('C304',), ('C309',), ('C3091',), ('C3092',), ('C3099',), ('C31',), ('C31-C33',), ('C310',), ('C3101',), ('C3102',), ('C3103',), ('C3109',), ('C31_C32',), ('C32',), ('C321',), ('C3211',), ('C3212',), ('C3213',), ('C322',), ('C323',), ('C324',), ('C325',), ('C329',), ('C3291',), ('C3299',), ('C33',), ('C331',), ('C3311',), ('C3312',), ('C3313',), ('C3314',), ('C3315',), ('C3316',), ('C3317',), ('C3319',), ('C332',), ('C_HTC',), ('C_HTC_M',), ('C_LTC',), ('C_LTC_M',), ('D',), ('D35',), ('D351',), ('D352',), ('D353',), ('E36',), ('MIG_CAG',), ('MIG_COG',), ('MIG_COG_X_FOOD',), ('MIG_DCOG',), ('MIG_ING',), ('MIG_ING_CAG',), ('MIG_NDCOG',), ('MIG_NRG_X_D_E',), ('MIG_NRG_X_E',)]
Here I get many cryptic categories that start with the letters “B”, “C”, “D” or “E”, followed by other letters that are separated with a dash or underscore, or numerical characters. The metadata documentation can help here as well (again after some skimming). Section 3.2 and 3.3 say: “NACE Rev.2 classification (Statistical Classification of Economic Activities in the European Community) is used for all the STS indicators, except Industrial Import Prices (…)” and “INDUSTRY: The indicators in this sector cover economic activities listed in sections B to E of NACE (B-Mining and quarrying, C-Manufacturing, D-Electricity, gas, steam and air conditioning supply, E-Water supply; sewerage, waste management and remediation activities”. So the letters are a code for the specific industry sector.
Clicking on the “NACE Rev.2” link produces a list of these sectors, with further information about the numerical characters. For example, under letter “B” we find the following info: “Mining activities are classified into divisions, groups and classes on the basis of the principal mineral produced. Divisions 05, 06 are concerned with mining and quarrying of fossil fuels (coal, lignite, petroleum, gas); divisions 07, 08 concern metal ores, various minerals and quarry products.”
Some categories start with “MIG”. The metadata info gives some info on those ones in Section 18.6: “industrial production: European aggregates and data for LU: total, MIGs (intermediate goods, energy, capital goods, durable consumer goods, non-durable consumer goods)”.
To summarize, the “nace_r2” values are codes for industry (sub-)sectors.
The “s_adj” column
Let’s continue with the third column, “s_adj”:
print(con.execute('SELECT DISTINCT s_adj FROM industry_production').fetchall())
[('NSA',), ('CA',), ('SCA',)]
Here, there are only three different categories. The column name seems to be an abbreviation for “seasonal adjustment” (again Section 18.6). ‘NSA’ apparently stands for “No seasonal adjustment”, ‘CA’ for “Calendar adjustment”, and ‘SCA’ for “Calendar and seasonal adjustment” (This can also be seen from the drop-down menus in the online interactive table version of the dataset.
The “unit” column
The next column is “unit”:
print(con.execute('SELECT DISTINCT unit FROM industry_production').fetchall())
[('I10',), ('PCH_SM',), ('PCH_PRE',)]
Again, there are three distinct values. The interactive table is helpful once more: ‘I10’ means that the value is an index that is normalized to 100 in the year 2010, ‘PCH_SM’ is the percentage change compared to the same month in the previous year, and ‘PCH_PRE’ is the percentage change compared to the previous month.
The “geo\\time” column
This leaves me with the fifth column, “geo\time”:
print(con.execute('SELECT DISTINCT "geo\\time" FROM industry_production').fetchall())
[('IE',), ('ES',), ('EA19',), ('MT',), ('BE',), ('AT',), ('BA',), ('EL',), ('NL',), ('CY',), ('DK',), ('SI',), ('BG',), ('CZ',), ('TR',), ('LT',), ('RS',), ('IT',), ('LU',), ('SK',), ('NO',), ('PT',), ('FI',), ('PL',), ('DE',), ('EE',), ('HU',), ('ME',), ('SE',), ('UK',), ('MK',), ('FR',), ('LV',), ('RO',), ('EU28',), ('HR',)]
These values can be recognized as the geo codes (or country codes) for the 28 EU member states, plus a few other countries (like ‘TR’ for Turkey) and aggregated values for the Euro currency area (‘EA19’) and the whole European Union (‘EU28’). Tables with the geo codes and the country names can be found here.
Hence the “geo” part of the “geo\time” column name. The backslash separates the “geo” part (rows) from the “time” part (columns). Since the column name should just describe the column values, the “\time” part should not be in it.
Combining the index columns
Now that I understand the meaning of the five index columns: How many unique combinations of these columns are there? This I can find out using the DISTINCT keyword once again, but applied to all five columns at once. Using the AS aliasing keyword, I give the name “unique_index_columns” to the result of this query. “unique_index_columns” is a table itself, and so I can apply the count syntax that I used earlier to the “unique_index_columns” table to obtain the number of unique combinations (this is an example for a nested query):
result = con.execute('SELECT count(*) FROM \
(SELECT DISTINCT indic_bt, nace_r2, s_adj, unit, "geo\\time" FROM industry_production) \
AS unique_index_columns')
print(result.fetchall())
[(19197,)]
So there are 19197 unique combinations of the values in the five columns, which matches the total number of rows in the table. This is consistent with the pandas dataframe from which the industry_production table was constructed (see previous project), where the five columns made up a multi-level index. Note that the “indic_bt” column is superfluous though, since it always has the same value.
Example: Extracting manufacturing data for France
At the end, let me now extract a part of this dataset from the database. I will select the manufacturing (“C”) production data index for France, adjusted for calendar and season. These conditions can be specified in SQL with the WHERE keyword and combined with the AND keyword. Note that I have to escape the single quotes with a backslash to tell Python that they are part of the query string and not marking the beginning or end of the string:
query = 'SELECT * FROM industry_production WHERE nace_r2=\'C\' AND s_adj=\'SCA\' AND unit=\'I10\' AND "geo\\time"=\'FR\''
result = con.execute(query)
print(result.fetchall())
[('PROD', 'C', 'SCA', 'I10', 'FR', ': ', '104.4 ', '104.2 ', '105.1 ', '103.0 ', '104.5 ', '101.6 ', '102.3 ', '102.8 ', '104.1 ', '101.3 ', '101.6 ', '103.4 ', '100.5 ', '100.9 ', '101.8 ', '102.5 ', '100.5 ', '102.0 ', '103.2 ', '102.7 ', '102.3 ', '103.2 ', '102.7 ', '103.1 ', '100.2 ', '102.2 ', '101.2 ', '100.8 ', '101.3 ', '100.4 ', '100.3 ', '101.7 ', '98.5 ', '99.8 ', '100.8 ', '98.7 ', '100.9 ', '99.6 ', '97.9 ', '100.5 ', '100.0 ', '100.7 ', '99.7 ', '100.3 ', '100.7 ', '100.4 ', '100.0 ', '99.4 ', '100.7 ', '100.3 ', '101.3 ', '101.3 ', '99.1 ', '99.6 ', '99.4 ', '100.1 ', '99.1 ', '99.3 ', '100.7 ', '103.3 ', '101.7 ', '100.7 ', '100.8 ', '101.2 ', '102.7 ', '100.9 ', '102.3 ', '103.6 ', '104.8 ', '102.4 ', '102.3 ', '102.5 ', '103.8 ', '102.3 ', '106.5 ', '103.0 ', '104.3 ', '105.4 ', '104.8 ', '102.6 ', '101.5 ', '100.6 ', '101.0 ', '99.0 ', '101.0 ', '100.1 ', '100.2 ', '99.6 ', '99.6 ', '97.3 ', '97.7 ', '97.4 ', '98.8 ', '97.3 ', '98.3 ', '96.8 ', '96.0 ', '96.8 ', '95.2 ', '93.8 ', '94.1 ', '95.1 ', '96.2 ', '101.0 ', '102.7 ', '109.2 ', '111.5 ', '111.8 ', '113.8 ', '113.0 ', '113.6 ', '118.2 ', '116.7 ', '118.8 ', '117.3 ', '116.2 ', '116.4 ', '117.7 ', '115.1 ', '119.0 ', '117.8 ', '117.6 ', '118.5 ', '116.2 ', '117.5 ', '116.7 ', '115.8 ', '117.0 ', '114.8 ', '115.7 ', '116.7 ', '115.5 ', '114.9 ', '117.1 ', '117.0 ', '113.8 ', '115.6 ', '113.0 ', '113.7 ', '114.8 ', '116.3 ', '111.9 ', '115.1 ', '112.4 ', '113.0 ', '112.9 ', '113.0 ', '114.6 ', '111.3 ', '113.6 ', '116.0 ', '113.5 ', '113.3 ', '115.5 ', '114.6 ', '111.6 ', '115.4 ', '114.7 ', '113.0 ', '114.0 ', '114.0 ', '115.2 ', '113.2 ', '112.6 ', '111.9 ', '114.0 ', '112.0 ', '112.0 ', '113.0 ', '110.7 ', '111.0 ', '113.9 ', '114.3 ', '114.2 ', '113.8 ', '112.0 ', '114.8 ', '113.3 ', '114.5 ', '116.9 ', '114.5 ', '115.1 ', '115.9 ', '117.1 ', '116.7 ', '116.5 ', '116.0 ', '115.1 ', '115.0 ', '117.2 ', '117.6 ', '119.9 ', '117.0 ', '119.0 ', '118.9 ', '117.8 ', '120.7 ', '120.5 ', '120.1 ', '121.4 ', '119.4 ', '120.0 ', '118.7 ', '117.2 ', '119.1 ', '118.1 ', '119.9 ', '118.3 ', '117.6 ', '117.3 ', '116.8 ', '116.4 ', '117.6 ', '117.9 ', '116.1 ', '113.7 ', '115.6 ', '115.7 ', '114.5 ', '114.2 ', '113.3 ', '113.0 ', '114.7 ', '113.6 ', '114.6 ', '114.4 ', '115.2 ', '114.0 ', '116.1 ', '115.7 ', '116.3 ', '114.9 ', '114.0 ', '115.7 ', '115.5 ', '115.1 ', '113.3 ', '115.5 ', '113.1 ', '115.9 ', '111.3 ', '111.7 ', '111.4 ', '113.0 ', '110.2 ', '109.3 ', '107.2 ', '108.8 ', '108.8 ', '107.6 ', '108.7 ', '109.6 ', '106.5 ', '108.9 ', '109.3 ', '107.7 ', '109.1 ', '107.6 ', '108.6 ', '111.1 ', '108.3 ', '110.2 ', '111.3 ', '110.0 ', '112.0 ', '110.6 ', '110.2 ', '111.4 ', '110.5 ', '111.6 ', '110.9 ', '110.4 ', '111.0 ', '109.7 ', '108.0 ', '108.7 ', '108.4 ', '107.8 ', '108.2 ', '107.5 ', '105.1 ', '105.2 ', '105.6 ', '102.8 ', '103.2 ', '104.8 ', '105.4 ', '104.3 ', '106.1 ', '103.9 ', '104.6 ', '105.0 ', '105.3 ', '106.3 ', '107.3 ', '106.9 ', '108.6 ', '109.0 ', '108.9 ', '111.0 ', '110.3 ', '111.4 ', '113.2 ', '112.3 ', '113.7 ', '112.8 ', '112.9 ', '114.4 ', '115.2 ', '113.2 ', '114.8 ', '114.4 ', '113.1 ', '114.5 ', '112.7 ', '114.6 ', '116.0 ', '115.5 ', '116.9 ', '115.8 ', '114.4 ', '116.3 ', '116.7 ', '115.8 ', '116.8 ', '117.5 ', '118.7 ', '118.3 ', '118.2 ', '119.5 ', '117.4 ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ': ', ':')]
The query result can be directly written into a pandas dataframe with the read_sql_query()
method:
import pandas as pd
df = pd.read_sql_query(query, engine, index_col=['indic_bt', 'nace_r2', 's_adj', 'unit', 'geo\\time'])
df.info()
<class 'pandas.core.frame.DataFrame'> MultiIndex: 1 entries, (PROD, C, SCA, I10, FR) to (PROD, C, SCA, I10, FR) Columns: 776 entries, 2017M08 to 1953M01 dtypes: object(776) memory usage: 6.4+ KB
Note that Pandas has taken care to also extract the column names, but I had to explicitly specify the index columns to combine them into a MultiIndex (otherwise Pandas would have created a RangeIndex).
Conclusion
I connected to a PostgreSQL database out of Python using the SQLAlchemy package and obtained information about the industry_production
table with a table reflection and SQL queries. The rather cryptic index column names and their values could only be understood with the help of the documentation in the EU Open Data Portal, where the dataset originally came from. However, the information is scattered over several webpages, so I had to put some effort into connecting the dots.
In hindsight, it might have been beneficial for me if I had taken greater care when ingesting the dataset into the PostgreSQL database. For example, I could have cleaned the column names and avoided the bad naming scheme of columns beginning with a numeric character (which complicates the queries). On the other hand, this is a good example for the kind of challenges one can face when dealing with third-party data — so this was an excellent exercise.
Code
The project code was written using Jupyter Notebook 5.0.0, running the Python 3.6.1 kernel and Anaconda 4.4.0.
The Jupyter notebook can be found on Github.
Bio
I am a data scientist with a background in solar physics, with a long experience of turning complex data into valuable insights. Originally coming from Matlab, I now use the Python stack to solve problems.
Contact details
Jan Langfellner
contact@jan-langfellner.de
linkedin.com/in/jan-langfellner/
2 thoughts on “Using SQL queries to extract data from a PostgreSQL database”