Automated data retrieval from an online repository to harness free data sources

TL;DR

I obtain monthly EU industry production data from the European Union Open Data Portal through an API.

Long Description

Using the Python requests package, I access the API of the European Union Open Data Portal and search for the monthly EU industry production dataset. I learn which commands to use to send queries to the API, locate the dataset and retrieve metadata of the dataset. From the metadata, I obtain the URL of the dataset and download the data as tab-separated values (tsv file) directly into a pandas dataframe. This approach makes the data import reproducible, which is of paramount importance to ensure a transparent data science process right from the start.

Table of contents

Project Background

The world wide web contains a wealth of information that can be useful for all kinds of data science projects. Often governments, companies or other organizations provide free access to their databases via their websites (a list of great free data sources can be found here, for example). I will demonstrate how Python can be used to access data from an online repository through an API (application programming interface) and store the data in a pandas dataframe. For this example, I will download EU industry data from the European Union Open Data Portal (EU ODP).

The dataset contains monthly industry production data (see here for definition) from 1953 through 2017 for all EU member countries and can also be manually downloaded online. Here, however, the purpose is to obtain the data via a Python script and store it directly in a pandas dataframe. This approach allows me to automate the data retrieval and make it reproducible (as long as the data are still being offered in the online repository) and easy to update. The data access will work through an API.

The documentation of the EU ODP API tells us the URL of the API: http://data.europa.eu/euodp/data/api. This will be the access point for the dataset and its metadata.

Importing the packages

Let’s start with importing the necessary packages to Python:

import requests
import pandas as pd

With the requests package, I can make a connection to the API and perform queries in an easy way.

Pandas is a Python package that allows to store data efficiently in a tabular format and brings many methods to analyze, visualize, manipulate, restructure, or export these data — so I will make use of this package a lot in the upcoming projects.

Connecting to the API

Let’s now connect to the API using the get method and print out the response with the text attribute:

url = 'http://data.europa.eu/euodp/data/api'
response = requests.get(url)
print(response.text)
{"version": 1}

The response is in the JSON (JavaScript Object Notation) format, which delivers results as ‘key: value’ pairs, separated by commas. The whole result is enclosed by curly brackets and readable by humans.

Since I didn’t specify any query, the result is just a simple statement of something having the version of 1, presumably the API.

Finding the dataset through queries

How do I perform a query? The EU ODP API website says that the API is an implementation of the more general CKAN API and refers me to its documentation. There I can learn that a query can be specified by adding '/action/' plus some instruction string to the URL. For example, I can request a list of all available datasets with the dataset_list command. Let’s try it out:

query = '/action/dataset_list'
response = requests.get(url+query)
print(response.text)
{"help": "Return a list of the names of the site's datasets (packages).\n\n    :param limit: if given, the list of datasets will be broken into pages of\n        at most ``limit`` datasets per page and only one page will be returned\n        at a time (optional)\n    :type limit: int\n    :param offset: when ``limit`` is given, the offset to start returning packages from\n    :type offset: int\n\n    :rtype: list of strings\n\n    ", "success": true, "result": ["0026aa70-cc6d-4f6f-8c2f-554a2f9b17f2", "00nlSr3zHd3S6PiCskoXg", "00yyo0vinq079ZH4FcOqw", "01009127-5ddf-4f69-8a6b-30e6218f17bb", "014be465-c941-4ad0-9817-b4de72e19773", "014HVGAGnu32p17RrVj5KQ", "01Al806on2wfDK73I3Zt4Q", "01gR6AIEivlA5S11A3MCA", "01UdtrDlyqeo2JxuPaEw", "01VNHNMrYRAezdyznUwcGA", "02008597-88e9-43d5-bea8-d4371639e13f", "02764dtQ8W5U2bhw10VBug", "027BWD1UDQj476a3D7qaHw", "02893547-8e38-47f5-b214-5fb8a37d9faa", "02UW14x9q6MVX931kxYSLw", "034ff0b2-9ebe-479c-a838-2def29380287", "035bbac8-23e7-4687-b229-5b5a292d2f4b", "039dBmDq4iZvuLjbILX50w", "03c06463-d7cd-4f17-b7b6-ea818b85b6af", "03cJxZRSJLXahrx1ccyng", "03de7eee-495f-480e-b028-03f84f947b19", "045h2PGTBBspSXEl79b0og", "04BtWy7pJluqA0SeyTouSQ", "058ee05e-86b4-4894-8377-81267dbe179a", "05by660gHb47eXmYNn6ObQ", "05PE4YBfxNlZU11b0v3WeA", "05QQFZVkyZlgDtpDd04a1Q", "05tGKVaxeAB5Ps0FClQQ", "069opSeTzWIIqKoHUB1fpw", "06c3f085-c1e3-4228-949d-82a0899b8d7d", "06xNIySdRkP4L8E7ojCoQ", "070313e1-62ca-4425-b5e0-8491d749225f", "077251dc-623b-4211-981a-0b4216b41573", "07eNoJR5YLHx6vF9fCScWQ", "0850da4c-d6bd-478f-9732-368ef7c5b34f", "0862f0ea-5c0f-434f-b1b4-ffae75f5c14a", "08eixS1Xxli7MNTAswHyw", "08HmrONNfD5kmLXKr1hRw", "091y8nIbX7nc84PpJK4mLA", "092253aa-f133-45d2-8a4c-c9e368b8a282", "094dd2fb-7da0-45c9-826e-e9abf5181b3e", "0955d18e-d9aa-49ac-a1ba-1cbb9feab518", "09991d93-0336-4e34-9d3e-5927b9658a3d", "0aa0acbf-360b-4204-96af-16c344e10f83", "0AGwHhUIZw1TjG6sYBbUQ", "0AyFIKleliftBMCHmkj9g", "0AyK7hOAKeT5gwL8uqlkA", "0b43qSuQlf3kkw6NbcNl9A", "0BcLebQNyORtMkuyMhMXQ", "0BEvHIucAIsQ87Jd7S8MqA", "0bGFOH8TKh7XK0T8mIejIQ", "0bkTDZbK1IK9kRMgVkoIVg", "0Bvu3ZQYOpL6bKMTYJEwQ", "0c4916f3-dda1-452f-ac74-9d7a5f3cee91", "0c4pFjNlVLu7Up3PzeqPA", "0c88efca-34b5-4173-8c6f-71d377d8ced9", "0cd0569a-40dd-4ac9-9607-6664ab8621e4", "0ce69f28-b0a4-43f6-8d59-34cc6e47e825", "0CsThGts1mJ1PWgY6raA", "0DFtwvUNwoak60Cjoth4PQ", "0DHwDzBmXdmzOctV41bw", "0DlTmUzIa0DKqKUfCQpQ", "0DSdA1SkeoQslKlmcCI0qA", "0e4f3d28-1803-41bd-ac69-e6855a6bb673", "0e517484-1da4-4c30-a863-1d50380aebab", "0e9Z0d8pIWaDRD93KGoPA", "0eC0pySFXJUmd3ddfhmw", "0EeeJVZp9NnQ4zuZBO7Qw", "0EHGaWtcty4mH5UyXEFtOg", "0ESxqyNDcRWLp5hRMQ4S7Q", "0EWG1CVIihhjzBzpXBQ9A", "0ezRfUTsOzYk2VIVCshMwQ", "0fF1HLaLi9TaD0h34xySAw", "0FIs0X3hw7Z3Y2sKUDz3Q", "0fKpQ4LHEyJd2lhavATbBg", "0fWhsXTPav9Oqe0pSIk60w", "0fZDoANCINPHN6PZpLvwJQ", "0G0AiBHBzVY29qqrO6IKmQ", "0g43rA0OkEFIdFa5hwaYcQ", "0GK3LSmqeMaxnaPwGEM7w", "0Go9WjhluvqX8lp7dVfJAQ", "0gr4alq4bbjaqFk1pzCyfw", "0HCIvr886n3PtZXpxPpQ", "0HJqHMK58Hp8cLFovRINWQ", "0HQZ0pK1yBl8ECqi1EFJFg", "0huBUhw2TIffIIt7kQ0H4w", "0I7bFKnEabRuAz1yM6GNmA", "0i9pslWbvx5O4udujedXA", "0ikDf6NGGf0wbSW1kxfDVg", "0incdFUvqGzlsp2Uw7YSXw", "0IwrIzxxznpq2l0w090aFQ", "0JeKIamUI3HSO5b5y3DrXA", "0Jivy6Wzlv0DQJ5KXM5afQ", "0JtuTJXLFbTCYrJPfGeSjA", "0k5QbBJ0AW2rNMGDx9o59w", "0kaBheCQ3oXi2KRp5YUncg", "0Kc2c7ZUYnsGjOTREVNXg", "0KoBmReheMCk2R963g9A", "0KRah45iDnN55Kr9tbEHA", "0Lb2HIJK7ObuWx1R9t8jg", "0LD0nNZ87M4ccUTXYGJ4Jw", "0ld9o8Hh2fTdkkFDSkvEpQ", "0lpM5slajiPhET8fBbhQ", "0MSzaIs8OpFVODSUYMiJdA", "0mUwC0iW5kZPIlC7iuzFTw", "0NDG8BUzh2EKz5mMZ33ftQ", "0NpMrct2F8SaxVJe4IH4iw", "0oGLeUcN7ujwx1gK3Ui7Aw", "0p2kgeTQWhRKdCSG1NaCAw", "0PX6mDdquzkdIP8Ne2jzPw", "0QeabZr6K1EJ9n57nDCrw", "0QICYBpFZzjJopef7hcQ", "0qJ4Mr5lNtDhhBMwgUrRzA", "0QXCBe6cDjeoulIjcA3m1g", "0r3hV99Wi6DSK5xXm4F9lg", "0s38RFT1zTOhDDZYpBDLzw", "0SUKTMcL8k56px685thSA", "0tpcSCM4OztFluyfJXcsQQ", "0TX5vxBShYRa2gI9l0cUQ", "0u6Zl5d3MSiRUcz1ukCIQ", "0UcaI0ZiORTAWx3KRrETg", "0uLbhyWlEOXmFfaBftm1OA", "0VBw8TMhyyJeGyqaCdI6w", "0vImWjR4DD5L4MQgCN2YA", "0vJJxEO0roaxFfaGImBCw", "0w5g2pCMJXubwxQybOUDeA", "0wELCi19jzWrPirz7ungw", "0WHbVRbuHRiPwtsGDjDA", "0WiHwLBdGbCqMQrgAedMnQ", "0Y156fN4aUCbs329Odbww", "0YNbNrIw7D0QgyAkOg6WQ", "0YS7W9sFHJvHKRo2qOvDIw", "0yziFrBPQHEAdyXoTSOQw", "0Z1B5MiUIhwMTEyJGiTyUg", "0z7JybZRsgo8FGe2k40hgA", "0zktPsH3mPZuwmdXfgrNeg", "10161412-a76c-42b0-b4e1-5fcccdc412b2", "10ienuiMLAaaIa3yI1jqQ", "1109f14e-9fc2-49f1-b640-4a5061f47448", "11da13a8-c590-4309-bc0d-f5ff3b48630a", "11xvtQPrKSOE69On0yB9Q", "12a0090f-a32c-4f7c-adf0-3e6c1316e2fb", "12c969ce-f65b-4941-933c-56f5cc5e9cf6", "12KrrtpsSr9IrX11FRWn6g", "1460a2af-410c-4d47-9ca5-2be21c7a2faf", "14baf291-0075-4144-9169-882d1bc70e7b", "14Hy8FlpcDdELiCtrDTb1w", "14MUxZfR3b8VRVRDZYMWGg", "14PLQoLvxebT89uuhH1TUQ", "1552adcf-4c4c-4cc7-9d4b-435908931481", "15557bf6-827d-449a-ac29-6d39e5517a0e", "1585ede2-cb4a-4ea8-a350-fd8c24939f21", "15d9aZGjZtcxsZYn6u9w", "15VFgw7OfJtposeYfFAAIA", "162a733f-c704-4398-8d0b-e42b29b832cd", "1648f775-4e12-4e60-b50b-f500007cf532", "16Cd66F7inbvtV1HFZnXbw", "16N8n4NEhH7iPyO3zRhWg", "170dc73f-e253-427c-9f48-704df35cf500", "1778kJJJ3lHsX9cPcPjZcw", "17gcEEh2o5v1zCqd2FN07A", "17YhfamQPh8MMwkreiEHg", "18d2ddd2-41f3-46c9-8056-36f78066b95e", "18lQMpvhCb1GW1Q7kWukxA", "191c8984-8863-4752-a429-2bb25dabac25", "19bddaf8-c73f-4af0-bdfa-0adec3ac617d", "1A0xTKYZjQCxZjQSEX1IHQ", "1a3c0TVF44AkkoaTSJ0sbA", "1b468d6a-2138-4ba1-8f62-a6d46446187d", "1b9ff85e-496e-4742-83f4-075aaf01bff7", "1bBus1e2OpGWLJ4f50fNuQ", "1bJMmvKRDWMKvodCkSz9bA", "1bq6DCajAcdYhtZivSnYew", "1bTqrzONb8l4KCq3iM3Ww", "1bZsl5jJM9reGaaRKjw", "1C3s7M8OCelr5LODPb33Ug", "1cCAxoFnUreHzbARC6Fmg", "1cr0v3RHHsYjbwuXVBQRg", "1csVnCaxYKhjvGXUBEchA", "1CtgLZ3oPLHHm3bhbBvqw", "1cX9zbvbkWaIM9nKDFRTPQ", "1czaoXlluHOKl4EQNjrlZQ", "1CzG726G1r91LWjL1A9Vfg", "1D4Tm4fbnfaD7udwhvQ", "1DisxknQWEjgbgrHglnANg", "1E2tJHM1EP6aDUAgsGQE6A", "1e85334b-f14b-4caf-a90e-8185b15c04f3", "1EAqs4HYRrbAPCNa3Kqsog", "1eb99ac9-df2f-451f-8eb9-3e019c1351ef", "1ed24a81-30d2-4cfc-b302-94a407edab00", "1eGaB1iWV1c7hHWfPwLJA", "1eGSWEvhQclzMIgzUxVTcg", "1eHNOm3vNasLvwMBasKGEA", "1ELqnMvQlG9IRaiDgA4Nhg", "1eYHRCqE1Eyt5udCPJHg", "1eYWp4Sf7pDrDoLUL3zghw", "1f3rgcgsrbFVyOFwg2rTlg", "1fdbd415-2f1d-4d8c-a91a-d5cf5e0e34c8", "1FemLsBHLxX68qPylPNsQ", "1FWht6AoxbV5CMLDqj7Qgw", "1gO2s7onyLAMfuFKKjP5DA", "1gtfc3QEGJPC5To2a2FFg", "1gXgb0Yj73R4ttDChQ5Wyg", "1GYfwNKa8eq6lQvYEFUHw", "1HNOPVEKp6zc8p3PofTmQw", "1hONagdzHjNybhRmx3yveg", "1hRqhsBQBE4TELgaR2ivxw", "1htUQjRVIw2e1bE6M8wGaw", "1i2oWRLYQqNrk7kvLLImg", "1I4EEO02IQTDX5IeZW9Z4Q", "1i9MrjnMzaXU2vQG64u2hQ", "1Ikiyjib37tg6VAX4GQzw", "1iTgzmoy934kwiTxJG8Q", "1iUJdCgqGp9QQe3MkTBRlg", "1JHgJKnq9IvYagiPszOHg", "1JmnH3ZCIAcySgoxJhpGQ", "1JqKZ3WbA7ey4W2xE9U4rQ", "1JWeorKn95EmREJKdtiP4A", "1jZLuEwsqVO7b7vafG4sQ", "1K3GFEgUAwbwBVdCZg3g6w", "1KKEGGBWwoBhquUi7aKbBw", "1kKWpmv3pNX3HuvZabZ0Bg", "1kM3ttLDimQZ7hsNYeQ2Cw", "1kMtoMo04UQXAotTS9YLQ", "1KNEJIFSG7jHcPFDk61Vcg", "1KpD2tEE2xilj46X3sEDw", "1l9t6FgxhcB9PfZZ8bTpeQ", "1laC5Mm8ON1yHVuRGaNUyg", "1M5i15ts14K3P8LM5KlbQ", "1MvP9CbKGiqc6bAmJkZAw", "1MY2zqGF0SVJQOSHBm63IA", "1N4voMyq1wU1L7kMsKJOA", "1NDVlD1bRrpOgLhDd6rGEw", "1nGg2Xr7IwpvWClezJGFEQ", "1NNt4WKh3PkKjSixMbzOow", "1obHi0FADlY0oIEHpPZDBg", "1OEJUOBj20AeH2oP5GFObw", "1OKHotbVOmcCHJedQaJA", "1OsHTGl88ppeOfAh3PLnA", "1pdEHxMf8Q9YeXGyF1PYHQ", "1pINeSMDTux81h8Fual5XA", "1pjNsn20B1IFwMS9j8y5A", "1pKG33rJlWTgAtbRq1TmIA", "1q8Bgy4l7VDBgk3pUK5eqA", "1qplcVwmQhCHtawHFiT03Q", "1qPYHyLAljvk9poiUBSliw", "1RenennHOFdExF7ULLZ8A", "1rI2fpIrgQ6gAqVY1nTQ", "1RYS6LkaHbBVntb0AUKPuQ", "1RyYWBQimbgb5PL5X8y5hw", "1sLdeqGJedQNHLGtjw5iA", "1sMEPEJwKLPqPtCl7WdxEw", "1ST5QDGkrC6jfEcd1NM7Q", "1Su74DGNnm1dPJYFBtr9A", "1TwVcapyPjqKRssnhnAVw", "1tZmA9uqQ1YrIBbiCfNmjA", "1uaH83b3JAj9m7EpcH7hg", "1uDQT9bJftP6r3RLmTrPXw", "1uky56mVg9zIKWL3VwTdiQ", "1ULu4DvwiXz8wDbNOO1w7w", "1vEapGtE3pbjUlzg4yulg", "1VI7l5bEFsT5VMEBbs3K6w", "1vRVBOSQ1tS61K3JJlJUCw", "1wLz1WhwNUUpumWCgg8A", "1wvoo8mqOoMzNMIPIQFfA", "1WXBrG2qIdXbFoTzDk2QqA", "1wz897bGlmMuLwV1VDUrPg", "1x2z2IJTNyZ6ixMhkJOgg", "1X3d9bfhzwGDJQ66RYHoSQ", "1x6IstsQiuk93Z3rgayQ", "1XE6vBbzoGRGry1NIUeg", "1xEkOQMcFXDJzsb3O8paA", "1xo88QpFmfqtEdLfTuMsQ", "1yCCilmbry5QlwtaWf7G6w", "1YCqFQCdY7masvcJOeLpQ", "1YW7O2Uvg5LVnj2D5afog", "1ywD3KYxowN1otlNji0law", "1yY1g5uPZup1UlGSFI4Jw", "1Z6nRZ8H4pNcmVpnoHQlRg", "1zASOjjDFrzr6rnXvfF9hw", "1ZBVRgrdsxdh9BUsmv0PVQ", "1zBXUxNEUYJz42nl8dm0w", "1zdSlQgB76wOeZFQAD76pg", "1ZECfkedAFCrRn9lcujEJg", "1ZMNuqQZtr4LnJSgCFxTw", "1Zos8RNbabSUB1DZgJvXmw", "1ZSKJzEWtV8VWfhJepfwg", "1zW3i0tnLBGEBbUgXcDv5g", "2007-how-europeans-see-the-european-parliament", "2008-climate-change", "2009-climate-change", "2009-eurobarometer-european-elections", "2009-eurobarometer-european-elections-i", "2009-europeans-at-one-month-from-the-elections", "2009-post-electoral-survey", "2009-the-ep-and-the-expectations-of-european-citizens", "2009-women-and-the-european-elections", "2010-the-ep-and-the-expectations-of-european-citizens", "2011-100th-anniversary-of-the-international-women-s-day-bridging-the-gender-gap-in-the-eu", "2011-europeans-and-energy-i", "2011-europeans-and-energy-ii", "2011-europeans-and-the-crisis-i", "2011-europeans-and-the-crisis-ii", "2011-europeans-and-the-crisis-iii", "2011-europeans-and-the-crisis-iv", "2011-european-year-of-volunteering", "2011-poverty-and-exclusion-amongst-senior-eu-citizens", "2011-the-european-ombudsman-and-citizens-rights", "2012-crisis-and-economic-governance-v", "2012-the-gender-inequalities-in-the-european-union", "2015-desk-research-post-ee2014", "2015-major-changes-in-european-public-opinion-with-regard-to-the-eu", "2015-national-opinion-trends", "2016-ep-eurobarometer-perceptions-and-expectations-the-fight-against-terrorism-and-radicalisation", "201a449a-6342-4406-8e96-ca3c14882464", "20749df9-f622-4f54-8792-fc6b144aca02", "20LhAn850164O2D3UcaA", "21012c45-0433-41d6-910e-24a2e72cf35a", "2130da1b-d6c7-43c0-b695-9b32b59dc199", "21612774-34cb-4d89-97bf-27f5998f17fd", "218933e7-0511-4d1a-939d-9b5d5513e83f", "21dec055-da40-4efb-8c39-7013dd321f5c", "21JXAdhEnFDObxprUWbZQ", "2253eafa-eb4b-4373-8c93-f851e35b6aed", "22idWBRnfX1lZY3gupjA", "24254db1-79ab-4157-bb7a-ba2a7da9081c", "24362f17-7935-4a27-8cac-7a6e5b37bae8", "2440c32f-2214-4411-95e6-d404bca4fae2", "24vVIQaIrXyUGmNohrC8Q", "2606c8e8-135c-41ab-ae60-77ae7fa09d1d", "262f691a-8a2a-412e-bdc9-de57cce42da2", "272f6b7a-ee81-4678-89d5-cb327f754dc5", "274fb320-0d3d-48fd-a3ce-64fdffd0f3f2", "276cd873-1a2c-4953-9b43-354755d903ba", "278a5b7e-82be-4680-b872-3f3142bfce40", "2792b966-06fd-4336-8dec-bb442bc659d9"]}

Whoa, these are indeed a lot of datasets! (Note: I truncated the output for this post; the entire list has 11258 entries.) The key-value structure can be nicely seen, but the names are mostly cryptic. How do I find out if the desired dataset exists? From the dataset webpage, I know that the name of the dataset is “sts_inpr_m”. However, it cannot be found in that list.

Let’s try a search query instead! This is done with the dataset_search command, followed by a question mark, the characters 'fq=' and the search string. “fq” stands for filter query and lets us perform complex queries (see https://wiki.apache.org/solr/CommonQueryParameters), though here I just want to find datasets with the matching string value. In addition, I convert the returned JSON into a Python dictionary by applying the json() method on the response:

query = '/action/dataset_search?fq=sts_inpr_m'
response = requests.get(url+query)
response_dict = response.json()
print(response_dict)
{'help': '\n    Searches for packages satisfying a given search criteria.\n\n    This action accepts solr search query parameters (details below), and\n    returns a dictionary of results, including dictized datasets that match\n    the search criteria, a search count and also facet information.\n\n    **Solr Parameters:**\n\n    For more in depth treatment of each paramter, please read the `Solr\n    Documentation <http://wiki.apache.org/solr/CommonQueryParameters>`_.\n\n    This action accepts a *subset* of solr\'s search query parameters:\n\n\n    :param q: the solr query.  Optional.  Default: `"*:*"`\n    :type q: string\n    :param fq: any filter queries to apply.  Note: `+site_id:{ckan_site_id}`\n        is added to this string prior to the query being executed.\n    :type fq: string\n    :param sort: sorting of the search results.  Optional.  Default:\n        \'relevance asc, metadata_modified desc\'.  As per the solr\n        documentation, this is a comma-separated string of field names and\n        sort-orderings.\n    :type sort: string\n    :param rows: the number of matching rows to return.\n    :type rows: int\n    :param start: the offset in the complete result for where the set of\n        returned datasets should begin.\n    :type start: int\n    :param facet: whether to enable faceted results.  Default: "true".\n    :type facet: string\n    :param facet.mincount: the minimum counts for facet fields should be\n        included in the results.\n    :type facet.mincount: int\n    :param facet.limit: the maximum number of values the facet fields return.\n        A negative value means unlimited. This can be set instance-wide with\n        the :ref:`search.facets.limit` config option. Default is 50.\n    :type facet.limit: int\n    :param facet.field: the fields to facet upon.  Default empty.  If empty,\n        then the returned facet information is empty.\n    :type facet.field: list of strings\n\n\n    The following advanced Solr parameters are supported as well. Note that\n    some of these are only available on particular Solr versions. See Solr\'s\n    `dismax`_ and `edismax`_ documentation for further details on them:\n\n    ``qf``, ``wt``, ``bf``, ``boost``, ``tie``, ``defType``, ``mm``\n\n\n    .. _dismax: http://wiki.apache.org/solr/DisMaxQParserPlugin\n    .. _edismax: http://wiki.apache.org/solr/ExtendedDisMax\n\n\n    **Results:**\n\n    The result of this action is a dict with the following keys:\n\n    :rtype: A dictionary with the following keys\n    :param count: the number of results found.  Note, this is the total number\n        of results found, not the total number of results returned (which is\n        affected by limit and row parameters used in the input).\n    :type count: int\n    :param results: ordered list of datasets matching the query, where the\n        ordering defined by the sort parameter used in the query.\n    :type results: list of dictized datasets.\n    :param facets: DEPRECATED.  Aggregated information about facet counts.\n    :type facets: DEPRECATED dict\n    :param search_facets: aggregated information about facet counts.  The outer\n        dict is keyed by the facet field name (as used in the search query).\n        Each entry of the outer dict is itself a dict, with a "title" key, and\n        an "items" key.  The "items" key\'s value is a list of dicts, each with\n        "count", "display_name" and "name" entries.  The display_name is a\n        form of the name that can be used in titles.\n    :type search_facets: nested dict of dicts.\n    :param use_default_schema: use default package schema instead of\n        a custom schema defined with an IDatasetForm plugin (default: False)\n    :type use_default_schema: bool\n\n    An example result: ::\n\n     {\'count\': 2,\n      \'results\': [ {  }, {  }],\n      \'search_facets\': {u\'tags\': {\'items\': [{\'count\': 1,\n                                             \'display_name\': u\'tolstoy\',\n                                             \'name\': u\'tolstoy\'},\n                                            {\'count\': 2,\n                                             \'display_name\': u\'russian\',\n                                             \'name\': u\'russian\'}\n                                           ]\n                                 }\n                       }\n     }\n\n    **Limitations:**\n\n    The full solr query language is not exposed, including.\n\n    fl\n        The parameter that controls which fields are returned in the solr\n        query cannot be changed.  CKAN always returns the matched datasets as\n        dictionary objects.\n    ', 'success': True, 'result': {'count': 1, 'sort': 'title_string_en asc', 'facets': {}, 'results': [{'owner_org': '9c0d04b1-a494-4f70-80b2-54024b6f8b0c', 'maintainer': None, 'relationships_as_object': [], 'temporal_coverage_to': '2017-07', 'private': False, 'maintainer_email': None, 'num_tags': 2, 'geographical_coverage': [], 'contact_name': 'Eurostat, the statistical office of the European Union', 'id': '34caa835-f669-45eb-a2c4-4396550483bc', 'metadata_created': '2015-10-16T13:29:41.432353', 'modified_date': '2017-09-19', 'capacity': None, 'metadata_modified': '2017-09-20T05:18:05.148812', 'author': None, 'author_email': None, 'type_of_dataset': ['http://data.europa.eu/euodp/kos/dataset-type/Statistical'], 'interoperability_level': [], 'state': 'active', 'version': None, 'concepts_eurovoc': [], 'license_id': 'http://data.europa.eu/euodp/kos/licence/EuropeanCommission', 'type': 'dataset', 'resources': [{'mimetype': None, 'cache_url': None, 'hash': '', 'description': 'Download dataset in TSV format', 'name': None, 'format': 'application/zip', 'url': 'http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/sts_inpr_m.tsv.gz', 'created': '2017-09-20T07:18:05.234412', 'state': 'active', 'webstore_url': None, 'revision_timestamp': '2017-09-20T05:18:05.143313', 'tracking_summary': {'total': 0, 'recent': 0}, 'mimetype_inner': None, 'download_total_resource': 0, 'url_type': None, 'position': 3, 'resource_group_id': '11d6cb62-8ab1-493e-910c-7481361d2fdc', 'revision_id': 'bc2f3973-5b84-4dec-9490-0b6f91080c6c', 'id': 'aad886d6-480c-43be-a05e-36a81124b6d6', 'resource_type': 'http://www.w3.org/TR/vocab-dcat#Download', 'size': None}, {'mimetype': None, 'cache_url': None, 'hash': '', 'description': 'Download dataset in TSV format (unzipped)', 'name': None, 'format': 'text/tab-separated-values', 'url': 'http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/sts_inpr_m.tsv.gz&unzip=true', 'created': '2017-09-20T07:18:05.234417', 'state': 'active', 'webstore_url': None, 'revision_timestamp': '2017-09-20T05:18:05.143313', 'tracking_summary': {'total': 2, 'recent': 0}, 'mimetype_inner': None, 'download_total_resource': 0, 'url_type': None, 'position': 4, 'resource_group_id': '11d6cb62-8ab1-493e-910c-7481361d2fdc', 'revision_id': 'bc2f3973-5b84-4dec-9490-0b6f91080c6c', 'id': '007550f6-caa0-44aa-a1e3-38f44c82832c', 'resource_type': 'http://www.w3.org/TR/vocab-dcat#Download', 'size': None}, {'mimetype': None, 'cache_url': None, 'hash': '', 'description': 'Download dataset in SDMX-ML format', 'name': None, 'format': 'application/zip', 'url': 'http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/sts_inpr_m.sdmx.zip', 'created': '2017-09-20T07:18:05.234423', 'state': 'active', 'webstore_url': None, 'revision_timestamp': '2017-09-20T05:18:05.143313', 'tracking_summary': {'total': 0, 'recent': 0}, 'mimetype_inner': None, 'download_total_resource': 0, 'url_type': None, 'position': 5, 'resource_group_id': '11d6cb62-8ab1-493e-910c-7481361d2fdc', 'revision_id': 'bc2f3973-5b84-4dec-9490-0b6f91080c6c', 'id': '575288b8-bc1d-4a61-9285-84224286ae8c', 'resource_type': 'http://www.w3.org/TR/vocab-dcat#Download', 'size': None}, {'mimetype': None, 'cache_url': None, 'hash': '', 'description': 'sts_esms.htm', 'name': 'ESMS metadata (Euro-SDMX Metadata structure) HTML', 'format': 'text/html', 'url': 'http://ec.europa.eu/eurostat/cache/metadata/en/sts_esms.htm', 'created': '2017-09-20T07:18:05.234382', 'state': 'active', 'webstore_url': None, 'revision_timestamp': '2017-09-20T05:18:05.143313', 'tracking_summary': {'total': 16, 'recent': 0}, 'mimetype_inner': None, 'download_total_resource': 0, 'url_type': None, 'position': 0, 'resource_group_id': '11d6cb62-8ab1-493e-910c-7481361d2fdc', 'revision_id': 'bc2f3973-5b84-4dec-9490-0b6f91080c6c', 'id': '0a1db225-967e-4aa3-930b-6fc11a6b011f', 'resource_type': 'http://data.europa.eu/euodp/kos/documentation-type/MainDocumentation', 'size': None}, {'mimetype': None, 'cache_url': None, 'hash': '', 'description': '', 'name': 'ESMS metadata (Euro-SDMX Metadata structure) SDMX', 'format': '', 'url': 'http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=metadata/sts_esms.sdmx.zip', 'created': '2017-09-20T07:18:05.234398', 'state': 'active', 'webstore_url': None, 'revision_timestamp': '2017-09-20T05:18:05.143313', 'tracking_summary': {'total': 1, 'recent': 0}, 'mimetype_inner': None, 'download_total_resource': 0, 'url_type': None, 'position': 1, 'resource_group_id': '11d6cb62-8ab1-493e-910c-7481361d2fdc', 'revision_id': 'bc2f3973-5b84-4dec-9490-0b6f91080c6c', 'id': '318f8ac0-703a-4ba2-8588-7ef9378b3034', 'resource_type': 'http://data.europa.eu/euodp/kos/documentation-type/RelatedWebPage', 'size': None}, {'mimetype': None, 'cache_url': None, 'hash': '', 'description': '', 'name': 'More information on Eurostat Website', 'format': '', 'url': 'http://ec.europa.eu/eurostat/web/products-datasets/-/sts_inpr_m', 'created': '2017-09-20T07:18:05.234405', 'state': 'active', 'webstore_url': None, 'revision_timestamp': '2017-09-20T05:18:05.143313', 'tracking_summary': {'total': 0, 'recent': 0}, 'mimetype_inner': None, 'download_total_resource': 0, 'url_type': None, 'position': 2, 'resource_group_id': '11d6cb62-8ab1-493e-910c-7481361d2fdc', 'revision_id': 'bc2f3973-5b84-4dec-9490-0b6f91080c6c', 'id': 'cd7bcffc-6145-4f9e-9fd1-e61a7e6ed25d', 'resource_type': 'http://data.europa.eu/euodp/kos/documentation-type/RelatedWebPage', 'size': None}], 'status': ['http://data.europa.eu/euodp/kos/dataset-status/Completed'], 'num_resources': 6, 'description': 'Production in industry - monthly data', 'temporal_coverage_from': '1953-01', 'language': [], 'tracking_summary': {'total': 626, 'recent': 2}, 'groups': [{'display_name': 'http://eurovoc.europa.eu/100147', 'description': '', 'image_display_url': '', 'title': 'http://eurovoc.europa.eu/100147', 'type': 'eurovoc_domain', 'id': '0b3e112c-4a87-450a-8212-dd997bd8ffaf', 'name': 'eurovoc_domain_100147'}, {'display_name': 'http://eurovoc.europa.eu/100152', 'description': '', 'image_display_url': '', 'title': 'http://eurovoc.europa.eu/100152', 'type': 'eurovoc_domain', 'id': '08994bb7-2827-4215-a10b-d0f520fd70bb', 'name': 'eurovoc_domain_100152'}, {'display_name': 'http://eurovoc.europa.eu/100160', 'description': '', 'image_display_url': '', 'title': 'http://eurovoc.europa.eu/100160', 'type': 'eurovoc_domain', 'id': '23bb695d-0233-4ed9-bfbb-38e1bf181b80', 'name': 'eurovoc_domain_100160'}], 'creator_user_id': '75d0190a-2f12-40ed-a863-b5bf30990d14', 'relationships_as_subject': [], 'license_title': 'Legal Notice', 'revision_timestamp': '2016-05-03T10:06:49.773481', 'organization': {'description': '', 'created': '2012-12-12T13:50:52.627310', 'title': 'Eurostat', 'name': 'estat', 'revision_timestamp': '2012-12-12T12:50:52.601223', 'is_organization': True, 'state': 'active', 'image_url': '', 'revision_id': 'c9dbca7b-695f-4247-a5b5-0ae1818f9d53', 'type': 'organization', 'id': '9c0d04b1-a494-4f70-80b2-54024b6f8b0c', 'approval_status': 'approved'}, 'name': 'dZzomwHlfy7S3KYLnVaSLg', 'isopen': True, 'url': 'http://ec.europa.eu/eurostat/web/products-datasets/-/sts_inpr_m', 'temporal_granularity': [], 'contact_address': 'Joseph Bech building, 5 Rue Alphonse Weicker, L-2721 Luxembourg', 'contact_telephone': ' +352 430136789', 'rdf': '\n  \n    \n    \n    \n    Production in industry - monthly data\n    Production dans l\'industrie - données mensuelles\n    Produktion in der Industrie - monatliche Daten\n    sts_inpr_m\n    Production in industry - monthly data\n    Production dans l\'industrie - données mensuelles\n    Produktion in der Industrie - monatliche Daten\n    \n      \n    \n    \n      \n        \n      \n      ESMS metadata (Euro-SDMX Metadata structure) HTML\n      http://ec.europa.eu/eurostat/cache/metadata/en/sts_esms.htm\n    \n    \n      \n        \n      \n      ESMS metadata (Euro-SDMX Metadata structure) SDMX\n      http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=metadata/sts_esms.sdmx.zip\n    \n    \n      \n        \n      \n      More information on Eurostat Website\n      Plus dinformation sur le site web dEurostat\n      Mehr Informationen auf der Eurostat Website\n      http://ec.europa.eu/eurostat/web/products-datasets/-/sts_inpr_m\n    \n    \n      http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/sts_inpr_m.tsv.gz\n      \n      Download dataset in TSV format\n      application/x-gzip\n    \n    \n      http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/sts_inpr_m.tsv.gz&unzip=true\n      \n      Download dataset in TSV format (unzipped)\n      text/tab-separated-values\n    \n    \n      http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/sts_inpr_m.sdmx.zip\n      \n      Download dataset in SDMX-ML format\n      application/zip\n    \n    2017-09-19\n    \n      \n    \n    \n      \n        \n        \n        Joseph Bech building, 5 Rue Alphonse Weicker, L-2721 Luxembourg\n        Eurostat, the statistical office of the European Union\n      \n    \n    \n      \n    \n    \n      \n    \n    \n      \n        1953-01\n        2017-07\n      \n    \n  \nhttp://ec.europa.eu/eurostat/web/products-datasets/-/sts_inpr_m2017-09-19T09:43:23.7050912015-10-16T13:29:41.432353', 'license_url': 'http://ec.europa.eu/geninfo/legal_notices_en.htm', 'contact_webpage': 'http://ec.europa.eu/eurostat/help/support', 'title': 'Production in industry - monthly data', 'revision_id': '2b9c1c9f-e230-45ce-8f2d-34128e25fbb6', 'identifier': 'sts_inpr_m'}], 'search_facets': {}}}

Aha, now we are getting something! It is still quite a complex dictionary though, with several sub-layers (dictionaries of dictionaries). There is only one matching dataset, whose metadata are hidden in "response_dict['result']['results'][0]". Where does the string 'sts_inpr_m' appear? Let’s iterate over all keys and search for a match:

found_dataset_dict = response_dict['result']['results'][0]
for key, value in found_dataset_dict.items():
    if value=='sts_inpr_m':
        print(key)
identifier

Okay, so ‘sts_inpr_m’ is not the name of the dataset, but its “identifier”! But – what IS its name after all?

print(found_dataset_dict['name'])
dZzomwHlfy7S3KYLnVaSLg

Yikes! No wonder I didn’t find it… A more useful, human-readable name is the title:

print(found_dataset_dict['title'])
Production in industry - monthly data

Downloading the dataset

This is much better indeed. Now that I can access the metadata – how do I get the actual data? Surprisingly, I cannot use the API to retrieve specific parts of the data in JSON format but have to download the whole dataset as a file. The available download options are listed in a convoluted way in the metadata, more specifically in the 'resource' key. The closest thing to what I want is the data as tab-separated values (TSV) in resource No. 1:

print(found_dataset_dict['resources'][1])
{'mimetype': None, 'cache_url': None, 'hash': '', 'description': 'Download dataset in TSV format (unzipped)', 'name': None, 'format': 'text/tab-separated-values', 'url': 'http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/sts_inpr_m.tsv.gz&unzip=true', 'created': '2017-09-20T07:18:05.234417', 'state': 'active', 'webstore_url': None, 'revision_timestamp': '2017-09-20T05:18:05.143313', 'tracking_summary': {'total': 2, 'recent': 0}, 'mimetype_inner': None, 'download_total_resource': 0, 'url_type': None, 'position': 4, 'resource_group_id': '11d6cb62-8ab1-493e-910c-7481361d2fdc', 'revision_id': 'bc2f3973-5b84-4dec-9490-0b6f91080c6c', 'id': '007550f6-caa0-44aa-a1e3-38f44c82832c', 'resource_type': 'http://www.w3.org/TR/vocab-dcat#Download', 'size': None}

The URL can be found in the 'url' key. With this information, I can load the data from the online TSV file directly into a Pandas dataframe, using the read_csv method. However, for the method to work properly, I need to specify that the separators can be tabs OR commas (in the form of a regular expression, where | stands for the OR). This is because the index columns are separated by commas, and the data columns are separated by tabs. Also, because the first five columns of the file belong to the index, I need to tell this to pandas by passing these column indices as a list to the index_col parameter. Then the dataframe is created with a correct MultiIndex:

dataset_tsv_url = found_dataset_dict['resources'][1]['url']
print(dataset_tsv_url)
df = pd.read_csv(dataset_tsv_url, sep='\t|,', index_col=[0,1,2,3,4])
print(df.head(3)) # printing the first three rows of the data frame
http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/sts_inpr_m.tsv.gz&unzip=true

                                     2017M07  2017M06  2017M05  2017M04   \
indic_bt nace_r2 s_adj unit geo\time                                       
PROD     B       CA    I10  AT             :   108.5 p   110.8    114.1    
                            BA         115.4    116.8    109.6    107.0    
                            BE             :   111.4 p  110.5 p  110.3 p   

                                     2017M03  2017M02  2017M01  2016M12   \
indic_bt nace_r2 s_adj unit geo\time                                       
PROD     B       CA    I10  AT         113.0     92.3    100.4     86.4    
                            BA         123.2    104.5     91.3    120.9    
                            BE        119.3 p   97.1 p   82.0 p   94.3 p   

                                     2016M11  2016M10    ...   1953M10   \
indic_bt nace_r2 s_adj unit geo\time                     ...              
PROD     B       CA    I10  AT         111.3    107.2    ...         :    
                            BA         119.3    127.3    ...         :    
                            BE        120.3 p  124.8 p   ...         :    

                                     1953M09  1953M08  1953M07  1953M06   \
indic_bt nace_r2 s_adj unit geo\time                                       
PROD     B       CA    I10  AT             :        :        :        :    
                            BA             :        :        :        :    
                            BE             :        :        :        :    

                                     1953M05  1953M04  1953M03  1953M02   \
indic_bt nace_r2 s_adj unit geo\time                                       
PROD     B       CA    I10  AT             :        :        :        :    
                            BA             :        :        :        :    
                            BE             :        :        :        :    

                                     1953M01  
indic_bt nace_r2 s_adj unit geo\time          
PROD     B       CA    I10  AT             :  
                            BA             :  
                            BE             :  

[3 rows x 775 columns]
print(df.info())  # printing infos about the structure and size of the data frame
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 3017 entries, (PROD, B, CA, I10, AT) to (PROD, C107, SCA, I10, FR)
Columns: 775 entries, 2017M07  to 1953M01
dtypes: object(775)
memory usage: 17.9+ MB
None

The data columns give the monthly industry production index values from July 2017 all the way back to January 1953, albeit with many missing entries (denoted by the colons).

There is an additional problem here that is not that obvious. The df.info() shows that there are only 3017 rows. However, a manual download of the zipped file (without the &unzip=true in the URL) and inspection in Notepad++ revealed that there should be 19198 rows (including the header row). What happened to all the other rows? Well, apparently pandas works well, but there is a bug in the EU ODP.

A workaround is getting rid of the &unzip=true part of the URL and let pandas do the unzipping by adding compression='gzip' to the read_csv() command:

print(dataset_tsv_url[:-11]) # Trimming the URL
df = pd.read_csv(dataset_tsv_url[:-11], compression='gzip', sep='\t|,', index_col=[0,1,2,3,4], engine='python')
print(df.info())
http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/sts_inpr_m.tsv.gz
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 19197 entries, (PROD, B, CA, I10, AT) to (PROD, MIG_NRG_X_E, SCA, PCH_PRE, UK)
Columns: 776 entries, 2017M08  to 1953M01
dtypes: object(776)
memory usage: 113.8+ MB
None

This looks much better!

Since the purpose of this project was to just import the data though, I am finished here!

Conclusion

There is a lot of data out there waiting to be discovered and analyzed. Often public databases can be accessed via API’s, which might in theory be convenient to automatically retrieve data. However, the presented example featuring data from the European Union Open Data Portal showed some pitfalls. In practice, it can be laborious to compile all necessary information about the API and the dataset. For instance, the documentation of the API is spread over several webpages and the dataset has different identifier keys (“name”, “title”, “identifier”), which are easy to confuse. Moreover, the dataset itself cannot be accessed via the API (to retrieve only selected parts) but has to be downloaded as a whole in form of a file. The decompression option for the file download led to a truncated dataset.

For the next time, it might be instructive to choose a data source that allows to retrieve parts of a dataset directly via the API through more complex queries. Apparently, a better documented access to the data is provided by Eurostat. On the other hand, dealing with this dataset through the EU Open Data Portal demonstrated which kind of challenges one has to be prepared to face when working with data from the web.

Code

The project code was written using Jupyter Notebook 5.0.0, running the Python 3.6.1 kernel and Anaconda 4.4.0.

The Jupyter notebook can be found on Github.

Bio

I am a data scientist with a background in solar physics, with a long experience of turning complex data into valuable insights. Originally coming from Matlab, I now use the Python stack to solve problems.

Contact details

Jan Langfellner
contact@jan-langfellner.de
linkedin.com/in/jan-langfellner/

PhD thesis

Impressum

2 thoughts on “Automated data retrieval from an online repository to harness free data sources

Leave a Reply