Making the numbers shine: Cleaning EU industry production index values

TL;DR

I make the EU industry production index values, which I previously brought into a tidy form, ready for analysis by splitting the numbers from the flag letters with pandas methods and regular expressions.

Long Description

Now that the EU industry production dataset has a tidy dataframe structure, I clean up the production index values. The values are stored as strings and may contain additional flag letters; missing values are denoted by colons. Information about the meaning of the flags can be found in the online documentation at the original data source. I identify the flags with the help of regular expressions and the Python “re” package and store them as categorical data in a new column using the pandas “apply” method. With another set of regular expressions, I extract the actual numbers from the production index column and convert them to floating-point values; the missing data are stored as NaN values. After this treatment, the production index values are ready for an actual data analysis.

Project Background

In the previous project, I dealt with the structure of the EU industry production dataframe, i.e. the rows and columns. Here, the focus is on cleaning the actual production index values. This is necessary to allow for an easy analysis of the dataset using pandas methods and functions from Python visualization packages (e.g., matplotlib, Seaborn).

Inspecting the dataframe

Tidy dataframe structure

I start with the dataframe in the tidy shape from the previous project. As a reminder, this is what the dataframe looks like:

df.info()
print(df.head())
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 27936 entries, (1953-01-01 00:00:00, AT) to (2017-08-01 00:00:00, UK)
Data columns (total 1 columns):
production_index    27936 non-null object
dtypes: object(1)
memory usage: 306.5+ KB

                        production_index
time       country_code                 
1953-01-01 AT                          :
           BA                          :
           BE                          :
           BG                          :
           CY                          :

In the tidy form, each row is an observation and each column a variable.

Production index values

The production_index column consists of string (non-null object) values. In the head shown above, however, there are only colons.
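To double-check that every entry really is a plain Python string (and not a mix of types hiding behind the generic object dtype), one can map the built-in type() over the column; a quick sketch, output not shown:

# Count how often each Python type occurs in the column (expected: only str)
print(df['production_index'].map(type).value_counts())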

To get more useful insights into the possible values, let’s have a look at the most recent production index values for Austria (country code ‘AT’) – say, the last year. Because I want to specify a range in the time component, I pass slice() with the beginning and the (inclusive) end of the period to the loc[] indexer:

print(df.loc[(slice('2016-09','2017-08'),'AT'),:])
                        production_index
time       country_code                 
2016-09-01 AT                     113.7 
2016-10-01 AT                     114.7 
2016-11-01 AT                     116.4 
2016-12-01 AT                     115.6 
2017-01-01 AT                     113.2 
2017-02-01 AT                     116.5 
2017-03-01 AT                     117.5 
2017-04-01 AT                     117.6 
2017-05-01 AT                     117.3 
2017-06-01 AT                     118.2 
2017-07-01 AT                    119.9 p
2017-08-01 AT                         : 

In this selection, most values are floating-point numbers, as one would expect of an index. The penultimate value, however, contains the letter “p”. What does that mean?

Flag values

A quick look at the interactive online table version of the dataset reveals the nature of these letters (notably, this information is not part of the EU ODP metadata, but it is at least available on the Eurostat webpage):

Available flags:
b break in time series
c confidential
d definition differs, see metadata
e estimated
f forecast
i see metadata (phased out)
n not significant
p provisional
r revised
s Eurostat estimate (phased out)
u low reliability
z not applicable

Thus the “p” marks a provisional value, in contrast to the confirmed values without this flag.

Missing values

The last value in the selection, a colon, stands for a missing value:

Special value:
: not available
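Before splitting anything, it is worth knowing how widespread these missing values are. A quick sketch that simply counts the colon entries (the resulting number is not shown here):

# Count entries that consist of a colon only (ignoring surrounding whitespace)
n_missing = (df['production_index'].str.strip() == ':').sum()
print(f'{n_missing} of {len(df)} production index values are missing')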

Splitting off the flag values

To clean up the production index values, the flags should be identified and moved to their own column (so that I don’t lose the information). The remaining numbers should be converted into floats, and the colons replaced by “NaN” values.

Defining a pattern with regular expressions

I start by extracting the flags from the production index values. I do this using regular expressions and the Python package “re”. The string values with flags, e.g., '119.9 p', follow a particular pattern that can be translated into a regular expression (see the expressions in parentheses): one or more digits ('\d+'), possibly followed by a decimal point ('\.?') and another digit ('\d?'), zero or more whitespace characters ('\s*'), and a group of flags as lowercase letters ('([a-z]+)'). Concatenating all these pieces yields the pattern string (written as a raw string so the backslashes are passed through literally), which is passed to the re.compile() function to obtain a pattern object that the “re” package can work with:

import re  # Package for regular expressions
flag_pattern = re.compile(r'\d+\.?\d?\s*([a-z]+)')  # Number, optional whitespace, then a group of flag letters

Checking production index values for the pattern

Now I define a function that checks if a string matches flag_pattern and returns the matching flags:

def get_flags(string_value):
    '''Returns lower-case letter flags (as string) from a production index string value'''
    match = flag_pattern.match(string_value)
    if bool(match):
        return match.group(1)  # Returns first matching group
    else:
        return ''              # Else return empty string

Applying the pattern matching to the dataframe

I need to pass this function to the apply() method of the dataframe in order to apply it to all rows and store the results in a new flags column:

df['flags'] = df['production_index'].apply(get_flags)
print(df.loc[(slice('2016-09','2017-08'),'AT'),:])
                        production_index flags
time       country_code                       
2016-09-01 AT                     113.7       
2016-10-01 AT                     114.7       
2016-11-01 AT                     116.4       
2016-12-01 AT                     115.6       
2017-01-01 AT                     113.2       
2017-02-01 AT                     116.5       
2017-03-01 AT                     117.5       
2017-04-01 AT                     117.6       
2017-05-01 AT                     117.3       
2017-06-01 AT                     118.2       
2017-07-01 AT                    119.9 p     p
2017-08-01 AT                         :       

Flag categories

How many different flag combinations are there? To find out, let’s use the unique() method:

print(df['flags'].unique())
['' 's' 'p']

So there are three different flag combinations:
'': no flags
'p': provisional
's': Eurostat estimate (phased out)

These strings mark distinct categories.

It is thus natural to convert the flag strings into categorical values. I can then look at the categories attribute to confirm the number of categories after the conversion:

df['flags'] = df['flags'].astype('category')
print(df['flags'].cat.categories)
Index(['', 'p', 's'], dtype='object')
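How often each flag actually occurs can be checked with the value_counts() method (the counts are not reproduced here); the empty string stands for “no flag”:

# Frequency of each flag category
print(df['flags'].value_counts())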

Extracting numbers and missing data

More regular expressions

As a next step, I extract the actual numbers from the value strings and replace the colons with NaN values. I can do this again using regular expressions and an extraction/conversion function:

import numpy as np  # Needed for NaN values

number_pattern = re.compile(r'(\d+\.?\d?)\s*[a-z]*')  # Number group followed by zero or more flags
nan_pattern = re.compile(r':')                        # Missing value

def get_number(string_value):
    '''Returns production index value (as float) from a production index string value'''
    if bool(nan_pattern.match(string_value)):
        return np.nan
    else:
        match = number_pattern.match(string_value)
        if bool(match):
            return float(match.group(1))  # Returns first (and only) matching group
        else:
            print(string_value)           # Show the offending value before raising an error
            raise ValueError('Production index value string does not match number pattern')

Applying the extraction function

I apply the get_number() function to the production_index column:

df['production_index_float'] = df['production_index'].apply(get_number)
print(df.loc[(slice('2016-09','2017-08'),'AT'),:])
                        production_index flags  production_index_float
time       country_code                                               
2016-09-01 AT                     113.7                          113.7
2016-10-01 AT                     114.7                          114.7
2016-11-01 AT                     116.4                          116.4
2016-12-01 AT                     115.6                          115.6
2017-01-01 AT                     113.2                          113.2
2017-02-01 AT                     116.5                          116.5
2017-03-01 AT                     117.5                          117.5
2017-04-01 AT                     117.6                          117.6
2017-05-01 AT                     117.3                          117.3
2017-06-01 AT                     118.2                          118.2
2017-07-01 AT                    119.9 p     p                   119.9
2017-08-01 AT                         :                            NaN
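As an aside, the same split of numbers and flags could also be done with pandas’ vectorized string methods instead of a Python-level apply(). A minimal sketch of this alternative (not the approach used above; the group names 'value' and 'flag' are purely illustrative):

import pandas as pd

# Extract the number and the flag letters in one vectorized pass;
# rows that do not match the pattern (the colons) become NaN in both columns
parts = df['production_index'].str.extract(r'(?P<value>\d+\.?\d?)\s*(?P<flag>[a-z]*)')
parts['value'] = pd.to_numeric(parts['value'])  # Convert the extracted number strings to floats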

Deleting duplicate information

Last but not least, I can replace the old production_index string column with the new float column and then get rid of the duplicate column with the drop() method:

df['production_index'] = df['production_index_float']
df.drop('production_index_float', axis=1, inplace=True)
print(df.loc[(slice('2016-09','2017-08'),'AT'),:])
print(df.info())
                         production_index flags
time       country_code                        
2016-09-01 AT                       113.7      
2016-10-01 AT                       114.7      
2016-11-01 AT                       116.4      
2016-12-01 AT                       115.6      
2017-01-01 AT                       113.2      
2017-02-01 AT                       116.5      
2017-03-01 AT                       117.5      
2017-04-01 AT                       117.6      
2017-05-01 AT                       117.3      
2017-06-01 AT                       118.2      
2017-07-01 AT                       119.9     p
2017-08-01 AT                         NaN      

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 27936 entries, (1953-01-01 00:00:00, AT) to (2017-08-01 00:00:00, UK)
Data columns (total 2 columns):
production_index    8744 non-null float64
flags               27936 non-null category
dtypes: category(1), float64(1)
memory usage: 355.2+ KB
None

This is the clean dataframe with the fixed variables time (datetimes) and country_code (strings) as the MultiIndex, and the measured variables production_index (floats) and flags (categorical) as data columns. Note that only one third of the production index values are non-null.
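With the index values stored as floats, standard pandas operations now work on them directly. As a quick sketch of what is now possible (results not shown here):

# Mean production index per country over the whole period; NaN values are skipped
print(df.groupby(level='country_code')['production_index'].mean())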

Conclusion

I have inspected the production index column of the EU industry production dataset and learned about its values: The string values contain the actual index as a floating-point number and sometimes an additional flag letter. Missing values are denoted by single colons. With this knowledge, I have employed the power of regular expressions to split numbers and flags and to identify missing values. I converted the numbers into floats, the missing values into “NaN” values and the flags into categories.

The cleaning process requires meta information about the dataset to make sure that the values are interpreted correctly. Unfortunately, in the EU Open Data Portal this information is not provided in a clear fashion but has to be searched for, which can be time-consuming.

Code

The project code was written using Jupyter Notebook 5.0.0, running the Python 3.6.1 kernel and Anaconda 4.4.0.

The Jupyter notebook can be found on Github.

Bio

I am a data scientist with a background in solar physics and long experience of turning complex data into valuable insights. Coming originally from Matlab, I now use the Python stack to solve problems.

Contact details

Jan Langfellner
contact@jan-langfellner.de
linkedin.com/in/jan-langfellner/
