I make EU industry production index values, which I previously put in a tidy form, ready for analysis by splitting numbers and flag values with pandas methods.
Now that the EU industry production dataset has a tidy dataframe structure, I clean up the production index values. The values are stored as strings and may contain additional flag letters; missing values are denoted with colons. Information about the meaning of the flags can be drawn from the online documentation at the original data source. I identify the flags with the help of regular expressions and the pandas “re” package and store the values as categorical data in a new column with the pandas “apply” method. With another set of regular expressions, I extract the actual numbers from the production index column and convert them to floating-point values; the missing data are stored as NaN values. After this treatment, the production index values are ready for an actual data analysis.
Table of contents
- 1 TL;DR
- 2 Long Description
- 3 Project Background
- 4 Inspecting the dataframe
- 5 Splitting off the flag values
- 6 Extracting numbers and missing data
- 7 Deleting duplicate information
- 8 Conclusion
- 9 Code
- 10 Bio
- 11 Contact details
In the previous project, I dealt with the structure of the EU industry production dataframe, i.e. the rows and columns. Here, the focus is on cleaning the actual production index values. This is necessary to allow for an easy analysis of the dataset using pandas methods and functions from Python visualization packages (e.g., matplotlib, Seaborn).
Inspecting the dataframe
Tidy dataframe structure
I start with the dataframe in the tidy shape from the previous project. As a reminder, this is how the dataframe looks like:
<class 'pandas.core.frame.DataFrame'> MultiIndex: 27936 entries, (1953-01-01 00:00:00, AT) to (2017-08-01 00:00:00, UK) Data columns (total 1 columns): production_index 27936 non-null object dtypes: object(1) memory usage: 306.5+ KB production_index time country_code 1953-01-01 AT : BA : BE : BG : CY :
In the tidy form, each row is an observation and each column a variable.
Production index values
production_index column consists of string (
non-null object) values. In the head, there are just colons though.
To get more useful insights into the possible values, let’s have a look at the most recent production index values for Austria (country code ‘AT’) – say the last year. Because I want to specify a range in the time component, I have to apply
slice() with the beginning and the end (inclusive) of the period inside the
production_index time country_code 2016-09-01 AT 113.7 2016-10-01 AT 114.7 2016-11-01 AT 116.4 2016-12-01 AT 115.6 2017-01-01 AT 113.2 2017-02-01 AT 116.5 2017-03-01 AT 117.5 2017-04-01 AT 117.6 2017-05-01 AT 117.3 2017-06-01 AT 118.2 2017-07-01 AT 119.9 p 2017-08-01 AT :
In this selection, most values are floating-point numbers, as one would expect of an index. The penultimate value, however, contains the letter “p”. What does that mean?
A quick look at the interactive online table version of the dataset (notably, this info is not in the EU ODP metadata, but at least available on the Eurostat webpage) reveals the nature of these letters:
b break in time series
d definition differs, see metadata
i see metadata (phased out)
n not significant
s Eurostat estimate (phased out)
u low reliability
z not applicable
Thus the “p” marks a provisional value, in contrast to the confirmed values without this flag.
The last value in the selection, a colon, stands for a missing value:
: not available
Splitting off the flag values
To clean up the production index values, the flags should be identified and moved to their own column (so that I don’t lose the information). The remaining numbers should be converted into floats, and the colons replaced by “NaN” values.
Defining a pattern with regular expressions
I start with extracting the flags from the production index values. I do this using regular expressions and the Python package “re”. The string values with flags, e.g.,
'119.9 p', follow a particular pattern that can be translated into a regular expression (see expression in brackets): one or more digits (
'\d+'), maybe followed by a decimal point (
'.?') and another digit (
'\d?'), zero or more spaces (
'\s*') and a group of flags as lowercase letters (
'([a-z]+)'). Concatenating all these strings yields the pattern string, which has to be passed to the
re.compile() method to define a pattern that can be understood by the “re” package:
import re # Package for regular expressions flag_pattern = re.compile('\d+.?\d?\s*([a-z]+)') # Define pattern to match
Checking production index values for the pattern
Now I define a function that checks if a string matches
flag_pattern and returns the matching flags:
def get_flags(string_value): '''Returns lower-case letter flags (as string) from a production index string value''' match = flag_pattern.match(string_value) if bool(match): return match.group(1) # Returns first matching group else: return '' # Else return empty string
Applying the pattern matching to the dataframe
I need to pass this function to the
apply() method of the dataframe in order to apply it to all rows and store the results in a new
df['flags'] = df['production_index'].apply(get_flags) print(df.loc[(slice('2016-09','2017-08'),'AT'),:])
production_index flags time country_code 2016-09-01 AT 113.7 2016-10-01 AT 114.7 2016-11-01 AT 116.4 2016-12-01 AT 115.6 2017-01-01 AT 113.2 2017-02-01 AT 116.5 2017-03-01 AT 117.5 2017-04-01 AT 117.6 2017-05-01 AT 117.3 2017-06-01 AT 118.2 2017-07-01 AT 119.9 p p 2017-08-01 AT :
How many different flag combinations are there? To find out, let’s use the
['' 's' 'p']
So there are three different flag combinations:
'': no flags
's': Eurostat estimate (phased out)
These strings mark distinct categories.
It is thus natural to convert the flag strings into categorical values. I can then look at the
categories attribute to confirm the number of categories after the conversion:
df['flags'] = df['flags'].astype('category') print(df['flags'].cat.categories)
Index(['', 'p', 's'], dtype='object')
Extracting numbers and missing data
More regular expressions
As a next step, I extract the actual numbers from the value strings and replace the colon with a NaN value. I can do this again using regular expressions and a extraction/conversion function:
import numpy as np # Needed for NaN values number_pattern = re.compile('(\d+.?\d?)\s*[a-z]*') # Number group followed by zero or more flags nan_pattern = re.compile(':') # Missing value def get_number(string_value): '''Returns production index value (as float) from a production index string value''' if bool(nan_pattern.match(string_value)): return np.nan else: match = number_pattern.match(string_value) if bool(match): return float(match.group(1)) # Returns first (and only) matching group else: print(string_value) # Or raises error if there is no match raise ValueError('Production index value string does not match number pattern')
Applying the extraction function
I apply the
get_number() function to the
df['production_index_float'] = df['production_index'].apply(get_number) print(df.loc[(slice('2016-09','2017-08'),'AT'),:])
production_index flags production_index_float time country_code 2016-09-01 AT 113.7 113.7 2016-10-01 AT 114.7 114.7 2016-11-01 AT 116.4 116.4 2016-12-01 AT 115.6 115.6 2017-01-01 AT 113.2 113.2 2017-02-01 AT 116.5 116.5 2017-03-01 AT 117.5 117.5 2017-04-01 AT 117.6 117.6 2017-05-01 AT 117.3 117.3 2017-06-01 AT 118.2 118.2 2017-07-01 AT 119.9 p p 119.9 2017-08-01 AT : NaN
Deleting duplicate information
Last but not least, I can replace the old
production_index string column with the new float column and then get rid of the duplicate column with the
df['production_index'] = df['production_index_float'] df.drop('production_index_float', axis=1, inplace=True) print(df.loc[(slice('2016-09','2017-08'),'AT'),:]) print(df.info())
production_index flags time country_code 2016-09-01 AT 113.7 2016-10-01 AT 114.7 2016-11-01 AT 116.4 2016-12-01 AT 115.6 2017-01-01 AT 113.2 2017-02-01 AT 116.5 2017-03-01 AT 117.5 2017-04-01 AT 117.6 2017-05-01 AT 117.3 2017-06-01 AT 118.2 2017-07-01 AT 119.9 p 2017-08-01 AT NaN <class 'pandas.core.frame.DataFrame'> MultiIndex: 27936 entries, (1953-01-01 00:00:00, AT) to (2017-08-01 00:00:00, UK) Data columns (total 2 columns): production_index 8744 non-null float64 flags 27936 non-null category dtypes: category(1), float64(1) memory usage: 355.2+ KB None
This is the clean dataframe with the fixed variables
time (datetimes) and
country_code (strings) as the MultiIndex, and the measured variables
production_index (floats) and
flags (categorical) as data columns. Note that only one third of the production index values are non-null.
I have inspected the production index column of the EU industry production dataset and learned about its values: The string values contain the actual index as a floating-point number and sometimes an additional flag letter. Missing values are denoted by single colons. With this knowledge, I have employed the power of regular expressions to split numbers and flags and to identify missing values. I converted the numbers into floats, the missing values into “NaN” values and the flags into categories.
The cleaning process requires meta information about the dataset to make sure that the values are interpreted correctly. Unfortunately, in the EU Open Data Portal this information is not provided in a clear fashion but has to be searched for, which can be time-consuming.
The project code was written using Jupyter Notebook 5.0.0, running the Python 3.6.1 kernel and Anaconda 4.4.0.
The Jupyter notebook can be found on Github.
I am a data scientist with a background in solar physics, with a long experience of turning complex data into valuable insights. Originally coming from Matlab, I now use the Python stack to solve problems.