Reducing complexity – from a time series to a single number: modeling

TL;DR

I select a linear model with slope and intercept parameters to describe the growth dynamics of the EU industry production index of each country.

Long Description

Inspired by line plots of the EU industry production index time series that were previously normalized by the EU average time series, I choose to model the individual countries’ time series as linear in the available period covering the years 2000–2017. Concretely, I will use the linear model from the scikit-learn package and prepare the data accordingly by transforming the time axis. As a rough means to evaluate the robustness of the slope parameter, I decide to compare it to the difference of the last and first data points divided by the number of years.

Project Background

In the previous project, I divided the EU industry production index time series for each country by the EU average to obtain a set of time series that are largely independent of the common development.

I now want to distill single numbers from the time series that capture the growth behavior for each country and allow an easy comparison. For that purpose, I am looking for a parameterized model that describes the long-term trends of the time series.

Choosing a model

Inspecting the data

I first read in the normalized EU industry production index time series:

import pandas as pd
df = pd.read_pickle('EU_industry_production_dataframe_normalized_2000-2017.pkl')
df.info()
print(df.head())
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 7632 entries, (2000-01-01 00:00:00, AT) to (2017-08-01 00:00:00, UK)
Data columns (total 3 columns):
production_index         7412 non-null float64
flags                    7632 non-null category
production_index_norm    7404 non-null float64
dtypes: category(1), float64(2)
memory usage: 155.5+ KB

                         production_index flags  production_index_norm
time       country_code                                               
2000-01-01 AT                        73.0                     0.763119
           BA                         NaN                          NaN
           BE                        63.1                     0.659628
           BG                        61.1                     0.638720
           CY                       103.0                     1.076730

To get a better view of the individual time series, I split the 36 curves into two plots, selecting first the first half and then the second half of the country columns with the iloc indexer:

import matplotlib.pyplot as plt
import numpy as np

################# First figure
fig, ax = plt.subplots()        # Create figure and axes objects

# Reset line properties to avoid ambiguity between different lines:
ax.set_prop_cycle(color=plt.cm.nipy_spectral(np.linspace(0,1,36)),linestyle=['-','--',':']*12)

# Create the plot:
ax = df['production_index_norm'].unstack(level=1).iloc[:,0:18].plot(ax=ax)
plt.xlabel('Year')
plt.ylabel('Production index [relative to EU28 avg.]')
ax.legend(ncol=6, fontsize=7)   # Adjust shape (six columns instead of one) and font size of legend
plt.show()

################# Second figure
fig, ax = plt.subplots()        # Create figure and axes objects

# Reset line properties to avoid ambiguity between different lines:
ax.set_prop_cycle(color=plt.cm.nipy_spectral(np.linspace(0.5,1,18)),linestyle=['-','--',':']*6)

# Create the plot:
ax = df['production_index_norm'].unstack(level=1).iloc[:,18:].plot(ax=ax)
plt.xlabel('Year')
plt.ylabel('Production index [relative to EU28 avg.]')
ax.legend(ncol=6, fontsize=7)   # Adjust shape (six columns instead of one) and font size of legend
plt.show()



Though there are a few curves that show more complex long-term dynamics (e.g., Cyprus, CY), the vast majority can be described as roughly linear plus some short- to mid-scale fluctuations.

Linear approximation

This motivates a linear approximation of the time series. In such a simple model, y = m * x + b, the production index (y) depends on time (x) and on two parameters: m is the slope of the line and thus the measure of growth I am looking for; b is the production index value at time zero (the intercept).

Time zero is a bit impractical, though, when the calendar year is used to express time. A more natural approach is to count years from the reference point of the dataset, that is, the year 2010.
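As a quick illustration of this transformation (a minimal sketch; the actual conversion is part of the data preparation below), the calendar date January 2000 maps to roughly -10 years:

import pandas as pd

# Subtract the reference date and divide by an average year length of 365.25 days:
offset = (pd.Timestamp('2000-01-01') - pd.Timestamp('2010-01-01')) / pd.Timedelta(365.25, unit='d')
print(offset)   # approximately -10.0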

How do I obtain a linear approximation of the time series? I want to find the values of m and b that minimize the sum of squared distances between the line and the actual production index at each time (the sum of squared residuals). This procedure is called linear regression.

Since y depends on only one variable (x) in this case, I could use NumPy's polyfit() function for this task.
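For reference, that route would look roughly as follows (a sketch on toy data, not the real dataset; x and y stand in for the transformed time axis and one country's production index values, with missing values already removed):

import numpy as np

# Toy example data: a noisy line with slope 0.02 and intercept 1.0
x = np.arange(-10.0, 8.0, 0.5)
y = 1.0 + 0.02 * x + 0.01 * np.random.randn(x.size)

# Fit a first-degree polynomial y = m * x + b; polyfit returns the
# coefficients in order of decreasing degree:
m, b = np.polyfit(x, y, deg=1)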

In the spirit of a more general approach that would also work for more complex datasets (y depending on many variables), I opt for the linear regression method in the scikit-learn package instead.
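A minimal sketch of that approach for a single country (Austria, AT, picked only as an example), assuming the prepared DataFrame df_forregression from the data preparation section below:

from sklearn.linear_model import LinearRegression

# Drop missing values and bring the data into the shape scikit-learn expects:
series = df_forregression['AT'].dropna()
X = series.index.values.reshape(-1, 1)   # time in years relative to 2010, as a 2D array
y = series.values                        # normalized production index

model = LinearRegression()
model.fit(X, y)
print(model.coef_[0], model.intercept_)  # slope m and intercept b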

Model evaluation

The quality of the model could be evaluated in different ways. One measure is the sum of squared residuals, which acts as a cost function. It could serve as the basis for a test of whether the time series is linear plus random noise. Since I already know from the plots that there are also more complex mid-scale features, the benefits of such a test are limited, though.
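If needed, the residual sum could be computed directly from a fitted model, e.g. along these lines (a sketch reusing the X, y, and model objects from the scikit-learn example above):

import numpy as np

# Sum of squared residuals between the fitted line and the data:
residuals = y - model.predict(X)
ssr = np.sum(residuals**2)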

The model’s predictive power, or its consistency over time, could be assessed by splitting the dataset into two (or more) parts. However, the time series are quite short for that.

Instead, I will use a simple robustness check of the slopes from the linear regression: I will also estimate m from the difference of the production index between the last and first data points in the time series. Dividing this difference by the number of years in between should give an alternative measure of m.
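A sketch of this check for one country (again Austria, AT, as an example), assuming the prepared df_forregression DataFrame from the data preparation below:

# Difference between the last and first available data points, divided by
# the time span in years (the index already counts years relative to 2010):
series = df_forregression['AT'].dropna()
m_alt = (series.iloc[-1] - series.iloc[0]) / (series.index[-1] - series.index[0])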

Data preparation

Before performing the linear regression in the next project, I transform the time axis and unstack the “country_code” index level:

# Unstack country_code index into columns:
df_forregression = df['production_index_norm'].unstack(level=1).copy()

# Reset DatetimeIndex to FloatIndex (counting years from 2010):
df_forregression.set_index((df_forregression.index - pd.Timestamp('2010-01-01')) / pd.Timedelta(365.25, unit='d'), inplace=True)
print(df_forregression.info())
print(df_forregression.head())


Float64Index: 212 entries, -10.0013689254 to 7.58110882957
Data columns (total 36 columns):
AT 211 non-null float64
BA 139 non-null float64
BE 211 non-null float64
BG 211 non-null float64
CY 211 non-null float64
CZ 211 non-null float64
DE 211 non-null float64
DK 211 non-null float64
EA19 211 non-null float64
EE 211 non-null float64
EL 211 non-null float64
ES 211 non-null float64
EU28 211 non-null float64
FI 211 non-null float64
FR 211 non-null float64
HR 211 non-null float64
HU 211 non-null float64
IE 211 non-null float64
IT 211 non-null float64
LT 211 non-null float64
LU 211 non-null float64
LV 211 non-null float64
ME 91 non-null float64
MK 211 non-null float64
MT 211 non-null float64
NL 211 non-null float64
NO 211 non-null float64
PL 211 non-null float64
PT 211 non-null float64
RO 211 non-null float64
RS 211 non-null float64
SE 211 non-null float64
SI 211 non-null float64
SK 211 non-null float64
TR 211 non-null float64
UK 211 non-null float64
dtypes: float64(36)
memory usage: 61.3 KB
None

country_code AT BA BE BG CY CZ DE \
time
-10.001369 0.763119 NaN 0.659628 0.638720 1.076730 0.609450 0.882291
-9.916496 0.804837 NaN 0.675563 0.650542 1.094662 0.628649 0.900751
-9.837098 0.799001 NaN 0.682480 0.650229 1.071577 0.636704 0.900957
-9.752225 0.806887 NaN 0.703174 0.646132 1.076540 0.650280 0.909562
-9.670089 0.819436 NaN 0.705168 0.643401 1.054149 0.660902 0.923410

country_code DK EA19 EE … NO PL \
time …
-10.001369 1.108091 1.015053 0.577044 … 0.993101 0.510140
-9.916496 1.097790 1.027940 0.606756 … 0.988324 0.521268
-9.837098 1.114232 1.029963 0.587807 … 0.989388 0.516022
-9.752225 1.093134 1.036092 0.602572 … 0.984236 0.533084
-9.670089 1.179741 1.043854 0.593988 … 0.957381 0.542516

country_code PT RO RS SE SI SK \
time
-10.001369 1.266987 0.752666 0.963830 0.950240 0.822705 0.461008
-9.916496 1.256255 0.753753 1.083194 0.987281 0.845496 0.473311
-9.837098 1.190179 0.771952 1.087183 0.995630 0.857262 0.487932
-9.752225 1.176105 0.781995 1.097283 1.015350 0.854594 0.490562
-9.670089 1.220918 0.779288 1.089150 1.019148 0.866790 0.456043

country_code TR UK
time
-10.001369 0.614677 1.116454
-9.916496 0.631776 1.129066
-9.837098 0.616937 1.125676
-9.752225 0.643020 1.116988
-9.670089 0.644431 1.113856

[5 rows x 36 columns]

Subtracting one datetime object from another (here, a Timestamp from the DatetimeIndex) produces a Timedelta (here, a TimedeltaIndex) with the unit “days”. Note that Timedelta objects cannot be exactly converted into years or months, because these quantities vary in length (e.g., leap years).

For my basic analysis, such details are negligible, however, and so I divide by 365.25 days to obtain a Float64Index with values that can be interpreted as years: the range runs from -10 (10 years before 2010) to about 7.6 (7.6 years after 2010).

Conclusion

I have studied the EU industry production index time series for each country and selected a linear model that captures the growth dynamics in the dataset with its slope parameter. In preparation for the linear regression procedure, I have transformed the time axis from calendar dates to a year count relative to the reference year 2010. To provide a simple means of evaluation, I will also estimate the slope from the difference between the last and the first data points.

In Python, there are often different ways to achieve the same goal. The linear regression can be performed using, among others, the numpy polynomial fit method or the linear model from the scikit-learn package.

Code

The project code was written using Jupyter Notebook 5.0.0, running the Python 3.6.3 kernel and Anaconda 5.0.1.

The Jupyter notebook can be found on Github.

Bio

I am a data scientist with a background in solar physics and long experience of turning complex data into valuable insights. Originally coming from Matlab, I now use the Python stack to solve problems.

Contact details

Jan Langfellner
contact@jan-langfellner.de
linkedin.com/in/jan-langfellner/
