Database crunchers API

Database crunchers.

The classes within this module can be used to crunch a database of scenarios. Each ‘Cruncher’ has methods which return functions which can then be used to infill emissions detail (i.e. calculate ‘follower’ timeseries) based on ‘lead’ emissions timeseries.

Closest RMS cruncher API

Module for the database cruncher which uses the ‘closest RMS’ technique.

class silicone.database_crunchers.rms_closest.RMSClosest(db)[source]

Bases: _DatabaseCruncher

Database cruncher which uses the ‘closest RMS’ technique.

This cruncher derives the relationship between two or more variables by finding the scenario which has the most similar timeseries for the lead gases in the database. The follower gas timeseries is then simply copied from the closest scenario.

Here, ‘most similar’ is defined as the smallest time-averaged root mean squared (L2) difference. If multiple lead values are used, they may be weighted differently to account for differences between the reported units. The most similar model/scenario combination minimises

\[RMS error = \sum_l w_l \left ( \frac{1}{n} \sum_{t=0}^n (E_l(t) - e_l(t))^2 \right )^{1/2}\]

where \(l\) is a lead gas, \(w_l\) is a weighting for that lead gas, \(n\) is the total number of timesteps in all lead gas timeseries, \(E_l(t)\) is the lead gas emissions timeseries and \(e_l(t)\) is a lead gas emissions timeseries in the infiller database.
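
As an illustration of the formula above, here is a minimal pure-Python sketch of the closeness measure. It is not the silicone implementation: the helper names `weighted_rms_distance` and `closest_scenario`, the plain-dictionary data layout and the numbers are all assumptions made for the example, and the mean is taken over the timesteps of each lead variable separately.

```python
import math


def weighted_rms_distance(target, candidate, weights):
    """Sum over lead variables of weight * RMS difference between two timeseries.

    `target` and `candidate` map variable names to equal-length lists of values
    (a hypothetical layout chosen for this sketch).
    """
    total = 0.0
    for variable, weight in weights.items():
        squared = [(a - b) ** 2 for a, b in zip(target[variable], candidate[variable])]
        total += weight * math.sqrt(sum(squared) / len(squared))
    return total


def closest_scenario(target, database, weights):
    """Return the (model, scenario) key whose lead timeseries is closest to target."""
    return min(
        database,
        key=lambda key: weighted_rms_distance(target, database[key], weights),
    )


# Hypothetical lead data for the scenario to infill and a two-scenario database.
target = {"Emissions|CO2": [10.0, 8.0, 6.0]}
database = {
    ("model_a", "scen_1"): {"Emissions|CO2": [10.0, 9.0, 8.0]},
    ("model_a", "scen_2"): {"Emissions|CO2": [11.0, 8.0, 5.0]},
}
weights = {"Emissions|CO2": 1.0}

best = closest_scenario(target, database, weights)
```

In this sketch the follower timeseries would then be copied from the scenario identified by `best`.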

derive_relationship(variable_follower, variable_leaders, weighting=None)[source]

Derive the relationship between the lead and the follow variables from the database.

Parameters
  • variable_follower (str) – The variable for which we want to calculate timeseries (e.g. "Emissions|C5F12").

  • variable_leaders (list[str]) – The variable we want to use in order to infer timeseries of variable_follower (e.g. ["Emissions|CO2"]). This may contain multiple elements.

  • weighting (dict{str: float}) – When used with multiple lead variables, this weighting factor controls the relative importance of different variables for determining closeness. E.g. if wanting to compare both CO2 and CH4 emissions reported in mass units but weighted by the AR5 GWP100 metric, this would be {“Emissions|CO2”: 1, “Emissions|CH4”: 28}.

Returns

Function which takes a pyam.IamDataFrame containing variable_leaders timeseries and returns timeseries for variable_follower based on the derived relationship between the two. Please see the source code for the exact definition (and docstring) of the returned function.

Return type

func

Raises
  • ValueError – variable_leaders contains more than one variable.

  • ValueError – There is no data for variable_leaders or variable_follower in the database.

Constant ratio cruncher API

Module for the database cruncher which uses the ‘constant given ratio’ technique.

class silicone.database_crunchers.constant_ratio.ConstantRatio(db=None)[source]

Bases: _DatabaseCruncher

Database cruncher which uses the ‘constant given ratio’ technique.

This cruncher does not require a database upon initialisation. Instead, it requires a constant and a unit to be input when deriving relations. This constant, \(s\), is the ratio of the follower variable to the lead variable i.e.:

\[E_f(t) = s * E_l(t)\]

where \(E_f(t)\) is emissions of the follower variable and \(E_l(t)\) is emissions of the lead variable.

derive_relationship(variable_follower, variable_leaders, ratio, units)[source]

Derive the relationship between two variables from the database.

Parameters
  • variable_follower (str) – The variable for which we want to calculate timeseries (e.g. "Emissions|C5F12").

  • variable_leaders (list[str]) – The variable we want to use in order to infer timeseries of variable_follower (e.g. ["Emissions|CO2"]).

  • ratio (float) – The ratio of the follower data to the leader data.

  • units (str) – The units of the follower data.

Returns

Function which takes a pyam.IamDataFrame containing variable_leaders timeseries and returns timeseries for variable_follower based on the derived relationship between the two.

Return type

func

Equal quantile walk cruncher API

Module for the database cruncher which uses the ‘equal quantile walk’ technique.

class silicone.database_crunchers.equal_quantile_walk.EqualQuantileWalk(db)[source]

Bases: _DatabaseCruncher

Database cruncher which uses the ‘equal quantile walk’ technique.

This cruncher assumes that the amount of effort going into reducing one emission set is equal to that for another emission, so the lead and follow data should be at the same quantile of all pathways in the infiller database. It calculates the quantile of the lead infillee data in the lead infiller database, then outputs that quantile of the follow data in the infiller database.
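
The idea can be sketched for a single timestep in plain Python. This is an illustrative sketch only, not the silicone code: the function names, the linear-interpolation definition of a quantile and the sample numbers are assumptions, and it ignores the optional smoothing and weighting.

```python
def quantile_of_value(sorted_values, value):
    """Quantile (0 to 1) of `value` within `sorted_values`, clamped at the ends."""
    n = len(sorted_values)
    if value <= sorted_values[0]:
        return 0.0
    if value >= sorted_values[-1]:
        return 1.0
    for i in range(1, n):
        lo, hi = sorted_values[i - 1], sorted_values[i]
        if value <= hi:
            # Linear interpolation between neighbouring database points.
            return (i - 1 + (value - lo) / (hi - lo)) / (n - 1)


def value_at_quantile(sorted_values, q):
    """Value at quantile `q` of `sorted_values`, using linear interpolation."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = position - lower
    return sorted_values[lower] + frac * (sorted_values[upper] - sorted_values[lower])


def equal_quantile_walk(lead_db, follow_db, lead_value):
    """Follower value at the same database quantile as the infillee lead value."""
    q = quantile_of_value(sorted(lead_db), lead_value)
    return value_at_quantile(sorted(follow_db), q)


# Hypothetical database values at one timestep (order is irrelevant).
lead_db = [10.0, 20.0, 30.0, 40.0, 50.0]
follow_db = [5.0, 1.0, 4.0, 2.0, 3.0]

result = equal_quantile_walk(lead_db, follow_db, 25.0)
```

Here a lead value of 25 sits at quantile 0.375 of the lead database, so the sketch returns the 0.375 quantile of the follow database.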

derive_relationship(variable_follower, variable_leaders, smoothing=None, weighting=None)[source]

Derive the relationship between two variables from the database.

Parameters
  • variable_follower (str) – The variable for which we want to calculate timeseries (e.g. "Emissions|C5F12").

  • variable_leaders (list[str]) – The variable we want to use in order to infer timeseries of variable_follower (e.g. ["Emissions|CO2"]).

  • smoothing (float or string) – By default, no smoothing is done on the distribution. If a value is provided, it is fed into scipy.stats.gaussian_kde() - see full documentation there. In short, if a float is input, we fit a Gaussian kernel density estimator with that width to the points. If a string is used, it must be either “scott” or “silverman”, after those two methods of determining the best kernel bandwidth.

  • weighting (Dict{(str, str) : float}) – The dictionary, mapping the (model, scenario) tuple onto the weight (relative to a weight of 1 for the default). This does not have to include all scenarios in df, but cannot include scenarios not in df.

Returns

Function which takes a pyam.IamDataFrame containing variable_leaders timeseries and returns timeseries for variable_follower based on the derived relationship between the two. Please see the source code for the exact definition (and docstring) of the returned function.

Return type

func

Raises
  • ValueError – variable_leaders contains more than one variable.

  • ValueError – There is no data for variable_leaders or variable_follower in the database.

Interpolate specified scenarios and models cruncher API

class silicone.database_crunchers.interpolate_specified_scenarios_and_models.ScenarioAndModelSpecificInterpolate(db)[source]

Bases: _DatabaseCruncher

Database cruncher which pre-filters to only use data from specific scenarios, then runs the interpolation cruncher to return values from that set of scenarios. See the documentation of Interpolation for more details.

derive_relationship(variable_follower, variable_leaders, required_scenario='*', required_model='*', interpkind='linear')[source]

Derive the relationship between two variables from the database.

Parameters
  • variable_follower (str) – The variable for which we want to calculate timeseries (e.g. "Emissions|CH4").

  • variable_leaders (list[str]) – The variable(s) we want to use in order to infer timeseries of variable_follower (e.g. ["Emissions|CO2"]).

  • required_scenario (str or list[str]) – The string(s) which all relevant scenarios are required to match. This may have *s to represent wild cards. It defaults to “*” to accept all scenarios.

  • required_model (str or list[str]) – The string(s) which all relevant models are required to match. This may have *s to represent wild cards. It defaults to “*” to accept all models.

  • interpkind (str) – The style of interpolation. By default, linear, but can also be any value accepted as the “kind” option in scipy.interpolate.interp1d, or “PchipInterpolator”, in which case scipy.interpolate.PchipInterpolator is used. Care should be taken if using non-default interp1d options, as they are either uneven or have “ringing”.

Returns

Function which takes a pyam.IamDataFrame containing variable_leaders timeseries and returns timeseries for variable_follower based on the derived relationship between the two. Please see the source code for the exact definition (and docstring) of the returned function.

Return type

func

Raises

ValueError – There is no data of the appropriate type in the database. There may be a typo in the SSP option.

Latest Time Ratio API

Module for the database cruncher which uses the ‘latest time ratio’ technique.

class silicone.database_crunchers.latest_time_ratio.LatestTimeRatio(db)[source]

Bases: _DatabaseCruncher

Database cruncher which uses the ‘latest time ratio’ technique.

This cruncher derives the relationship between two variables by simply assuming that the follower timeseries is equal to the lead timeseries multiplied by a scaling factor. The scaling factor is derived by calculating the ratio of the follower variable to the lead variable in the latest year in which the follower variable is available in the database. Additionally, since the derived relationship only depends on a single point in the database, no regressions or other calculations are performed.

Once the relationship is derived, the ‘filler’ function will infill following:

\[E_f(t) = R * E_l(t)\]

where \(E_f(t)\) is emissions of the follower variable and \(E_l(t)\) is emissions of the lead variable, both in the infillee database.

\(R\) is the scaling factor, calculated as

\[R = \frac{ E_f(t_{\text{last}}) }{ e_l(t_{\text{last}}) }\]

where \(t_{\text{last}}\) is the latest time at which the follower gas appears in the database (the follower values at that time are averaged), and the lower case \(e\) represents the infiller database.

derive_relationship(variable_follower, variable_leaders)[source]

Derive the relationship between two variables from the database.

Parameters
  • variable_follower (str) – The variable for which we want to calculate timeseries (e.g. "Emissions|C5F12").

  • variable_leaders (list[str]) – The variable we want to use in order to infer timeseries of variable_follower (e.g. ["Emissions|CO2"]). Note that the ‘latest time ratio’ methodology gives the same result, independent of the value of variable_leaders in the database.

Returns

Function which takes a pyam.IamDataFrame containing variable_leaders timeseries and returns timeseries for variable_follower based on the derived relationship between the two. Please see the source code for the exact definition (and docstring) of the returned function.

Return type

func

Raises
  • ValueError – variable_leaders contains more than one variable.

  • ValueError – There is no data for variable_leaders or variable_follower in the database.

Linear interpolation cruncher API

Module for the database cruncher which makes a linear interpolator between known values.

class silicone.database_crunchers.linear_interpolation.LinearInterpolation(db)[source]

Bases: Interpolation

Database cruncher which uses linear interpolation. This cruncher is deprecated; use Interpolation instead.

This cruncher derives the relationship between two variables by linearly interpolating between values in the cruncher database. It does not do any smoothing and is best-suited for smaller databases.

In the case where there is more than one value of the follower variable for a given value of the leader variable, the average will be used. For example, if one scenario has CH4 emissions of 10 MtCH4/yr in 2020, another has CH4 emissions of 20 MtCH4/yr, and both have CO2 emissions of exactly 15 GtC/yr in 2020, the interpolation will use the average value from the two scenarios, i.e. 15 MtCH4/yr.

Beyond the bounds of input data, the interpolation is held constant. For example, if the maximum CO2 emissions in 2020 in the database is 25 GtC/yr, and CH4 emissions for this level of CO2 emissions are 15 MtCH4/yr, then even if we infill using a CO2 emissions value of 100 GtC/yr in 2020, the returned CH4 emissions will be 15 MtCH4/yr.
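
The behaviour described above (averaging duplicate lead values, interpolating linearly, and holding the result constant outside the data range) can be sketched in plain Python for a single timestep. The helper name `build_interpolator` and the sample values are assumptions for illustration; this is not the cruncher's actual code.

```python
from collections import defaultdict


def build_interpolator(lead_values, follow_values):
    """Return a function mapping a lead value to an interpolated follower value."""
    grouped = defaultdict(list)
    for lead, follow in zip(lead_values, follow_values):
        grouped[lead].append(follow)

    # Average the follower values that share the same lead value.
    xs = sorted(grouped)
    ys = [sum(grouped[x]) / len(grouped[x]) for x in xs]

    def interpolate(x):
        # Hold the relationship constant beyond the bounds of the data.
        if x <= xs[0]:
            return ys[0]
        if x >= xs[-1]:
            return ys[-1]
        for i in range(1, len(xs)):
            if x <= xs[i]:
                frac = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
                return ys[i - 1] + frac * (ys[i] - ys[i - 1])

    return interpolate


# Hypothetical 2020 database values: CO2 in GtC/yr, CH4 in MtCH4/yr.
co2 = [5.0, 15.0, 15.0, 25.0]
ch4 = [5.0, 10.0, 20.0, 15.0]

infill_ch4 = build_interpolator(co2, ch4)
```

With these values, `infill_ch4(15.0)` uses the average of the two scenarios that share that CO2 value, and `infill_ch4(100.0)` returns the value found at the largest CO2 emissions in the database.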

derive_relationship(variable_follower, variable_leaders)[source]

Derive the relationship between two variables from the database.

Parameters
  • variable_follower (str) – The variable for which we want to calculate timeseries (e.g. "Emissions|CH4").

  • variable_leaders (list[str]) – The variable(s) we want to use in order to infer timeseries of variable_follower (e.g. ["Emissions|CO2"]).

  • interpkind (str) – The style of interpolation. By default, linear (hence the name), but can also be any value accepted as the “kind” option in scipy.interpolate.interp1d.

Returns

Function which takes a pyam.IamDataFrame containing variable_leaders timeseries and returns timeseries for variable_follower based on the derived relationship between the two. Please see the source code for the exact definition (and docstring) of the returned function.

Return type

func

Raises

ValueError – There is no data of the appropriate type in the database.

Quantile rolling windows cruncher API

Module for the database cruncher which uses the ‘rolling windows’ technique.

class silicone.database_crunchers.quantile_rolling_windows.QuantileRollingWindows(db)[source]

Bases: _DatabaseCruncher

Database cruncher which uses the ‘rolling windows’ technique.

This cruncher derives the relationship between two variables by performing quantile calculations between the follower timeseries and the lead timeseries. These calculations are performed at each timestep in the timeseries, independent of the other timesteps.

For each timestep, the lead timeseries axis is divided into multiple evenly spaced windows (to date this is only tested on 1:1 relationships but may work with more than one lead timeseries). In each window, every data point in the database is included. However, the data points receive a weight given by

\[w(x, x_{\text{window}}) = \frac{1}{1 + (d_n)^2}\]

where \(w\) is the weight and \(d_n\) is the normalised distance between the centre of the window and the data point’s position on the lead timeseries axis.

\(d_n\) is calculated as

\[d_n = \frac{x - x_{\text{window}}}{f \times (\frac{b}{2})}\]

where \(x\) is the position of the data point on the lead timeseries axis, \(x_{\text{window}}\) is the position of the centre of the window on the lead timeseries axis, \(b\) is the distance between window centres and \(f\) is a decay factor which controls how much less weight points away from \(x_{\text{window}}\) receive. If \(f=1\) then a point which is half the width between window centres away receives a weighting of \(1/2\). Lowering the value of \(f\) causes points further from the window centre to receive less weight.
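
The two formulas above translate directly into a few lines of Python. The function and argument names below are chosen for this sketch and are not taken from the silicone source.

```python
def window_weight(x, x_window, window_spacing, decay_length_factor=1.0):
    """Weight of a data point at `x` for a window centred at `x_window`.

    `window_spacing` is b (the distance between window centres) and
    `decay_length_factor` is f in the formulas above.
    """
    d_n = (x - x_window) / (decay_length_factor * (window_spacing / 2))
    return 1.0 / (1.0 + d_n ** 2)


# A point half a window spacing from the centre, for three decay factors.
weight_default = window_weight(5.0, 0.0, 10.0)
weight_small_f = window_weight(5.0, 0.0, 10.0, decay_length_factor=0.5)
weight_large_f = window_weight(5.0, 0.0, 10.0, decay_length_factor=2.0)
```

With \(f=1\) the point receives a weight of one half; halving \(f\) lowers its weight and doubling \(f\) raises it.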

With these weightings, the desired quantile of the data is then calculated. This calculation is done by sorting the data by the database’s follow timeseries values (then by lead timeseries values in the case of identical follow values). From here, the weight of each point is calculated following the formula given above. We calculate the cumulative sum of weights, and then the cumulative sum up to half weights, defined by

\[c_{hw} = c_w - 0.5 \times w\]

where \(c_w\) is the cumulative weights and \(w\) is the raw weights. This ensures that quantiles less than half the weight of the smallest follow value return the smallest follow value and more than one minus half the weight of the largest follow value return the largest value. Without such a shift, the largest value is only returned if the quantile is 1, leading to a bias towards smaller values.

With these calculations, we have determined the relationship between the follow timeseries values and the quantile i.e. cumulative sum of (normalised) weights. We can then determine arbitrary quantiles by linearly interpolating.
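
A minimal sketch of this weighted-quantile calculation, assuming the weights have already been computed and are all positive. The function name and the sample data are hypothetical, and ties in the follow values are broken here by weight rather than by lead value, so this is an illustration of the steps rather than the library's implementation.

```python
def weighted_quantile(follow_values, weights, quantile):
    """Weighted quantile of `follow_values` using the half-weight shift."""
    pairs = sorted(zip(follow_values, weights))
    total = sum(weight for _, weight in pairs)

    # Cumulative sum up to half weights, normalised by the total weight.
    positions = []
    cumulative = 0.0
    for _, weight in pairs:
        cumulative += weight
        positions.append((cumulative - 0.5 * weight) / total)
    values = [value for value, _ in pairs]

    # Outside the range of positions, return the smallest or largest value.
    if quantile <= positions[0]:
        return values[0]
    if quantile >= positions[-1]:
        return values[-1]
    # Otherwise interpolate linearly between neighbouring points.
    for i in range(1, len(values)):
        if quantile <= positions[i]:
            frac = (quantile - positions[i - 1]) / (positions[i] - positions[i - 1])
            return values[i - 1] + frac * (values[i] - values[i - 1])


# Hypothetical follow values and their window weights.
follow_values = [3.0, 1.0, 2.0]
weights = [1.0, 1.0, 2.0]

median = weighted_quantile(follow_values, weights, 0.5)
```

In this example the shifted, normalised cumulative weights are 0.125, 0.5 and 0.875, so any quantile at or below 0.125 returns the smallest follow value and any quantile at or above 0.875 returns the largest.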

If the option use_ratio is set to True, instead of returning the absolute value of the follow at this quantile, we return the quantile of the ratio between the lead and follow data in the database, multiplied by the actual lead value of the database being infilled.

By varying the quantile, this cruncher can provide ranges of the relationship between different variables. For example, it can provide the 90th percentile (i.e. high end) of the relationship between e.g. Emissions|CH4 and Emissions|CO2 or the 50th percentile (i.e. median) or any other arbitrary percentile/quantile choice. Note that the impact of this will strongly depend on nwindows and decay_length_factor. Using the TimeDepQuantileRollingWindows class makes it possible to specify a dictionary of dates to quantiles, in which case we return that quantile for that year or date.

derive_relationship(variable_follower, variable_leaders, quantile=0.5, nwindows=11, decay_length_factor=1, use_ratio=False)[source]

Derive the relationship between two variables from the database.

Parameters
  • variable_follower (str) – The variable for which we want to calculate timeseries (e.g. "Emissions|CH4").

  • variable_leaders (list[str]) – The variable(s) we want to use in order to infer timeseries of variable_follower (e.g. ["Emissions|CO2"]).

  • quantile (float) – The quantile to return in each window.

  • nwindows (int) – The number of window centers to use when calculating the relationship between the follower and lead gases.

  • decay_length_factor (float) – Parameter which controls how strongly points away from the window’s centre should be weighted compared to points at the centre. Larger values give points further away increasingly more weight, smaller values give points further away increasingly less weight.

  • use_ratio (bool) – If false, we use the quantile value of the weighted mean absolute value. If true, we find the quantile weighted mean ratio between lead and follow, then multiply the ratio by the input value.

Returns

Function which takes a pyam.IamDataFrame containing variable_leaders timeseries and returns timeseries for variable_follower based on the derived relationship between the two. Please see the source code for the exact definition (and docstring) of the returned function.

Return type

func

Raises
  • ValueError – There is no data for variable_leaders or variable_follower in the database.

  • ValueError – quantile is not between 0 and 1.

  • ValueError – nwindows is not equivalent to an integer or is not greater than 1.

  • ValueError – decay_length_factor is 0.

Time dependent quantile rolling windows cruncher API

Module for the database cruncher which uses the ‘rolling windows’ technique with different quantiles in different years.

class silicone.database_crunchers.time_dep_quantile_rolling_windows.TimeDepQuantileRollingWindows(db)[source]

Bases: _DatabaseCruncher

Database cruncher which uses QuantileRollingWindows with different quantiles in every year/datetime.

derive_relationship(variable_follower, variable_leaders, time_quantile_dict, **kwargs)[source]

Derive the relationship between two variables from the database.

For details of most parameters, see QuantileRollingWindows. The one different parameter is time_quantile_dict, described below:

Parameters
  • variable_follower (str) – The variable for which we want to calculate timeseries (e.g. "Emissions|CH4").

  • variable_leaders (list[str]) – The variable(s) we want to use in order to infer timeseries of variable_follower (e.g. ["Emissions|CO2"]).

  • time_quantile_dict (dict{datetime or int: float}) – Every year or datetime in the infillee database must be specified as a key. The value is the quantile to use in that year. Note that the impact of the quantile value is strongly dependent on the arguments passed to QuantileRollingWindows.

  • **kwargs – Passed to QuantileRollingWindows.

Returns

Function which takes a pyam.IamDataFrame containing variable_leaders timeseries and returns timeseries for variable_follower based on the derived relationship between the two. Please see the source code for the exact definition (and docstring) of the returned function.

Return type

func

Raises

ValueError – Not all times in time_quantile_dict have data in the database.

Time dependent ratio cruncher API

Module for the database cruncher which uses the ‘time-dependent ratio’ technique.

class silicone.database_crunchers.time_dep_ratio.TimeDepRatio(db)[source]

Bases: _DatabaseCruncher

Database cruncher which uses the ‘time-dependent ratio’ technique.

This cruncher derives the relationship between two variables by simply assuming that the follower timeseries is equal to the lead timeseries multiplied by a time-dependent scaling factor. The scaling factor is the ratio of the follower variable to the lead variable. If the database contains many such pairs, the scaling factor is the ratio between the means of the values. By default, the calculation will include only values where the lead variable takes the same sign (+ or -) in the infilling database as in the case infilled. This prevents getting negative values of emissions that cannot be negative. To allow cases where we have no data of the correct sign, set same_sign = False in derive_relationship.

Once the relationship is derived, the ‘filler’ function will infill following:

\[E_f(t) = R(t) * E_l(t)\]

where \(E_f(t)\) is emissions of the follower variable and \(E_l(t)\) is emissions of the lead variable.
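
The following pure-Python sketch illustrates this infilling, taking the scaling factor at each time as the ratio of the database means with an optional same-sign filter. It is an assumption-laden illustration, not the silicone implementation: the function name, the dictionary layout, treating zero as non-negative and raising a `ValueError` when no data of the right sign exists are all choices made for this example.

```python
def time_dependent_ratio(lead_db, follow_db, infillee_lead, same_sign=True):
    """Infill follower values as R(t) * E_l(t) for each time in `infillee_lead`.

    `lead_db` and `follow_db` map each time to a list of database values,
    paired by position (one entry per model/scenario). This is a hypothetical layout.
    """
    infilled = {}
    for time, lead_value in infillee_lead.items():
        pairs = list(zip(lead_db[time], follow_db[time]))
        if same_sign:
            # Keep only database points whose lead has the infillee's sign.
            pairs = [(l, f) for l, f in pairs if (l >= 0) == (lead_value >= 0)]
        if not pairs:
            raise ValueError(f"No database values of the required sign at {time}")
        mean_lead = sum(l for l, _ in pairs) / len(pairs)
        mean_follow = sum(f for _, f in pairs) / len(pairs)
        infilled[time] = lead_value * mean_follow / mean_lead
    return infilled


# Hypothetical database with two scenarios and an infillee lead timeseries.
lead_db = {2020: [10.0, 20.0], 2050: [-5.0, 10.0]}
follow_db = {2020: [1.0, 2.0], 2050: [0.5, 1.5]}
infillee_lead = {2020: 30.0, 2050: 10.0}

infilled = time_dependent_ratio(lead_db, follow_db, infillee_lead)
infilled_any_sign = time_dependent_ratio(lead_db, follow_db, infillee_lead, same_sign=False)
```

In 2050 the negative database lead value is excluded when `same_sign=True`, which changes the scaling factor for that year.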

\(R(t)\) is the scaling factor, calculated as the ratio of the means of the follower and the leader in the infiller database, denoted with lower case e. By default, we include only cases where sign(e_l(t)) is the same in both databases. The cruncher will raise a warning if the lead data is ever negative, which can create complications for the use of this cruncher.

\[R(t) = \frac{mean( e_f(t) )}{mean( e_l(t) )}\]

derive_relationship(variable_follower, variable_leaders, same_sign=True, only_consistent_cases=True)[source]

Derive the relationship between two variables from the database.

Parameters
  • variable_follower (str) – The variable for which we want to calculate timeseries (e.g. "Emissions|C5F12").

  • variable_leaders (list[str]) – The variable we want to use in order to infer timeseries of variable_follower (e.g. ["Emissions|CO2"]).

  • same_sign (bool) – Do we want to only use data where the leader has the same sign in the infiller and infillee data? If so, we have a potential error from not having data of the correct sign, but have more confidence in the sign of the follower data.

  • only_consistent_cases (bool) – Do we want to only use model/scenario combinations where both lead and follow have data at all times? This will reduce the risk of inconsistencies or unevenness in the results, but will slightly decrease performance speed if you know the data is consistent. Scenario/model pairs where data is only returned at certain times will be removed, as will any scenarios not returning both lead and follow data.

Returns

Function which takes a pyam.IamDataFrame containing variable_leaders timeseries and returns timeseries for variable_follower based on the derived relationship between the two. Please see the source code for the exact definition (and docstring) of the returned function.

Return type

func

Raises
  • ValueError – variable_leaders contains more than one variable.

  • ValueError – There is no data for variable_leaders or variable_follower in the database.