
mibiscreen.analysis API reference

mibiscreen module for data analysis.

reduction

mibiscreen module for data analysis reducing sample data.

ordination

Routines for performing ordination statistics on sample data.

@author: Alraune Zech, Jorrit Bakker

cca(data_frame, independent_variables, dependent_variables, n_comp=2, verbose=False)

Function that performs Canonical Correspondence Analysis.

Function makes use of skbio.stats.ordination.CCA on the input data and gives the site scores and loadings.

Input
data_frame : pd.dataframe
    Tabular data containing variables to be evaluated with standard
    column names and rows of sample data.
independent_variables : list of strings
    list of column names to be used as the independent variables (= environment)
dependent_variables : list of strings
    list of column names to be used as the dependent variables (= species)
n_comp : int, default is 2
    number of dimensions to return
verbose : Boolean, The default is False.
    Set to True to get messages in the Console about the status of the run code.
Output
results : Dictionary
    * method: name of ordination method (str)
    * loadings_independent: loadings of independent variables (np.ndarray)
    * loadings_dependent: loadings of dependent variables (np.ndarray)
    * names_independent: names of independent variables (list of str)
    * names_dependent: names of dependent variables (list of str)
    * scores: scores (np.ndarray)
    * sample_index: names of samples (list of str)
Source code in mibiscreen/analysis/reduction/ordination.py
def cca(data_frame,
        independent_variables,
        dependent_variables,
        n_comp = 2,
        verbose = False,
        ):
    """Function that performs Canonical Correspondence Analysis.

    Function makes use of skbio.stats.ordination.CCA on the input data and gives
    the site scores and loadings.

    Input
    -----
        data_frame : pd.dataframe
            Tabular data containing variables to be evaluated with standard
            column names and rows of sample data.
        independent_variables : list of strings
            list of column names to be used as the independent variables (= environment)
        dependent_variables : list of strings
            list of column names to be used as the dependent variables (= species)
        n_comp : int, default is 2
            number of dimensions to return
        verbose : Boolean, The default is False.
            Set to True to get messages in the Console about the status of the run code.

    Output
    ------
        results : Dictionary
            * method: name of ordination method (str)
            * loadings_independent: loadings of independent variables (np.ndarray)
            * loadings_dependent: loadings of dependent variables (np.ndarray)
            * names_independent: names of independent variables (list of str)
            * names_dependent: names of dependent variables (list of str)
            * scores: scores (np.ndarray)
            * sample_index: names of samples (list of str)
    """
    if verbose:
        print('==============================================================')
        print(" Running function 'cca()' on data")
        print('==============================================================')

    results = constrained_ordination(data_frame,
                           independent_variables,
                           dependent_variables,
                           method = 'cca',
                           n_comp = n_comp,
                           )
    return results

constrained_ordination(data_frame, independent_variables, dependent_variables, method='cca', n_comp=2)

Function that performs constrained ordination.

Function makes use of skbio.stats.ordination on the input data and gives the scores and loadings.

Input
data_frame : pd.DataFrame
    Tabular data containing variables to be evaluated with standard
    column names and rows of sample data.
independent_variables : list of strings
    list of column names to be used as the independent variables (= environment)
dependent_variables : list of strings
    list of column names to be used as the dependent variables (= species)
method : string, default is cca
    specification of ordination method of choice. Options 'cca' & 'rda'
n_comp : int, default is 2
    number of dimensions to return
Output
results : Dictionary
    * method: name of ordination method (str)
    * loadings_independent: loadings of independent variables (np.ndarray)
    * loadings_dependent: loadings of dependent variables (np.ndarray)
    * names_independent: names of independent variables (list of str)
    * names_dependent: names of dependent variables (list of str)
    * scores: scores (np.ndarray)
    * sample_index: names of samples (list of str)
Source code in mibiscreen/analysis/reduction/ordination.py
def constrained_ordination(data_frame,
                           independent_variables,
                           dependent_variables,
                           method = 'cca',
                           n_comp = 2,
        ):
    """Function that performs constrained ordination.

    Function makes use of skbio.stats.ordination on the input data and gives
    the scores and loadings.

    Input
    -----
        data_frame : pd.DataFrame
            Tabular data containing variables to be evaluated with standard
            column names and rows of sample data.
        independent_variables : list of strings
            list of column names to be used as the independent variables (= environment)
        dependent_variables : list of strings
            list of column names to be used as the dependent variables (= species)
        method : string, default is cca
            specification of ordination method of choice. Options 'cca' & 'rda'
        n_comp : int, default is 2
            number of dimensions to return

    Output
    ------
        results : Dictionary
            * method: name of ordination method (str)
            * loadings_independent: loadings of independent variables (np.ndarray)
            * loadings_dependent: loadings of dependent variables (np.ndarray)
            * names_independent: names of independent variables (list of str)
            * names_dependent: names of dependent variables (list of str)
            * scores: scores (np.ndarray)
            * sample_index: names of samples (list of str)
    """
    data,cols= check_data_frame(data_frame)

    intersection = _extract_variables(cols,
                          independent_variables,
                          name_variables = 'independent variables'
                          )
    data_independent_variables = data[intersection]

    intersection = _extract_variables(cols,
                          dependent_variables,
                          name_variables = 'dependent variables'
                          )
    data_dependent_variables = data[intersection]

    # Checking if the dimensions of the dataframe allow for CCA
    if (data_dependent_variables.shape[0] < data_dependent_variables.shape[1]) or \
        (data_independent_variables.shape[0] < data_independent_variables.shape[1]):
        raise ValueError("Ordination method {} not possible with more variables than samples.".format(method))

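    # Performing constrained ordination using function from scikit-bio.
    # NOTE: in scikit-bio, `scaling` selects the scaling type (1 or 2), not the
    # number of axes, so passing n_comp here is only meaningful for n_comp <= 2.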
    if method == 'cca':
        try:
            sci_ordination = sciord.cca(data_dependent_variables, data_independent_variables, scaling = n_comp)
        except(ValueError):
            raise ValueError("There are rows which only contain zero values. "
                             "Consider another option for data filtering and/or standardization.")
        except(TypeError):
            raise TypeError("Not all column values are numeric values. Consider standardizing data first.")
    elif method == 'rda':
        try:
            sci_ordination = sciord.rda(data_dependent_variables, data_independent_variables, scaling = n_comp)
        except(TypeError,ValueError):
            raise TypeError("Not all column values are numeric values. Consider standardizing data first.")
    else:
        raise ValueError("Ordination method {} not a valid option.".format(method))

    loadings_independent = sci_ordination.biplot_scores.to_numpy()[:,0:n_comp]
    loadings_dependent = sci_ordination.features.to_numpy()[:,0:n_comp]
    scores = sci_ordination.samples.to_numpy()[:,0:n_comp]

    if loadings_independent.shape[1]<n_comp:
        raise ValueError("Number of dependent variables too small.")

    results = {"method": method,
                "loadings_dependent": loadings_dependent,
                "loadings_independent": loadings_independent,
                "names_independent" : data_independent_variables.columns.to_list(),
                "names_dependent" : data_dependent_variables.columns.to_list(),
                "scores": scores,
                "sample_index" : list(data.index),
                }

    return results
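
The two failure modes that constrained_ordination() reports (more variables than samples, and rows containing only zeros) can be checked before any ordination is attempted. The following is a minimal sketch under stated assumptions: the helper name precheck_constrained_ordination and the example tables are hypothetical and not part of mibiscreen, and plain NumPy arrays stand in for the selected DataFrame columns.

```python
import numpy as np


def precheck_constrained_ordination(dependent, independent):
    """Return a list of detected problems (empty list: nothing found).

    Hypothetical helper for illustration only, not part of mibiscreen.
    """
    dep = np.asarray(dependent, dtype=float)
    indep = np.asarray(independent, dtype=float)
    problems = []

    # Both tables must describe the same samples (rows).
    if dep.shape[0] != indep.shape[0]:
        problems.append("tables have different numbers of samples")

    # Mirrors the shape check in the listing above.
    for name, table in (("dependent", dep), ("independent", indep)):
        n_samples, n_variables = table.shape
        if n_samples < n_variables:
            problems.append(f"{name}: more variables than samples")

    # Rows of the dependent table that contain only zeros.
    zero_rows = np.flatnonzero(~dep.any(axis=1))
    if zero_rows.size:
        problems.append(f"dependent: all-zero rows at positions {zero_rows.tolist()}")

    return problems


# Made-up example: 4 samples, 3 dependent and 2 independent variables.
species = [[3, 0, 1], [0, 0, 0], [2, 5, 0], [1, 1, 4]]
environment = [[7.1, 0.2], [6.8, 0.4], [7.4, 0.1], [6.9, 0.3]]
problems = precheck_constrained_ordination(species, environment)
print(problems)
```

Returning a list of messages instead of raising keeps the sketch usable for reporting all issues at once; the library itself raises on the first problem it meets.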

pca(data_frame, independent_variables=False, dependent_variables=False, n_comp=2, verbose=False)

Function that performs Principal Component Analysis.

Makes use of routine sklearn.decomposition.PCA on the input data and gives the site scores and loadings.

Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing. The data is linearly transformed onto a new coordinate system such that the directions (principal components) capturing the largest variation in the data can be easily identified.

Input
data_frame : pd.dataframe
    Tabular data containing variables to be evaluated with standard
    column names and rows of sample data.
independent_variables : Boolean or list of strings; default False
    list with column names to select from data_frame
    being characterized as independent variables (= environment)
dependent_variables : Boolean or list of strings; default is False
    list with column names to select from data_frame
    being characterized as dependent variables (= species)
n_comp : int, default is 2
    Number of components to report
verbose : Boolean, The default is False.
   Set to True to get messages in the Console about the status of the run code.
Output
results : Dictionary
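    * method: name of ordination method, here 'pca' (str)
    * loadings_independent: loadings of independent variables (np.ndarray)
    * loadings_dependent: loadings of dependent variables (np.ndarray)
    * names_independent: names of independent variables (same length as loadings_independent)
    * names_dependent: names of dependent variables (same length as loadings_dependent)
    * scores: scores of the first n_comp principal components (np.ndarray)
    * sample_index: names of samples (same length as scores)
    * percent_explained: percentage of the total variance explained by each principal component (np.ndarray)
    * corr_PC1_PC2: correlation coefficient between the scores of the first two PCs (float)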
Source code in mibiscreen/analysis/reduction/ordination.py
def pca(data_frame,
        independent_variables = False,
        dependent_variables = False,
        n_comp = 2,
        verbose = False,
        ):
    """Function that performs Principal Component Analysis.

    Makes use of routine sklearn.decomposition.PCA on the input data and gives
    the site scores and loadings.

    Principal component analysis (PCA) is a linear dimensionality reduction
    technique with applications in exploratory data analysis, visualization
    and data preprocessing. The data is linearly transformed onto a new
    coordinate system such that the directions (principal components) capturing
    the largest variation in the data can be easily identified.

    Input
    -----
        data_frame : pd.dataframe
            Tabular data containing variables to be evaluated with standard
            column names and rows of sample data.
        independent_variables : Boolean or list of strings; default False
            list with column names to select from data_frame
            being characterized as independent variables (= environment)
        dependent_variables : Boolean or list of strings; default is False
            list with column names to select from data_frame
            being characterized as dependent variables (= species)
        n_comp : int, default is 2
            Number of components to report
        verbose : Boolean, The default is False.
           Set to True to get messages in the Console about the status of the run code.

    Output
    ------
        results : Dictionary
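            * method: name of ordination method, here 'pca' (str)
            * loadings_independent: loadings of independent variables (np.ndarray)
            * loadings_dependent: loadings of dependent variables (np.ndarray)
            * names_independent: names of independent variables (same length as loadings_independent)
            * names_dependent: names of dependent variables (same length as loadings_dependent)
            * scores: scores of the first n_comp principal components (np.ndarray)
            * sample_index: names of samples (same length as scores)
            * percent_explained: percentage of the total variance explained by each principal component (np.ndarray)
            * corr_PC1_PC2: correlation coefficient between the scores of the first two PCs (float)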
    """
    if verbose:
        print('==============================================================')
        print(" Running function 'pca()' on data")
        print('==============================================================')

    data,cols= check_data_frame(data_frame)

    if independent_variables is False and dependent_variables is False:
        data_pca = data
        names_independent = cols
        names_dependent = []

    elif independent_variables is not False and dependent_variables is False:
        names_independent = _extract_variables(cols,
                              independent_variables,
                              name_variables = 'independent variables'
                              )
        names_dependent = []
        data_pca = data[names_independent]
    elif independent_variables is False and dependent_variables is not False:
        names_dependent = _extract_variables(cols,
                              dependent_variables,
                              name_variables = 'dependent variables'
                              )
        names_independent = []
        data_pca = data[names_dependent]

    else:
        names_independent = _extract_variables(cols,
                              independent_variables,
                              name_variables = 'independent variables'
                              )
        names_dependent = _extract_variables(cols,
                              dependent_variables,
                              name_variables = 'dependent variables'
                              )
        data_pca = data[names_independent + names_dependent]

    # Checking if the dimensions of the dataframe allow for PCA
    if data_pca.shape[0] < data_pca.shape[1]:
        raise ValueError("PCA not possible with more variables than samples.")

    try:
        # Using sklearn.decomposition.PCA with a number of components equal to the
        # number of variables, then getting the loadings, scores and explained variance ratio.
        pca = decomposition.PCA(n_components=len(data_pca.columns))
        pca.fit(data_pca)
        loadings = pca.components_.T
        PCAscores = pca.transform(data_pca)
        variances = pca.explained_variance_ratio_
    except(ValueError,TypeError):
        raise TypeError("Not all column values are numeric values (or NaN). Consider standardizing data first.")

    # Keeping the first n_comp principal components
    if dependent_variables is False:
        loadings_independent = loadings[:, 0:n_comp]
        loadings_dependent = np.array([[],[]]).T
    else:
        loadings_independent = loadings[:-len(names_dependent), 0:n_comp]
        loadings_dependent = loadings[-len(names_dependent):, 0:n_comp]
    scores = PCAscores[:, 0:n_comp]
    percent_explained = np.around(100*variances/np.sum(variances), decimals=2)
    coef = np.corrcoef(scores[:,0], scores[:,1])[0,1]

    if verbose:
        print("Information about the success of the PCA:")
        print('----------------------------------------------------------------')
        for i in range(len(percent_explained)):
            print('Principal component {} explains {}% of the total variance.'.format(i+1,percent_explained[i]))
        print('\nThe correlation coefficient between PC1 and PC2 is {:.2e}.'.format(coef))
        print('----------------------------------------------------------------')

    results = {"method": 'pca',
               "loadings_dependent": loadings_dependent,
               "loadings_independent": loadings_independent,
               "names_independent" : names_independent,
               "names_dependent" : names_dependent,
               "scores": scores,
               "sample_index" : list(data_pca.index),
               "percent_explained": percent_explained,
               "corr_PC1_PC2": coef,
               }

    return results
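
The quantities that pca() collects (loadings, scores, percentage of explained variance and the correlation between the first two score columns) can be reproduced, as a rough sketch, with a singular value decomposition of the mean-centered data in plain NumPy. The synthetic data, the variable names and the direct use of np.linalg.svd are assumptions for illustration; the signs of individual components may differ from the scikit-learn output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic table: 50 samples, 3 variables, the first two strongly correlated.
base = rng.normal(size=(50, 1))
data = np.hstack([
    base + 0.1 * rng.normal(size=(50, 1)),
    2.0 * base + 0.1 * rng.normal(size=(50, 1)),
    rng.normal(size=(50, 1)),
])

# PCA works on mean-centered data.
centered = data - data.mean(axis=0)

# Rows of vt are the principal axes, ordered by decreasing singular value.
u, s, vt = np.linalg.svd(centered, full_matrices=False)

n_comp = 2
loadings = vt.T[:, :n_comp]      # one row per variable
scores = centered @ loadings     # one row per sample

# Variance carried by each component, expressed as a percentage.
variances = s**2 / (data.shape[0] - 1)
percent_explained = np.around(100 * variances / variances.sum(), decimals=2)

# Score columns of different components are uncorrelated by construction.
corr_pc1_pc2 = np.corrcoef(scores[:, 0], scores[:, 1])[0, 1]

print(percent_explained)
print(corr_pc1_pc2)
```

A correlation coefficient that is numerically close to zero is therefore expected and mainly serves as a sanity check on the decomposition.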

rda(data_frame, independent_variables, dependent_variables, n_comp=2, verbose=False)

Function that performs Redundancy Analysis.

Function makes use of skbio.stats.ordination.RDA on the input data and gives the site scores and loadings.

Input
data_frame : pd.dataframe
    Tabular data containing variables to be evaluated with standard
    column names and rows of sample data.
independent_variables : list of strings
    list of column names to be used as the independent variables (= environment)
dependent_variables : list of strings
    list of column names to be used as the dependent variables (= species)
n_comp : int, default is 2
    number of dimensions to return
verbose : Boolean, The default is False.
    Set to True to get messages in the Console about the status of the run code.
Output
results : Dictionary
    * method: name of ordination method (str)
    * loadings_independent: loadings of independent variables (np.ndarray)
    * loadings_dependent: loadings of dependent variables (np.ndarray)
    * names_independent: names of independent variables (list of str)
    * names_dependent: names of dependent variables (list of str)
    * scores: scores (np.ndarray)
    * sample_index: names of samples (list of str)
Source code in mibiscreen/analysis/reduction/ordination.py
def rda(data_frame,
        independent_variables,
        dependent_variables,
        n_comp = 2,
        verbose = False,
        ):
    """Function that performs Redundancy Analysis.

    Function makes use of skbio.stats.ordination.RDA on the input data and gives
    the site scores and loadings.

    Input
    -----
        data_frame : pd.dataframe
            Tabular data containing variables to be evaluated with standard
            column names and rows of sample data.
        independent_variables : list of strings
            list of column names to be used as the independent variables (= environment)
        dependent_variables : list of strings
            list of column names to be used as the dependent variables (= species)
        n_comp : int, default is 2
            number of dimensions to return
        verbose : Boolean, The default is False.
            Set to True to get messages in the Console about the status of the run code.

    Output
    ------
        results : Dictionary
            * method: name of ordination method (str)
            * loadings_independent: loadings of independent variables (np.ndarray)
            * loadings_dependent: loadings of dependent variables (np.ndarray)
            * names_independent: names of independent variables (list of str)
            * names_dependent: names of dependent variables (list of str)
            * scores: scores (np.ndarray)
            * sample_index: names of samples (list of str)
    """
    if verbose:
        print('==============================================================')
        print(" Running function 'rda()' on data")
        print('==============================================================')

    results = constrained_ordination(data_frame,
                           independent_variables,
                           dependent_variables,
                           method = 'rda',
                           n_comp = n_comp,
                           )
    return results

stable_isotope_regression

Routines for performing linear regression on isotope data.

@author: Alraune Zech

Keeling_regression(concentration, delta_mix=None, relative_abundance=None, validate_indices=True, verbose=False, **kwargs)

Performing a linear regression linked to the Keeling plot.

A Keeling fit/plot is an approach to identify the isotopic composition of a contaminating source from measured concentrations and isotopic composition (delta) of a target species in the mix of the source and a pool.

It is based on the linear relationship of the given quantities (concentration) and delta-values (or alternatively the relative abundance x) which are measured over time or across a spatial interval according to

delta_mix = delta_source + m * 1/c_mix

where m is the slope relating the isotopic quantities of the pool (which mixes with the source) by m = (delta_pool - delta_source)*c_pool.

The analysis is based on a linear regression of the delta (or x) values against the inverse concentration data. The parameter of interest, the delta (or relative_abundance, respectively) of the source quantity, is the intercept of the linear fit with the y-axis, in other words the constant term of the linear fit function.

A plot of the results with data and linear trendline can be generated with the method Keeling_plot() [in the module visualize].

Note that the approach is only applicable if (i) the isotopic composition of the unknown source is constant and (ii) the concentration and isotopic composition of the target compound are constant over time or across space (i.e. in the absence of contamination from the unknown source).
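
The linear Keeling relation follows from a two-component mass balance. The short derivation below is a sketch under the assumption that the mixture consists only of the pool and the source; the symbol c_source (concentration contributed by the source) is introduced here for illustration and does not appear in the function signature.

```latex
\begin{align}
c_{\mathrm{mix}} &= c_{\mathrm{pool}} + c_{\mathrm{source}} \\
\delta_{\mathrm{mix}}\, c_{\mathrm{mix}} &= \delta_{\mathrm{pool}}\, c_{\mathrm{pool}} + \delta_{\mathrm{source}}\, c_{\mathrm{source}} \\
\delta_{\mathrm{mix}} &= \delta_{\mathrm{source}}
  + \left(\delta_{\mathrm{pool}} - \delta_{\mathrm{source}}\right) c_{\mathrm{pool}} \cdot \frac{1}{c_{\mathrm{mix}}}
\end{align}
```

The last line is obtained by substituting c_source = c_mix - c_pool into the isotope balance and dividing by c_mix; the factor in front of 1/c_mix is the slope m, and the intercept is the delta of the source.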

Input
concentration : np.array, pd.dataframe
    total molecular mass/molar concentration of target substance
    at different locations (at a time) or at different times (at one location)
delta_mix : np.array, pd.dataframe (same length as concentration), default None
    relative isotope ratio (delta-value) of target substance
relative_abundance : None or np.array, pd.dataframe (same length as concentration), default None
    relative abundance of target substance;
    used in place of delta_mix in the inverse estimation and plotting
    when delta_mix is None
validate_indices: boolean, default True
    flag to run index validation (i.e. removal of NaN, infinity and zero values)
verbose : Boolean, The default is False.
    Set to True to get messages in the Console about the status of the run code.
**kwargs : dict
    keyword arguments dictionary, e.g. for passing forward keywords to
    valid_indices()
Returns
results : dict
    results of fitting, including:
        * coefficients : array/list of length 2, where coefficients[0]
            is the slope of the linear fit and coefficients[1] is the
            intercept of linear fit with y-axis, reflecting delta
            (or relative_abundance, respectively) of the source quantity
        * concentration: np.array with concentration data used for fitting
            (after index validation, if activated)
        * delta: np.array with delta (or relative_abundance) data used for
            fitting (after index validation, if activated)
Source code in mibiscreen/analysis/reduction/stable_isotope_regression.py
def Keeling_regression(concentration,
                       delta_mix = None,
                       relative_abundance = None,
                       validate_indices = True,
                       verbose = False,
                       **kwargs,
                       ):
    """Performing a linear regression linked to the Keeling plot.

    A Keeling fit/plot is an approach to identify the isotopic composition of a
    contaminating source from measured concentrations and isotopic composition
    (delta) of a target species in the mix of the source and a pool.

    It is based on the linear relationship of the given quantities (concentration)
    and delta-values (or alternatively the relative abundance x) which are measured
    over time or across a spatial interval according to

        delta_mix = delta_source + m * 1/c_mix

    where m is the slope relating the isotopic quantities of the pool (which mixes
    with the source) by m = (delta_pool - delta_source)*c_pool.

    The analysis is based on a linear regression of the delta (or x)-values
    against the inverse concentration data. The parameter of interest, the delta
    (or relative_abundance, respectively) of the source quantity is the
    intercept of the linear fit with the y-axis, or in other words, the constant
    term of the linear fit function.

    A plot of the results with data and linear trendline can be generated with the
    method Keeling_plot() [in the module visualize].

    Note that the approach is only applicable if
        (i)  the isotopic composition of the unknown source is constant
        (ii) the concentration and isotopic composition of the target compound
            is constant (over time or across space)
            (i.e. in absence of contamination from the unknown source)

    Input
    -----
        concentration : np.array, pd.dataframe
            total molecular mass/molar concentration of target substance
            at different locations (at a time) or at different times (at one location)
        delta_mix : np.array, pd.dataframe (same length as concentration), default None
            relative isotope ratio (delta-value) of target substance
        relative_abundance : None or np.array, pd.dataframe (same length as concentration), default None
            relative abundance of target substance;
            used in place of delta_mix in the inverse estimation and plotting
            when delta_mix is None
        validate_indices: boolean, default True
            flag to run index validation (i.e. removal of NaN, infinity and zero values)
        verbose : Boolean, The default is False.
            Set to True to get messages in the Console about the status of the run code.
        **kwargs : dict
            keyword arguments dictionary, e.g. for passing forward keywords to
            valid_indices()

    Returns
    -------
        results : dict
            results of fitting, including:
                * coefficients : array/list of length 2, where coefficients[0]
                    is the slope of the linear fit and coefficients[1] is the
                    intercept of linear fit with y-axis, reflecting delta
                    (or relative_abundance, respectively) of the source quantity
                * concentration: np.array with concentration data used for fitting
                    (after index validation, if activated)
                * delta: np.array with delta (or relative_abundance) data used for
                    fitting (after index validation, if activated)
    """
    if verbose:
        print('==============================================================')
        print(" Running function 'Keeling_regression()' on data")
        print('==============================================================')

    if delta_mix is not None:
        y = delta_mix
        text = 'delta'
    elif relative_abundance is not None:
        y = relative_abundance
        text = 'relative abundance'
    else:
        raise ValueError("One of the quantities 'delta_mix' or 'relative_abundance' must be provided")

    ### ---------------------------------------------------------------------------
    ### check length of data arrays and remove non-valid values (NaN, inf & zero)

    if validate_indices:
        data1, data2 = valid_indices(concentration,
                                 y,
                                 remove_nan = True,
                                 remove_infinity = True,
                                 remove_zero = True,
                                 **kwargs,
                                 )
    else:
        data1, data2 = concentration,y

    ### ---------------------------------------------------------------------------
    ### perform linear regression

    coefficients = np.polyfit(1./data1, data2, 1)

    if verbose:
        print("The {} of the source quantity, being the intercept".format(text))
        print("of the linear fit, is identified with {:.2f}".format(coefficients[1]))
        print('______________________________________________________________')

    results = dict(
        concentration = data1,
        delta = data2,
        coefficients = coefficients,
        )

    return results
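
The regression step of Keeling_regression() can be illustrated with synthetic data whose source signature is known in advance. This is a minimal sketch, not the library routine: all numbers are made up, the data are generated from a two-component mixing balance (an assumption), and only np.polyfit is used, as in the listing above.

```python
import numpy as np

# Assumed "true" values used to generate the synthetic mixture (made up).
delta_source_true = -30.0   # isotopic signature of the source
delta_pool = -8.0           # isotopic signature of the pool
c_pool = 400.0              # concentration of the pool

# Different amounts of source added to the same pool.
c_source = np.array([10.0, 25.0, 50.0, 100.0, 200.0])
c_mix = c_pool + c_source
delta_mix = (delta_pool * c_pool + delta_source_true * c_source) / c_mix

# Linear fit of delta_mix against the inverse concentration:
# coefficients[0] is the slope, coefficients[1] the intercept.
coefficients = np.polyfit(1.0 / c_mix, delta_mix, 1)
slope, intercept = coefficients

print("estimated delta of the source:", round(intercept, 3))
print("estimated slope:", round(slope, 3))
```

With noise-free data the intercept reproduces the source signature and the slope equals (delta_pool - delta_source)*c_pool; with measured data both are estimates whose quality depends on how well the two applicability conditions hold.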

Lambda_regression(delta_C, delta_H, validate_indices=True, verbose=False, **kwargs)

Performing linear regression to achieve Lambda value.

The Lambda value relates the δ13C versus δ2H signatures of a chemical compound. Relative changes in the ratio can indicate the occurrence of specific enzymatic degradation reactions.

The analysis is based on a linear regression of the hydrogen versus carbon isotope signatures. The parameter of interest, the Lambda value, is the slope of the linear trend line.

A plot of the results with data and linear trendline can be generated with the method Lambda_plot() [in the module visualize].

Input
delta_C : np.array, pd.series
    relative isotope ratio (delta-value) of carbon of target molecule
delta_H : np.array, pd.series (same length as delta_C)
    relative isotope ratio (delta-value) of hydrogen of target molecule
validate_indices: boolean, default True
    flag to run index validation (i.e. removal of NaN, infinity and zero values)
verbose : Boolean, The default is False.
    Set to True to get messages in the Console about the status of the run code.
**kwargs : dict
    keyword arguments dictionary, e.g. for passing forward keywords to
    valid_indices()
Returns
results : dict
    results of fitting, including:
        * coefficients : array/list of length 2, where coefficients[0]
            is the slope of the linear fit, reflecting the Lambda value,
            and coefficients[1] is the constant term (intercept) of the linear function
        * delta_C: np.array with isotope used for fitting - all samples
            where non-zero values are available for delta_C and delta_H
        * delta_H: np.array with isotope used for fitting - all samples
            where non-zero values are available for delta_C and delta_H
Source code in mibiscreen/analysis/reduction/stable_isotope_regression.py
def Lambda_regression(delta_C,
                      delta_H,
                      validate_indices = True,
                      verbose = False,
                      **kwargs,
                      ):
    """Performing linear regression to achieve Lambda value.

    The Lambda value relates the δ13C versus δ2H signatures of a chemical
    compound. Relative changes in the ratio can indicate the occurrence of
    specific enzymatic degradation reactions.

    The analysis is based on a linear regression of the hydrogen versus
    carbon isotope signatures. The parameter of interest, the Lambda value,
    is the slope of the linear trend line.

    A plot of the results with data and linear trendline can be generated with the
    method Lambda_plot() [in the module visualize].

    Input
    -----
        delta_C : np.array, pd.series
            relative isotope ratio (delta-value) of carbon of target molecule
        delta_H : np.array, pd.series (same length as delta_C)
            relative isotope ratio (delta-value) of hydrogen of target molecule
        validate_indices: boolean, default True
            flag to run index validation (i.e. removal of NaN, infinity and zero values)
        verbose : Boolean, The default is False.
            Set to True to get messages in the Console about the status of the run code.
        **kwargs : dict
            keyword arguments dictionary, e.g. for passing forward keywords to
            valid_indices()

    Returns
    -------
        results : dict
            results of fitting, including:
                * coefficients : array/list of length 2, where coefficients[0]
                    is the slope of the linear fit, reflecting the Lambda value,
                    and coefficients[1] is the constant term (intercept) of the linear function
                * delta_C: np.array with isotope used for fitting - all samples
                    where non-zero values are available for delta_C and delta_H
                * delta_H: np.array with isotope used for fitting - all samples
                    where non-zero values are available for delta_C and delta_H
    """
    ### ---------------------------------------------------------------------------
    ### check length of data arrays and remove non-valid values (NaN, inf & zero)

    if verbose:
        print('==============================================================')
        print(" Running function 'Lambda_regression()' on data")
        print('==============================================================')

    if validate_indices:
        # forward additional keywords to valid_indices(), as documented above
        data1, data2 = valid_indices(delta_C,
                                     delta_H,
                                     remove_nan = True,
                                     remove_infinity = True,
                                     remove_zero=True,
                                     **kwargs,
                                     )
    else:
        data1, data2 = delta_C,delta_H

    ### ---------------------------------------------------------------------------
    ### perform linear regression

    coefficients = np.polyfit(data1, data2, 1)

    if verbose:
        print("Lambda value, being the slope of the linear fit is \n identified with {:.2f}".format(coefficients[0]))
        print('______________________________________________________________')

    results = dict(
        delta_C = data1,
        delta_H = data2,
        coefficients = coefficients,
        )

    return results
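
The regression step can be tried in isolation. The following is a minimal sketch, not part of mibiscreen: it builds made-up isotope signatures from an assumed slope and intercept and recovers them with the same `np.polyfit` call that appears in the source above.

```python
import numpy as np

# Synthetic isotope signatures in per mil (made-up values for illustration only)
delta_C = np.array([-29.0, -28.2, -27.5, -26.1, -25.0])
true_lambda = 12.0      # assumed slope used to generate the data
intercept = 250.0       # assumed intercept of the trend line
delta_H = true_lambda * delta_C + intercept

# Same call as in the source: fit delta_H as a linear function of delta_C
coefficients = np.polyfit(delta_C, delta_H, 1)
lambda_value, offset = coefficients
print("Lambda = {:.2f}, intercept = {:.2f}".format(lambda_value, offset))
```

Because the synthetic data are exactly linear, the fitted slope reproduces the assumed Lambda value; real data scatter around the trend line.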

Rayleigh_fractionation(concentration, delta, validate_indices=True, verbose=False, **kwargs)

Performing Rayleigh fractionation analysis.

Rayleigh fractionation is a common approach to characterize the removal of a substance from a finite pool using stable isotopes. It is based on the change in the isotopic composition of the pool caused by the different reaction kinetics of lighter and heavier isotopes.

We follow the simplest approach, assuming that the substance removal follows first-order kinetics, where the rate coefficients for the lighter and heavier isotopes of the substance differ due to kinetic isotope fractionation effects. The isotopic composition of the remaining substance in the pool will change over time, leading to the so-called Rayleigh fractionation.

The analysis is based on a linear regression of the delta-values against the log-transformed concentration data. The parameter of interest, the kinetic fractionation factor (epsilon or alpha - 1) of the removal process, is the slope of the linear trend line.

A plot of the results with data and linear trend line can be generated with the method Rayleigh_fractionation_plot() [in the module visualize].

Input
concentration : np.array, pd.dataframe
    total molecular mass/molar concentration of target substance
    at different locations (at a time) or at different times (at one location)
delta : np.array, pd.dataframe (same length as concentration)
    relative isotope ratio (delta-value) of target substance
validate_indices: boolean, default True
    flag to run index validation (i.e. removal of nan, infinity and zero values)
verbose : Boolean, The default is False.
   Set to True to get messages in the Console about the status of the run code.
**kwargs : dict
    dictionary of keyword arguments, e.g. for passing forward keywords to
    valid_indices()
Returns
results : dict
    results of fitting, including:
        * coefficients : array/list of length 2, where coefficients[0]
            is the slope of the linear fit, reflecting the kinetic
            fractionation factor (epsilon or alpha - 1) of the removal process
            and coefficients[1] is the intercept of the linear function
        * concentration: np.array with concentration data used for fitting - all
            samples where non-zero values are available for concentration and delta
        * delta: np.array with isotope data used for fitting - all samples
            where non-zero values are available for concentration and delta
Source code in mibiscreen/analysis/reduction/stable_isotope_regression.py
def Rayleigh_fractionation(concentration,
                           delta,
                           validate_indices = True,
                           verbose = False,
                           **kwargs,
                           ):
    """Performing Rayleigh fractionation analysis.

    Rayleigh fractionation is a common approach to characterize the removal
    of a substance from a finite pool using stable isotopes. It is based on the
    change in the isotopic composition of the pool caused by the different
    reaction kinetics of lighter and heavier isotopes.

    We follow the simplest approach, assuming that the substance removal follows
    first-order kinetics, where the rate coefficients for the lighter and heavier
    isotopes of the substance differ due to kinetic isotope fractionation effects.
    The isotopic composition of the remaining substance in the pool will change
    over time, leading to the so-called Rayleigh fractionation.

    The analysis is based on a linear regression of the delta-values against the
    log-transformed concentration data. The parameter of interest, the kinetic
    fractionation factor (epsilon or alpha - 1) of the removal process, is the slope
    of the linear trend line.

    A plot of the results with data and linear trend line can be generated with the
    method Rayleigh_fractionation_plot() [in the module visualize].

    Input
    -----
        concentration : np.array, pd.dataframe
            total molecular mass/molar concentration of target substance
            at different locations (at a time) or at different times (at one location)
        delta : np.array, pd.dataframe (same length as concentration)
            relative isotope ratio (delta-value) of target substance
        validate_indices: boolean, default True
            flag to run index validation (i.e. removal of nan, infinity and zero values)
        verbose : Boolean, The default is False.
           Set to True to get messages in the Console about the status of the run code.
        **kwargs : dict
            dictionary of keyword arguments, e.g. for passing forward keywords to
            valid_indices()

    Returns
    -------
        results : dict
            results of fitting, including:
                * coefficients : array/list of length 2, where coefficients[0]
                    is the slope of the linear fit, reflecting the kinetic
                    fractionation factor (epsilon or alpha - 1) of the removal process
                    and coefficients[1] is the intercept of the linear function
                * concentration: np.array with concentration data used for fitting - all
                    samples where non-zero values are available for concentration and delta
                * delta: np.array with isotope data used for fitting - all samples
                    where non-zero values are available for concentration and delta
    """
    ### ---------------------------------------------------------------------------
    ### check length of data arrays and remove non-valid values (NaN, inf & zero)
    if verbose:
        print('==============================================================')
        print(" Running function 'Rayleigh_fractionation()' on data")
        print('==============================================================')

    if validate_indices:
        data1, data2 = valid_indices(concentration,
                                 delta,
                                 remove_nan = True,
                                 remove_infinity = True,
                                 remove_zero = True,
                                 **kwargs,
                                 )
    else:
        data1, data2 = concentration,delta

    ### ---------------------------------------------------------------------------
    ### perform linear regression
    if np.any(data1<=0):
        raise ValueError("Concentration data must be strictly positive, but contains values <= 0.")

    coefficients = np.polyfit(np.log(data1), data2, 1)

    if verbose:
        print("The kinetic fractionation factor ('epsilon' or 'alpha-1') of")
        print("the removal process, being the slope of the linear fit, is ")
        print("identified with {:.2f}".format(coefficients[0]))
        print('______________________________________________________________')

    results = dict(
        concentration = data1,
        delta = data2,
        coefficients = coefficients,
        )

    return results
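
To illustrate the regression, here is a minimal sketch that is not part of mibiscreen. It assumes the commonly used approximation delta = delta_0 + epsilon * ln(C / C_0); the values of `epsilon`, `delta_0` and `c_0` are made up. The fit uses the same `np.polyfit` call as the source above.

```python
import numpy as np

# Synthetic data following delta = delta_0 + epsilon * ln(C / C_0)
epsilon = -2.5                  # assumed enrichment factor in per mil
delta_0, c_0 = -28.0, 1000.0    # assumed initial signature and concentration
concentration = np.array([1000.0, 600.0, 300.0, 100.0, 20.0])
delta = delta_0 + epsilon * np.log(concentration / c_0)

# Same regression as in the source: delta-values against log-concentration
coefficients = np.polyfit(np.log(concentration), delta, 1)
print("estimated epsilon = {:.2f}".format(coefficients[0]))
```

The slope `coefficients[0]` recovers the assumed `epsilon`, while `coefficients[1]` is the intercept of the trend line.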

extract_isotope_data(df, molecule, name_13C=names.name_13C, name_2H=names.name_2H)

Extracts isotope data from standardised input-dataframe.

Parameters
df : pd.dataframe
    numeric (observational) data
molecule : str
    name of contaminant molecule to extract isotope data for
name_13C : str, default 'delta_13C' (standard name)
    name of C13 isotope to extract data for
name_2H : str, default 'delta_2H' (standard name)
    name of deuterium isotope to extract data for
Returns
C_data : np.array
    numeric isotope data
H_data : np.array
    numeric isotope data

Source code in mibiscreen/analysis/reduction/stable_isotope_regression.py
def extract_isotope_data(df,
                         molecule,
                         name_13C = names.name_13C,
                         name_2H = names.name_2H,
                         ):
    """Extracts isotope data from standardised input-dataframe.

    Parameters
    ----------
    df : pd.dataframe
        numeric (observational) data
    molecule : str
        name of contaminant molecule to extract isotope data for
    name_13C : str, default 'delta_13C' (standard name)
        name of C13 isotope to extract data for
    name_2H : str, default 'delta_2H' (standard name)
        name of deuterium isotope to extract data for

    Returns
    -------
    C_data : np.array
        numeric isotope data
    H_data : np.array
        numeric isotope data

    """
    other_names_contaminants = _generate_dict_other_names(properties_contaminants)
    other_names_isotopes = _generate_dict_other_names(properties_isotopes)

    molecule_standard = other_names_contaminants.get(molecule.lower(), False)
    isotope_13C = other_names_isotopes.get(name_13C.lower(), False)
    isotope_2H = other_names_isotopes.get(name_2H.lower(), False)

    if molecule_standard is False:
        raise ValueError("Contaminant (name) unknown: {}".format(molecule))
    if isotope_13C is False:
        raise ValueError("Isotope (name) unknown: {}".format(name_13C))
    if isotope_2H is False:
        raise ValueError("Isotope (name) unknown: {}".format(name_2H))

    name_C = '{}-{}'.format(isotope_13C,molecule_standard)
    name_H = '{}-{}'.format(isotope_2H,molecule_standard)

    if name_C not in df.columns.to_list():
        raise ValueError("No isotope data available for : {}".format(name_C))
    if name_H not in df.columns.to_list():
        raise ValueError("No isotope data available for : {}".format(name_H))

    C_data = df[name_C].values
    H_data = df[name_H].values

    return C_data, H_data
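
The column lookup relies on names of the form `<isotope>-<molecule>`. The sketch below mimics that step on a hypothetical data frame without calling mibiscreen. It assumes that the standard names resolve to `delta_13C`, `delta_2H` and `benzene`; the column `sample_nr` and all numbers are made up.

```python
import pandas as pd

# Hypothetical standardized data frame; isotope columns follow the
# '<isotope>-<molecule>' pattern built in the source
df = pd.DataFrame({
    "sample_nr": ["s1", "s2", "s3"],
    "delta_13C-benzene": [-27.1, -26.4, -25.9],
    "delta_2H-benzene": [-80.0, -72.5, -66.0],
})

isotope_13C, isotope_2H, molecule = "delta_13C", "delta_2H", "benzene"
name_C = "{}-{}".format(isotope_13C, molecule)
name_H = "{}-{}".format(isotope_2H, molecule)

# Fail early when a column is missing, as the source does
for name in (name_C, name_H):
    if name not in df.columns:
        raise ValueError("No isotope data available for : {}".format(name))

C_data = df[name_C].values
H_data = df[name_H].values
print(C_data, H_data)
```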

valid_indices(data1, data2, remove_nan=True, remove_infinity=True, remove_zero=False, **kwargs)

Identifies valid indices in two equally long arrays and compresses both.

Values that can optionally be removed from the arrays are nan, infinity and zero values.

Parameters
data1 : np.array or pd.series
    numeric data
data2 : np.array or pd.series (same len/shape as data1)
    numeric data
remove_nan : boolean, default True
    flag to remove nan-values
remove_infinity : boolean, default True
    flag to remove infinity values
remove_zero : boolean, default False
    flag to remove zero values
**kwargs : dict
    dictionary of keyword arguments
Returns
data1 : np.array or pd.series
    numeric data of reduced length, containing only the data at valid indices
data2 : np.array or pd.series
    numeric data of reduced length, containing only the data at valid indices

Source code in mibiscreen/analysis/reduction/stable_isotope_regression.py
def valid_indices(data1,
                  data2,
                  remove_nan = True,
                  remove_infinity = True,
                  remove_zero = False,
                  **kwargs,
                  ):
    """Identifies valid indices in two equally long arrays and compresses both.

    Values that can optionally be removed from the arrays are nan, infinity and zero values.

    Parameters
    ----------
    data1 : np.array or pd.series
        numeric data
    data2 : np.array or pd.series (same len/shape as data1)
        numeric data
    remove_nan : boolean, default True
        flag to remove nan-values
    remove_infinity : boolean, default True
        flag to remove infinity values
    remove_zero : boolean, default False
        flag to remove zero values
    **kwargs : dict
        dictionary of keyword arguments

    Returns
    -------
    data1 : np.array or pd.series
        numeric data of reduced length, containing only the data at valid indices
    data2 : np.array or pd.series
        numeric data of reduced length, containing only the data at valid indices

    """
    if data1.shape != data2.shape:
        raise ValueError("Shape of provided data must be identical.")

    valid_indices = np.full(data1.shape, True, dtype=bool)

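    # each flag is applied to both arrays: remove_nan checks for nan only,
    # remove_infinity checks for infinity only
    if remove_nan:
        valid_indices *= ~np.isnan(data1) & ~np.isnan(data2)
    if remove_infinity:
        valid_indices *= ~np.isinf(data1) & ~np.isinf(data2)
    if remove_zero:
        valid_indices *= (data1 != 0) & (data2 != 0)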

    return data1[valid_indices],data2[valid_indices]
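
The pairwise masking idea can be sketched on its own. This example is illustrative only and does not call mibiscreen: an entry is kept only if neither array holds a nan, infinity or zero value at that position. The data values are made up.

```python
import numpy as np

data1 = np.array([1.0, np.nan, 3.0, 0.0, 5.0, 6.0])
data2 = np.array([10.0, 20.0, np.inf, 40.0, 50.0, 60.0])

mask = np.full(data1.shape, True, dtype=bool)
mask &= ~np.isnan(data1) & ~np.isnan(data2)   # drop nan in either array
mask &= ~np.isinf(data1) & ~np.isinf(data2)   # drop +/- infinity in either array
mask &= (data1 != 0) & (data2 != 0)           # optional: drop zeros

# Both arrays are compressed with the same mask, so pairs stay aligned
print(data1[mask], data2[mask])
```

Applying one shared mask to both arrays keeps the remaining samples paired, which is what the subsequent regressions rely on.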

transformation

Routines for performing ordination statistics on sample data.

@author: Alraune Zech, Jorrit Bakker

filter_values(data_frame, replace_NaN='remove', drop_rows=[], inplace=False, verbose=False)

Filtering values of dataframes for ordination to ensure all are numeric.

Ordination methods require all cells to be filled. This method checks the provided data frame for values that are missing/NaN or not numeric and handles missing/NaN values accordingly.

It removes the selected rows and then modifies the cells containing NULL values based on the input parameters.

Input
data_frame : pd.dataframe
    Tabular data containing variables to be evaluated with standard
    column names and rows of sample data.
replace_NaN : string or float, default "remove"
    Keyword specifying how to handle missing/NaN/non-numeric values, options:
        - remove: remove rows with missing values
        - zero: replace values with 0.0
        - average: replace the missing values with the average of the variable
                    (using all other available samples)
        - median: replace the missing values with the median of the variable
                                (using all other available samples)
        - float-value: replace all empty cells with that numeric value
drop_rows : List, default [] (empty list)
    List of rows that should be removed from dataframe.
inplace: bool, default False
    If False, return a copy. Otherwise, do operation in place.
verbose : Boolean, The default is False.
   Set to True to get messages in the Console about the status of the run code.
Output
data_filtered : pd.dataframe
    Tabular data containing filtered data.
Source code in mibiscreen/analysis/reduction/transformation.py
def filter_values(data_frame,
                  replace_NaN = 'remove',
                  drop_rows = [],
                  inplace = False,
                  verbose = False):
    """Filtering values of dataframes for ordination to ensure all are numeric.

    Ordination methods require all cells to be filled. This method checks the
    provided data frame for values that are missing/NaN or not numeric and handles
    missing/NaN values accordingly.

    It removes the selected rows and then modifies the cells containing NULL values
    based on the input parameters.

    Input
    -----
        data_frame : pd.dataframe
            Tabular data containing variables to be evaluated with standard
            column names and rows of sample data.
        replace_NaN : string or float, default "remove"
            Keyword specifying how to handle missing/NaN/non-numeric values, options:
                - remove: remove rows with missing values
                - zero: replace values with 0.0
                - average: replace the missing values with the average of the variable
                            (using all other available samples)
                - median: replace the missing values with the median of the variable
                                        (using all other available samples)
                - float-value: replace all empty cells with that numeric value
        drop_rows : List, default [] (empty list)
            List of rows that should be removed from dataframe.
        inplace: bool, default False
            If False, return a copy. Otherwise, do operation in place.
        verbose : Boolean, The default is False.
           Set to True to get messages in the Console about the status of the run code.

    Output
    ------
        data_filtered : pd.dataframe
            Tabular data containing filtered data.
    """
    data,cols= check_data_frame(data_frame,inplace = inplace)

    if verbose:
        print("==============================================================================")
        print('Perform filtering of values since ordination requires all values to be numeric.')

    if len(drop_rows)>0:
        data.drop(drop_rows, inplace = True)
        if verbose:
            print('The samples of rows {} have been removed'.format(drop_rows))

    # Identifying which rows and columns contain any amount of NULL cells and putting them in a list.
    NaN_rows = data[data.isna().any(axis=1)].index.tolist()
    NaN_cols = data.columns[data.isna().any()].tolist()

    # If there are any rows containing NULL cells, the NULL values will be filtered
    if len(NaN_rows)>0:
        if replace_NaN == 'remove':
            data.drop(NaN_rows, inplace = True)
            text = 'The sample row(s) have been removed since they contain NaN values: {}'.format(NaN_rows)
        elif replace_NaN == 'zero':
            set_NaN = 0.0
            data.fillna(set_NaN, inplace = True)
            text = 'The values of the empty cells have been set to zero (0.0)'
        elif isinstance(replace_NaN, (float, int)):
            set_NaN = float(replace_NaN)
            data.fillna(set_NaN, inplace = True)
            text = 'The values of the empty cells have been set to the value of {}'.format(set_NaN)
        elif replace_NaN == "average":
            for var in NaN_cols:
                data[var] = data[var].fillna(data[var].mean(skipna = True))
            # parenthesized string avoids the indentation ending up in the message
            text = ('The values of the empty cells have been replaced by the average of '
                    'the corresponding variables (using all other available samples).')
        elif replace_NaN == "median":
            for var in NaN_cols:
                data[var] = data[var].fillna(data[var].median(skipna = True))
            text = ('The values of the empty cells have been replaced by the median of '
                    'the corresponding variables (using all other available samples).')
        else:
            raise ValueError("Value of 'replace_NaN' unknown: {}".format(replace_NaN))
    else:
        text = 'No data to be filtered out.'

    if verbose:
        print(text)

    return data
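
The replacement options map onto standard pandas operations. The following minimal sketch does not call mibiscreen and only mirrors the calls seen in the source on made-up data with two missing cells.

```python
import numpy as np
import pandas as pd

# Made-up sample data with two missing cells
data = pd.DataFrame({
    "benzene": [10.0, np.nan, 30.0, 50.0],
    "toluene": [1.0, 2.0, np.nan, 6.0],
})

removed = data.dropna()                                 # replace_NaN = 'remove'
zeroed = data.fillna(0.0)                               # replace_NaN = 'zero'
averaged = data.fillna(data.mean(skipna=True))          # replace_NaN = 'average'
median_filled = data.fillna(data.median(skipna=True))   # replace_NaN = 'median'

print(removed.index.tolist())
print(averaged)
```

Removing rows shrinks the data set, whereas the fill options keep every sample at the cost of inserting values that were never measured.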

transform_values(data_frame, name_list='all', how='log_scale', log_scale_A=1, log_scale_B=1, inplace=False, verbose=False)

Transforming data of specified variables in dataframe.

Args:

data_frame: pandas.DataFrame
    dataframe with the measurements
name_list: string or list of strings, default 'all'
    list of quantities (column names) to perform transformation on
how: string, default 'log_scale'
    Type of transformation:
        * standardize
        * log_scale
        * center
log_scale_A : Integer or float, default 1
    Log transformation parameter A: log10(Ax+B).
log_scale_B : Integer or float, default 1
    Log transformation parameter B: log10(Ax+B).
inplace: bool, default False
    If False, return a copy. Otherwise, do operation in place.
verbose : Boolean, The default is False.
   Set to True to get messages in the Console about the status of the run code.

Returns:

data: pd.DataFrame
    dataframe with the measurements
Raises:

None (yet).

Example:

To be added.

Source code in mibiscreen/analysis/reduction/transformation.py
def transform_values(data_frame,
                     name_list = 'all',
                     how = 'log_scale',
                     log_scale_A = 1,
                     log_scale_B = 1,
                     inplace = False,
                     verbose = False,
                     ):
    """Transforming data of specified variables in dataframe.

    Args:
    -------
        data_frame: pandas.DataFrame
            dataframe with the measurements
        name_list: string or list of strings, default 'all'
            list of quantities (column names) to perform transformation on
        how: string, default 'log_scale'
            Type of transformation:
                * standardize
                * log_scale
                * center
        log_scale_A : Integer or float, default 1
            Log transformation parameter A: log10(Ax+B).
        log_scale_B : Integer or float, default 1
            Log transformation parameter B: log10(Ax+B).
        inplace: bool, default False
            If False, return a copy. Otherwise, do operation in place.
        verbose : Boolean, The default is False.
           Set to True to get messages in the Console about the status of the run code.

    Returns:
    -------
        data: pd.DataFrame
            dataframe with the measurements

    Raises:
    -------
    None (yet).

    Example:
    -------
    To be added.
    """
    if verbose:
        print('==============================================================')
        print(" Running function 'transform_values()' on data")
        print('==============================================================')

    data,cols= check_data_frame(data_frame,inplace = inplace)
    ### sorting out which columns in data to use for the transformation
    quantities, _ = determine_quantities(cols,
                                      name_list = name_list,
                                      verbose = verbose)

    for quantity in quantities:
        if how == 'log_scale':
            data[quantity] = np.log10(log_scale_A * data[quantity] + log_scale_B)
        elif how == 'center':
            data[quantity] =  data[quantity]-data[quantity].mean()
        elif how == 'standardize':
            data[quantity] = zscore(data[quantity].values)
        else:
            raise ValueError("Value of 'how' unknown: {}".format(how))

    return data
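
The three transformations can be written out directly with NumPy. This is a sketch on made-up values, not mibiscreen code. It assumes that `zscore` in the source is `scipy.stats.zscore` with its default `ddof=0`, which corresponds to dividing by the population standard deviation.

```python
import numpy as np

x = np.array([0.0, 9.0, 99.0, 999.0])     # made-up concentrations
A, B = 1, 1                               # log_scale_A, log_scale_B

log_scaled = np.log10(A * x + B)          # how = 'log_scale'
centered = x - x.mean()                   # how = 'center'
standardized = (x - x.mean()) / x.std()   # how = 'standardize' (ddof=0 assumed)

print(log_scaled)
```

With `B = 1`, zero concentrations map to zero instead of minus infinity, which is a common reason for adding an offset before taking the logarithm.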

sample

mibiscreen module for data analysis performed on each sample.

concentrations

Routines for calculating total concentrations and counts for samples.

@author: Alraune Zech

thresholds_for_intervention(data_frame, contaminant_group='BTEXIIN', include=False, verbose=False)

Function to evaluate intervention threshold exceedance.

Determines which contaminants exceed concentration thresholds set by
the Dutch government for intervention.
Input
data_frame: pd.DataFrame
    Contaminant concentrations in [ug/l], i.e. microgram per liter
contaminant_group: str
    Short name for group of contaminants to use
    default is 'BTEXIIN' (for benzene, toluene, ethylbenzene, xylene,
                          indene, indane and naphthalene)
include: bool, default False
    Whether to modify the DataFrame rather than creating a new one.
verbose: Boolean, default False
    verbose flag
Output
intervention: pd.DataFrame
    DataFrame of similar format as input data with well specification and
    three columns on intervention threshold exceedance analysis:
        - traffic light if well requires intervention
        - number of contaminants exceeding the intervention value
        - list of contaminants above the threshold of intervention
Source code in mibiscreen/analysis/sample/concentrations.py
def thresholds_for_intervention(
        data_frame,
        contaminant_group = "BTEXIIN",
        include = False,
        verbose = False,
        ):
    """Function to evaluate intervention threshold exceedance.

        Determines which contaminants exceed concentration thresholds set by
        the Dutch government for intervention.

    Input
    -----
        data_frame: pd.DataFrame
            Contaminant concentrations in [ug/l], i.e. microgram per liter
        contaminant_group: str
            Short name for group of contaminants to use
            default is 'BTEXIIN' (for benzene, toluene, ethylbenzene, xylene,
                                  indene, indane and naphthalene)
        include: bool, default False
            Whether to modify the DataFrame rather than creating a new one.
        verbose: Boolean, default False
            verbose flag

    Output
    ------
        intervention: pd.DataFrame
            DataFrame of similar format as input data with well specification and
            three columns on intervention threshold exceedance analysis:
                - traffic light if well requires intervention
                - number of contaminants exceeding the intervention value
                - list of contaminants above the threshold of intervention
    """
    if verbose:
        print('==============================================================')
        print(" Running function 'thresholds_for_intervention()' on data")
        print('==============================================================')

    ### check on correct data input format and extracting column names as list
    data,cols= check_data_frame(data_frame,inplace = include)

    ### sorting out which columns in data to evaluate
    quantities, _ = determine_quantities(cols,
                                      name_list = contaminant_group,
                                      verbose = verbose)

    if include:
        intervention = data
    else:
        intervention= extract_settings(data)

    nr_samples = data.shape[0] # number of samples
    traffic_nr = np.zeros(nr_samples,dtype=int)
    traffic_list = [[] for _ in range(nr_samples)]

    try:
        for cont in quantities:
            th_value = properties[cont]['thresholds_for_intervention_NL']
            traffic_nr += (data[cont].values > th_value)
            for i in range(nr_samples):
                if data[cont].values[i] > th_value:
                    traffic_list[i].append(cont)
    except TypeError:
        raise ValueError("Data not in standardized format. Run 'standardize()' first.")

    traffic_light = np.where(traffic_nr>0,"red","green")
    traffic_light[np.isnan(traffic_nr)] = 'y'
    intervention[names.name_intervention_traffic] = traffic_light
    intervention[names.name_intervention_number] = traffic_nr
    intervention[names.name_intervention_contaminants] = traffic_list

    if verbose:
        print("Evaluation of contaminant concentrations exceeding intervention values for {}:".format(
            contaminant_group))
        print('------------------------------------------------------------------------------------')
        print("Red light: Intervention values exceeded for {} out of {} locations".format(
            np.sum(traffic_nr >0),data.shape[0]))
        print("green light: Concentrations below intervention values at {} out of {} locations".format(
            np.sum(traffic_nr == 0),data.shape[0]))
        print("Yellow light: No decision possible at {} out of {} locations".format(
            np.sum(np.isnan(traffic_nr)),data.shape[0]))
        print('________________________________________________________________')

    return intervention
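
The counting logic can be illustrated without mibiscreen. In this sketch, the concentrations and the `thresholds` dictionary are hypothetical values chosen for illustration; they are not the Dutch intervention values stored in the package.

```python
import numpy as np
import pandas as pd

# Concentrations in ug/l and threshold values: all numbers are made up
data = pd.DataFrame({
    "benzene": [5.0, 40.0, 120.0],
    "toluene": [10.0, 2000.0, 50.0],
})
thresholds = {"benzene": 30.0, "toluene": 1000.0}   # hypothetical, not the Dutch values

nr_samples = data.shape[0]
traffic_nr = np.zeros(nr_samples, dtype=int)
traffic_list = [[] for _ in range(nr_samples)]

for cont, th_value in thresholds.items():
    exceeds = data[cont].values > th_value
    traffic_nr += exceeds                    # count exceedances per sample
    for i in np.flatnonzero(exceeds):
        traffic_list[i].append(cont)         # remember which contaminant exceeded

traffic_light = np.where(traffic_nr > 0, "red", "green")
print(traffic_light, traffic_nr, traffic_list)
```

A sample is flagged red as soon as one contaminant exceeds its threshold; the count and the list give the detail behind the flag.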

total_concentration(data_frame, name_list='all', name_column=False, verbose=False, include=False, **kwargs)

Calculate total concentration of given list of quantities.

Input
data_frame: pd.DataFrame
    Contaminant concentrations in [ug/l], i.e. microgram per liter
name_list: str or list, default is 'all'
    either short name for group of quantities to use, such as:
            - 'all' (all quantities given in data frame except settings)
            - 'BTEX' (for contaminant group: benzene, toluene, ethylbenzene, xylene)
            - 'BTEXIIN' (for contaminant group: benzene, toluene, ethylbenzene, xylene,
                          indene, indane and naphthalene)
    or list of strings with names of quantities to use
name_column: str or False, default is False
    optional name of the column holding the calculated total concentration
verbose: Boolean
    verbose flag (default False)
include: bool, default False
    whether to include calculated values to DataFrame
Output
tot_conc: pd.Series
    Total concentration of contaminants in [ug/l]
Source code in mibiscreen/analysis/sample/concentrations.py
def total_concentration(
        data_frame,
        name_list = "all",
        name_column = False,
        verbose = False,
        include = False,
        **kwargs,
        ):
    """Calculate total concentration of given list of quantities.

    Input
    -----
        data_frame: pd.DataFrame
            Contaminant concentrations in [ug/l], i.e. microgram per liter
        name_list: str or list, default is 'all'
            either short name for group of quantities to use, such as:
                    - 'all' (all quantities given in data frame except settings)
                    - 'BTEX' (for contaminant group: benzene, toluene, ethylbenzene, xylene)
                    - 'BTEXIIN' (for contaminant group: benzene, toluene, ethylbenzene, xylene,
                                  indene, indane and naphthalene)
            or list of strings with names of quantities to use
        name_column: str or False, default is False
            optional name of the column holding the calculated total concentration
        verbose: Boolean
            verbose flag (default False)
        include: bool, default False
            whether to include calculated values to DataFrame


    Output
    ------
        tot_conc: pd.Series
            Total concentration of contaminants in [ug/l]

    """
    if verbose:
        print('==============================================================')
        print(" Running function 'total_concentration()' on data")
        print('==============================================================')

    ### check on correct data input format and extracting column names as list
    data,cols= check_data_frame(data_frame,inplace = include)

    ### sorting out which columns in data to use for summation of concentrations
    quantities, _ = determine_quantities(cols,name_list = name_list, verbose = verbose)

    ### actually performing summation
    # try:
    tot_conc = data[quantities].sum(axis = 1)
    # except TypeError:
    #     raise ValueError("Data not in standardized format. Run 'standardize()' first.")

    if name_column is False:
        if isinstance(name_list, str):
            name_column = 'total concentration {}'.format(name_list)
        elif isinstance(name_list, list):
            name_column = 'total concentration selection'
    else:
        if not isinstance(name_column, str):
            raise ValueError("Keyword 'name_column' needs to be a string or False.")

    tot_conc.rename(name_column,inplace = True)
    if verbose:
        print('________________________________________________________________')
        print("{} in [ug/l] is:\n{}".format(name_column,tot_conc))
        print('--------------------------------------------------')

    ### adding series to data frame
    if include:
        data[name_column] = tot_conc

    return tot_conc
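
The summation itself is a row-wise sum over the selected columns. A minimal sketch on a made-up data frame, without calling mibiscreen; the column `sample_nr` and the selection in `quantities` are hypothetical.

```python
import pandas as pd

# Made-up concentrations in ug/l
data = pd.DataFrame({
    "sample_nr": ["s1", "s2"],
    "benzene": [10.0, 1.5],
    "toluene": [20.0, 2.5],
    "ethylbenzene": [5.0, 0.0],
})
quantities = ["benzene", "toluene", "ethylbenzene"]   # hypothetical selection

# Row-wise sum over the selected quantities, one value per sample
tot_conc = data[quantities].sum(axis=1)
tot_conc = tot_conc.rename("total concentration selection")
print(tot_conc)
```

Note that `DataFrame.sum` skips NaN values by default, so a sample with missing measurements still receives a total based on the available quantities.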

total_contaminant_concentration(data_frame, contaminant_group='BTEXIIN', include=False, verbose=False)

Function to calculate total concentration of contaminants.

Input
data_frame: pd.DataFrame
    Contaminant concentrations in [ug/l], i.e. microgram per liter
contaminant_group: str
    Short name for group of contaminants to use
    default is 'BTEXIIN' (for benzene, toluene, ethylbenzene, xylene,
                          indene, indane and naphthalene)
include: bool, default False
    Whether to modify the DataFrame rather than creating a new one.
verbose: Boolean
    verbose flag (default False)
Output
tot_conc: pd.Series
    Total concentration of contaminants in [ug/l]
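
The summation described above can be sketched with plain pandas. This is a minimal illustration, not the library's implementation: the column names, the sample values, and the output name `total_contaminants` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample table: concentrations in ug/l, one row per sample.
samples = pd.DataFrame(
    {
        "sample_nr": ["S1", "S2", "S3"],
        "benzene": [10.0, 0.0, 250.0],
        "toluene": [5.0, 2.5, 100.0],
        "naphthalene": [1.0, None, 40.0],
    }
)

# Columns assumed to belong to the contaminant group (illustrative choice).
contaminants = ["benzene", "toluene", "naphthalene"]

# Row-wise sum over the selected columns; pandas skips missing values by default.
tot_conc = samples[contaminants].sum(axis=1).rename("total_contaminants")
print(tot_conc.tolist())
```

Note that a missing measurement contributes nothing to the sum here, so a total can understate the true load when values are absent.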
Source code in mibiscreen/analysis/sample/concentrations.py
def total_contaminant_concentration(
        data_frame,
        contaminant_group = "BTEXIIN",
        include = False,
        verbose = False,
        ):
    """Function to calculate total concentration of contaminants.

    Input
    -----
        data_frame: pd.DataFrame
            Contaminant concentrations in [ug/l], i.e. microgram per liter
        contaminant_group: str
            Short name for group of contaminants to use
            default is 'BTEXIIN' (for benzene, toluene, ethylbenzene, xylene,
                                  indene, indane and naphthalene)
        include: bool, default False
            Whether to modify the DataFrame rather than creating a new one.
        verbose: Boolean
            verbose flag (default False)

    Output
    ------
        tot_conc: pd.Series
            Total concentration of contaminants in [ug/l]

    """
    if verbose:
        print('==============================================================')
        print(" Running function 'total_contaminant_concentration()' on data")
        print('==============================================================')

    tot_conc = total_concentration(
        data_frame,
        name_list = contaminant_group,
        name_column = names.name_total_contaminants,
        verbose = verbose,
        include = include,
        )

    return tot_conc

total_count(data_frame, name_list='all', threshold=0.0, verbose=False, include=False, **kwargs)

Calculate total number of quantities with concentration exceeding threshold value.

Input
data_frame: pd.DataFrame
    Contaminant concentrations in [ug/l], i.e. microgram per liter
name_list: str or list, default is 'all'
    either short name for group of quantities to use, such as:
            - 'all' (all quantities given in data frame except settings)
            - 'BTEX' (for benzene, toluene, ethylbenzene, xylene)
            - 'BTEXIIN' (for benzene, toluene, ethylbenzene, xylene,
                          indene, indane and naphthalene)
    or list of strings with names of quantities to use
threshold: float, default 0
    threshold concentration value in [ug/l] to test for exceedance
verbose: Boolean
    verbose flag (default False)
include: bool, default False
    whether to include calculated values to DataFrame
Output
tot_count: pd.Series
    Total number of quantities with concentration exceeding threshold value
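
The counting step can be sketched as a boolean comparison followed by a row-wise sum. The sample values and column names below are assumptions for illustration; only the comparison-and-sum pattern mirrors the source listing.

```python
import pandas as pd

# Hypothetical concentrations in ug/l, one row per sample.
samples = pd.DataFrame(
    {
        "benzene": [10.0, 0.0, 250.0],
        "toluene": [5.0, 2.5, 100.0],
        "ethylbenzene": [0.0, 0.0, 12.0],
    }
)

threshold = 5.0  # ug/l; the comparison is strict, so 5.0 itself does not count

# True where a concentration exceeds the threshold; summing booleans counts them.
exceeds = samples > threshold
tot_count = exceeds.sum(axis=1).rename("total count selection")
print(tot_count.tolist())
```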
Source code in mibiscreen/analysis/sample/concentrations.py
def total_count(
        data_frame,
        name_list = "all",
        threshold = 0.,
        verbose = False,
        include = False,
        **kwargs,
        ):
    """Calculate total number of quantities with concentration exceeding threshold value.

    Input
    -----
        data_frame: pd.DataFrame
            Contaminant concentrations in [ug/l], i.e. microgram per liter
        name_list: str or list, default is 'all'
            either short name for group of quantities to use, such as:
                    - 'all' (all quantities given in data frame except settings)
                    - 'BTEX' (for benzene, toluene, ethylbenzene, xylene)
                    - 'BTEXIIN' (for benzene, toluene, ethylbenzene, xylene,
                                  indene, indane and naphthalene)
            or list of strings with names of quantities to use
        threshold: float, default 0
            threshold concentration value in [ug/l] to test for exceedance
        verbose: Boolean
            verbose flag (default False)
        include: bool, default False
            whether to include calculated values to DataFrame

    Output
    ------
        tot_count: pd.Series
            Total number of quantities with concentration exceeding threshold value

    """
    if verbose:
        print('==============================================================')
        print(" Running function 'total_count()' on data")
        print('==============================================================')

    threshold = float(threshold)
    if threshold<0:
        raise ValueError("Threshold value '{}' not valid.".format(threshold))

    ### check on correct data input format and extracting column names as list
    data,cols= check_data_frame(data_frame,inplace = include)

    ### sorting out which column in data to use for summation of concentrations
    quantities, _ = determine_quantities(cols,name_list = name_list, verbose = verbose)

    ### actually performing count of values above threshold:
    try:
        total_count = (data[quantities]>threshold).sum(axis = 1)
    except TypeError:
        raise ValueError("Data not in standardized format. Run 'standardize()' first.")

    if isinstance(name_list, str):
        name_column = 'total count {}'.format(name_list)
    elif isinstance(name_list, list):
        name_column = 'total count selection'
    total_count.rename(name_column,inplace = True)

    if verbose:
        print('________________________________________________________________')
        # adjacent string literals avoid embedding the source indentation in the message
        print("Number of quantities out of {} exceeding "
              "concentration of {:.2f} ug/l :\n{}".format(len(quantities),threshold,total_count))
        print('--------------------------------------------------')

    if include:
        data[name_column] = total_count

    return total_count

properties

Properties for Natural Attenuation Screening.

File containing name specifications of quantities and parameters measured in groundwater samples useful for biodegradation and bioremediation analysis.

@author: A. Zech

screening_NA

Routines for calculating natural attenuation potential.

@author: Alraune Zech

electron_balance(data_frame, include=False, verbose=False, **kwargs)

Calculating electron balance between reductors and oxidators.

Determines ratio between the amount of electrons available and those needed for oxidation of the contaminants based on values determined by the routines “reductors()” and “oxidators()”.

A ratio higher than one indicates sufficient electrons available for degradation; values smaller than one indicate an insufficient supply of electrons to reduce the present amount of contaminants.

Input
data_frame: pd.DataFrame
    tabular data containing "total_reductors" and "total_oxidators"
        -total amount of electrons available for reduction [mmol e-/l]
        -total amount of electrons needed for oxidation [mmol e-/l]
include: bool, default False
    Whether to modify the DataFrame rather than creating a new one.
verbose: Boolean
    verbose flag (default False)
Output
e_bal : pd.Series
    Ratio of electron availability: electrons available for reduction
    divided by electrons needed for oxidation
Source code in mibiscreen/analysis/sample/screening_NA.py
def electron_balance(
        data_frame,
        include = False,
        verbose = False,
        **kwargs,
        ):
    """Calculating electron balance between reductors and oxidators.

    Determines ratio between the amount of electrons available and those
    needed for oxidation of the contaminants based on values determined by
    the routines "reductors()" and "oxidators()".

    A ratio higher than one indicates sufficient electrons available for degradation;
    values smaller than one indicate an insufficient supply of electrons to reduce
    the present amount of contaminants.

    Input
    -----
        data_frame: pd.DataFrame
            tabular data containing "total_reductors" and "total_oxidators"
                -total amount of electrons available for reduction [mmol e-/l]
                -total amount of electrons needed for oxidation [mmol e-/l]
        include: bool, default False
            Whether to modify the DataFrame rather than creating a new one.
        verbose: Boolean
            verbose flag (default False)

    Output
    ------
        e_bal : pd.Series
            Ratio of electron availability: electrons available for reduction
            divided by electrons needed for oxidation

    """
    if verbose:
        print('==============================================================')
        print(" Running function 'electron_balance()' on data")
        print('==============================================================')

    ### check on correct data input format and extracting column names as list
    data,cols= check_data_frame(data_frame,inplace = include)

    if names.name_total_reductors in cols:
        tot_reduct = data[names.name_total_reductors]
    else:
        tot_reduct = reductors(data,**kwargs)

    if names.name_total_oxidators in cols:
        tot_oxi = data[names.name_total_oxidators]
    else:
        tot_oxi = oxidators(data,**kwargs)

    e_bal = tot_reduct.div(tot_oxi, axis=0)
    e_bal.name = names.name_e_balance

    if include:
        data[names.name_e_balance] = e_bal

    if verbose:
        print("Electron balance e_red/e_cont is:\n{}".format(e_bal))
        print('---------------------------------')

    return e_bal
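
The ratio computed by `Series.div` behaves in a specific way for missing and zero values, which matters when interpreting the balance. The following sketch uses made-up totals (assumed values, in mmol e-/l) to show the three cases.

```python
import numpy as np
import pandas as pd

index = ["S1", "S2", "S3", "S4"]

# Hypothetical totals in mmol e-/l; S3 lacks electron acceptor data,
# S4 has no contaminant demand at all.
tot_reduct = pd.Series([2.0, 0.5, np.nan, 1.0], index=index)
tot_oxi = pd.Series([1.0, 1.0, 1.0, 0.0], index=index)

# Element-wise ratio aligned on the sample index.
e_bal = tot_reduct.div(tot_oxi, axis=0)

print(e_bal["S1"])  # ratio above one: supply exceeds demand
print(e_bal["S2"])  # ratio below one: supply is insufficient
print(e_bal["S3"])  # NaN propagates when an input is missing
print(e_bal["S4"])  # division by zero yields inf rather than an error
```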

oxidators(data_frame, contaminant_group='BTEXIIN', include=False, verbose=False, **kwargs)

Calculate the amount of electron oxidators [mmol e-/l].

Calculates the amount of electrons needed for oxidation of the contaminants. It transforms the concentrations of contaminants to molar concentrations using molecular masses in [mg/mmol] and further identifies the number of electrons from the chemical reactions using stoichiometric ratios.

alternatively: based on nitrogen and phosphate availability

Input
data_frame: pd.DataFrame
    Contaminant concentrations in [ug/l], i.e. microgram per liter
contaminant_group: str
    Short name for group of contaminants to use
    default is 'BTEXIIN' (for benzene, toluene, ethylbenzene, xylene,
                          indene, indane and naphthalene)
include: bool, default False
    Whether to modify the DataFrame rather than creating a new one.
verbose: Boolean
    verbose flag (default False)
Output
tot_oxi: pd.Series
    Total amount of electron oxidators in [mmol e-/l]
Source code in mibiscreen/analysis/sample/screening_NA.py
def oxidators(
    data_frame,
    contaminant_group = "BTEXIIN",
    include = False,
    verbose = False,
    **kwargs,
    ):
    """Calculate the amount of electron oxidators [mmol e-/l].

    Calculates the amount of electrons needed for oxidation of the contaminants.
    It transforms the concentrations of contaminants to molar concentrations using
    molecular masses in [mg/mmol] and further identifies the number of electrons from
    the chemical reactions using stoichiometric ratios.

    alternatively: based on nitrogen and phosphate availability

    Input
    -----
        data_frame: pd.DataFrame
            Contaminant concentrations in [ug/l], i.e. microgram per liter
        contaminant_group: str
            Short name for group of contaminants to use
            default is 'BTEXIIN' (for benzene, toluene, ethylbenzene, xylene,
                                  indene, indane and naphthalene)
        include: bool, default False
            Whether to modify the DataFrame rather than creating a new one.
        verbose: Boolean
            verbose flag (default False)

    Output
    ------
        tot_oxi: pd.Series
            Total amount of electron oxidators in [mmol e-/l]
    """
    if verbose:
        print('==============================================================')
        print(" Running function 'oxidators()' on data")
        print('==============================================================')

    ### check on correct data input format and extracting column names as list
    data,cols= check_data_frame(data_frame,inplace = include)

    ### sorting out which columns in data to use for summation of electrons available
    quantities,_ = determine_quantities(cols,name_list = contaminant_group, verbose = verbose)

    try:
        tot_oxi = 0.
        for cont in quantities:
            cm_cont = data[cont]* 0.001/properties[cont]['molecular_mass'] # molar concentration in mmol/l
            tot_oxi += cm_cont *  properties[cont]['factor_stoichiometry']
    except TypeError:
        raise ValueError("Data not in standardized format. Run 'standardize()' first.")

    tot_oxi.rename(names.name_total_oxidators,inplace = True)
    if verbose:
        print("Total amount of oxidators per well in [mmol e-/l] is:\n{}".format(tot_oxi))
        print('-----------------------------------------------------')

    ### adding series to data frame
    if include:
        data[names.name_total_oxidators] = tot_oxi

    return tot_oxi
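
As a worked example of the unit conversion in the loop above: a concentration in ug/l is scaled to mg/l, divided by the molecular mass in mg/mmol to give mmol/l, and multiplied by the number of electrons released per molecule. The property values below are assumptions for illustration (molar masses and electrons released on complete oxidation to CO2) and are not read from the package's `properties` table.

```python
# Assumed property values, for illustration only:
# molecular mass in mg/mmol, electrons released per molecule on full oxidation.
properties = {
    "benzene": {"molecular_mass": 78.11, "factor_stoichiometry": 30},
    "toluene": {"molecular_mass": 92.14, "factor_stoichiometry": 36},
}

# Hypothetical single sample, concentrations in ug/l.
sample = {"benzene": 781.1, "toluene": 921.4}

tot_oxi = 0.0
for cont, conc_ug_l in sample.items():
    conc_mg_l = conc_ug_l * 0.001                              # ug/l -> mg/l
    cm_cont = conc_mg_l / properties[cont]["molecular_mass"]   # mmol/l
    tot_oxi += cm_cont * properties[cont]["factor_stoichiometry"]  # mmol e-/l

# Each contaminant is 0.01 mmol/l here, so the demand is 0.01*30 + 0.01*36.
print(round(tot_oxi, 6))
```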

reductors(data_frame, ea_group='ONS', include=False, verbose=False, **kwargs)

Calculate the amount of electron reductors [mmol e-/l].

It determines the amount of electrons available from electron acceptors (default: mobile dissolved oxygen, nitrate, and sulfate).

It relates concentrations to electrons using the stoichiometry from the chemical reactions producing electrons and the molecular mass values for the quantities in [mg/mmol].

Input
data_frame: pd.DataFrame
    concentration values of electron acceptors in [mg/l]
ea_group: str
    Short name for group of electron acceptors to use
    default is 'ONS' (for oxygen, nitrate, and sulfate)
include: bool, default False
    Whether to modify the DataFrame rather than creating a new one.
verbose: Boolean
    verbose flag (default False)
Output
tot_reduct: pd.Series
    Total amount of electrons available for reduction in [mmol e-/l]
Source code in mibiscreen/analysis/sample/screening_NA.py
def reductors(
    data_frame,
    ea_group = 'ONS',
    include = False,
    verbose = False,
    **kwargs,
    ):
    """Calculate the amount of electron reductors [mmol e-/l].

    It determines the amount of electrons available from electron acceptors
    (default: mobile dissolved oxygen, nitrate, and sulfate).

    It relates concentrations to electrons using the stoichiometry from the
    chemical reactions producing electrons and the molecular mass values
    for the quantities in [mg/mmol].

    Input
    -----
        data_frame: pd.DataFrame
            concentration values of electron acceptors in [mg/l]
        ea_group: str
            Short name for group of electron acceptors to use
            default is 'ONS' (for oxygen, nitrate, and sulfate)
        include: bool, default False
            Whether to modify the DataFrame rather than creating a new one.
        verbose: Boolean
            verbose flag (default False)

    Output
    ------
        tot_reduct: pd.Series
            Total amount of electrons available for reduction in [mmol e-/l]
    """
    if verbose:
        print('==============================================================')
        print(" Running function 'reductors()' on data")
        print('==============================================================')

    ### check on correct data input format and extracting column names as list
    data,cols= check_data_frame(data_frame,inplace = include)

    ### sorting out which columns in data to use for summation of electrons available
    quantities,_ = determine_quantities(cols,name_list = ea_group, verbose = verbose)

    ### actually performing summation
    try:
        tot_reduct = 0.
        for ea in quantities:
            tot_reduct += properties[ea]['factor_stoichiometry']* data[ea]/properties[ea]['molecular_mass']
    except TypeError:
        raise ValueError("Data not in standardized format. Run 'standardize()' first.")

    tot_reduct.rename(names.name_total_reductors,inplace = True)
    if verbose:
        print("Total amount of electron reductors per well in [mmol e-/l] is:\n{}".format(tot_reduct))
        print('----------------------------------------------------------------')

    ### adding series to data frame
    if include:
        data[names.name_total_reductors] = tot_reduct

    return tot_reduct
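
The summation above divides each electron acceptor concentration (mg/l) by its molecular mass (mg/mmol) and multiplies by the electrons accepted per molecule. A minimal worked example follows; the molecular masses and electron counts are assumed values chosen for illustration (oxygen reduced to water, nitrate to nitrogen gas, sulfate to sulfide) and may differ from the package's `properties` table.

```python
# Assumed property values, for illustration only:
# molecular mass in mg/mmol, electrons accepted per molecule.
properties = {
    "oxygen": {"molecular_mass": 32.00, "factor_stoichiometry": 4},
    "nitrate": {"molecular_mass": 62.00, "factor_stoichiometry": 5},
    "sulfate": {"molecular_mass": 96.06, "factor_stoichiometry": 8},
}

# Hypothetical single sample, concentrations in mg/l.
sample = {"oxygen": 3.2, "nitrate": 6.2, "sulfate": 9.606}

tot_reduct = 0.0
for ea, conc_mg_l in sample.items():
    molar = conc_mg_l / properties[ea]["molecular_mass"]        # mmol/l
    tot_reduct += properties[ea]["factor_stoichiometry"] * molar  # mmol e-/l

# Each acceptor is 0.1 mmol/l here, so the supply is 0.1 * (4 + 5 + 8).
print(round(tot_reduct, 6))
```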

sample_NA_screening(data_frame, ea_group='ONS', contaminant_group='BTEXIIN', include=False, verbose=False, **kwargs)

Screening of NA potential for each sample in one go.

Determines for each sample the availability of electrons for (bio)degradation of contaminants from concentrations of (mobile dissolved) electron acceptors (default: oxygen, nitrate, sulfate). It puts them into relation to electrons needed for degradation using contaminant concentrations. The resulting electron balance is linked to a color flag/traffic light indicating status:

    - green: amount of electrons available for (bio-)degradation is higher than amount needed for degrading present contaminant mass/concentration --> potential for natural attenuation
    - yellow: electron balance unknown because data is not sufficient --> more information needed
    - red: amount of electrons available for (bio-)degradation is lower than amount needed for degrading present contaminant mass/concentration --> limited potential for natural attenuation

Sufficient supply of electrons is a prerequisite for biodegradation and thus the potential of natural attenuation (NA) as remediation strategy.

Input
data_frame: pd.DataFrame
    Concentration values of
        - electron acceptors in [mg/l]
        - contaminants in [ug/l]
ea_group: str, default 'ONS'
    Short name for group of electron acceptors to use
    'ONS' stands for oxygen, nitrate, and sulfate
contaminant_group: str, default 'BTEXIIN'
    Short name for group of contaminants to use
    'BTEXIIN' stands for benzene, toluene, ethylbenzene, xylene,
                           indene, indane and naphthalene
include: bool, default False
    Whether to modify the DataFrame rather than creating a new one.
verbose: Boolean, default False
    verbose flag
Output
na_data: pd.DataFrame
    Tabular data with all quantities of NA screening listed per sample
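
The whole screening chain can be sketched end to end on a toy table. This is an illustration under stated assumptions, not the package's code: it uses a single electron acceptor and a single contaminant, assumed property constants, hypothetical column names, and the full word "yellow" for the undecided flag.

```python
import numpy as np
import pandas as pd

# Hypothetical data: oxygen in mg/l, benzene in ug/l.
data = pd.DataFrame(
    {
        "sample_nr": ["S1", "S2", "S3"],
        "oxygen": [3.2, 0.32, np.nan],
        "benzene": [781.1, 7811.0, 78.11],
    }
)

# Assumed constants: molecular mass in mg/mmol and electrons per molecule.
O2_MASS, O2_ELECTRONS = 32.0, 4
BENZENE_MASS, BENZENE_ELECTRONS = 78.11, 30

# Electron supply and demand in mmol e-/l, then their ratio.
tot_reduct = (O2_ELECTRONS * data["oxygen"] / O2_MASS).rename("total_reductors")
tot_oxi = (BENZENE_ELECTRONS * data["benzene"] * 0.001 / BENZENE_MASS).rename("total_oxidators")
e_bal = tot_reduct.div(tot_oxi).rename("e_balance")

# Traffic light: missing balance first, then insufficient supply, else green.
traffic = pd.Series(
    np.select([e_bal.isna(), e_bal < 1], ["yellow", "red"], default="green"),
    index=data.index,
    name="na_traffic_light",
)

# Result table: identifying column plus the four derived quantities.
na_data = data[["sample_nr"]].copy()
for add in [tot_reduct, tot_oxi, e_bal, traffic]:
    na_data.insert(na_data.shape[1], add.name, add)

print(na_data)
```

Appending the derived series to a copy of the identifying columns corresponds to the `include = False` branch in the source listing, where the input table itself is left unchanged.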
Source code in mibiscreen/analysis/sample/screening_NA.py
def sample_NA_screening(
    data_frame,
    ea_group = 'ONS',
    contaminant_group = "BTEXIIN",
    include = False,
    verbose = False,
    **kwargs,
    ):
    """Screening of NA potential for each sample in one go.

    Determines for each sample, the availability of electrons for (bio)degradation of
    contaminants from concentrations of (mobile dissolved) electron acceptors
    (default: oxygen, nitrate, sulfate). It puts them into relation to electrons
    needed for degradation using contaminant concentrations. Resulting electron
    balance is linked to a color flag/traffic light indicating status:
        - green: amount of electrons available for (bio-)degradation is higher than
                 amount needed for degrading present contaminant mass/concentration
            --> potential for natural attenuation
        - yellow: electron balance unknown because data is not sufficient
            --> more information needed
        - red: amount of electrons available for (bio-)degradation is lower than
                 amount needed for degrading present contaminant mass/concentration
            --> limited potential for natural attenuation

    Sufficient supply of electrons is a prerequisite for biodegradation and thus the
    potential of natural attenuation (NA) as remediation strategy.

    Input
    -----
        data_frame: pd.DataFrame
            Concentration values of
                - electron acceptors in [mg/l]
                - contaminants in [ug/l]
        ea_group: str, default 'ONS'
            Short name for group of electron acceptors to use
            'ONS' stands for oxygen, nitrate, and sulfate
        contaminant_group: str, default 'BTEXIIN'
            Short name for group of contaminants to use
            'BTEXIIN' stands for benzene, toluene, ethylbenzene, xylene,
                                   indene, indane and naphthalene
        include: bool, default False
            Whether to modify the DataFrame rather than creating a new one.
        verbose: Boolean, default False
            verbose flag

    Output
    ------
        na_data: pd.DataFrame
            Tabular data with all quantities of NA screening listed per sample
    """
    if verbose:
        print('==============================================================')
        print(" Running function 'sample_NA_screening()' on data")
        print('==============================================================')

    ### check on correct data input format and extracting column names as list
    data,_= check_data_frame(data_frame,inplace = include)

    tot_reduct = reductors(data,
                           ea_group = ea_group,
                           include = include,
                           verbose = verbose)
    tot_oxi = oxidators(data,
                        contaminant_group = contaminant_group,
                        include = include,
                        verbose = verbose)
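    # forward the group selections: with include = False the totals are not stored
    # in 'data' and are recomputed inside these routines via **kwargs
    e_bal = electron_balance(data,
                             ea_group = ea_group,
                             contaminant_group = contaminant_group,
                             include = include,
                             verbose = verbose)
    na_traffic = sample_NA_traffic(data,
                            ea_group = ea_group,
                            contaminant_group = contaminant_group,
                            include = include,
                            verbose = verbose)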

    list_new_quantities = [tot_reduct,tot_oxi,e_bal,na_traffic]

    if include is False:
       na_data = extract_settings(data)

       for add in list_new_quantities:
           na_data.insert(na_data.shape[1], add.name, add)
    else:
        na_data = data

    return na_data

sample_NA_traffic(data_frame, include=False, verbose=False, **kwargs)

Evaluating availability of electrons for biodegradation by interpreting the electron balance.

Function builds on ‘electron_balance()’, based on electron availability calculated from concentrations of contaminant and electron acceptors.

Sufficient supply of electrons is a prerequisite for biodegradation and thus the potential of natural attenuation (NA) as remediation strategy. The function interprets the electron balance, giving it a traffic light of:

    - green: amount of electrons available for (bio-)degradation is higher than amount needed for degrading present contaminant mass/concentration --> potential for natural attenuation
    - yellow: electron balance unknown because data is not sufficient --> more information needed
    - red: amount of electrons available for (bio-)degradation is lower than amount needed for degrading present contaminant mass/concentration --> limited potential for natural attenuation

Input
data_frame: pd.DataFrame
    Ratio of electron availability
include: bool, default False
    Whether to modify the DataFrame rather than creating a new one.
verbose: Boolean
    verbose flag (default False)
Output
traffic : pd.Series
    Traffic light (decision) based on ratio of electron availability
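
The classification can be sketched with NumPy alone. One detail is worth seeing in isolation: `np.where` with the labels "red" and "green" produces a fixed-width string array of five characters, so a longer label would be cut off on assignment. The source listing stores the short code "y" for undecided samples; the sketch below instead widens the array so that the full word fits, which is an assumption of this example and not the package's behavior.

```python
import numpy as np

# Hypothetical electron balance values; NaN marks insufficient data.
e_bal = np.array([2.5, 0.4, np.nan, 1.0])

# NaN < 1 evaluates to False, so missing values are first labeled "green".
traffic = np.where(e_bal < 1, "red", "green")

# Widen the fixed-width string dtype before writing the six-letter label,
# otherwise "yellow" would be truncated to five characters.
traffic_wide = traffic.astype("<U6")
traffic_wide[np.isnan(e_bal)] = "yellow"

print(traffic_wide.tolist())
```

A balance of exactly 1.0 is classified as green, because only values strictly below one are flagged red.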
Source code in mibiscreen/analysis/sample/screening_NA.py
def sample_NA_traffic(
        data_frame,
        include = False,
        verbose = False,
        **kwargs,
        ):
    """Evaluating availability of electrons for biodegradation by interpreting electron balance.

    Function builds on 'electron_balance()', based on electron availability
    calculated from concentrations of contaminant and electron acceptors.

    Sufficient supply of electrons is a prerequisite for biodegradation and thus the
    potential of natural attenuation (NA) as remediation strategy. The function
    interprets the electron balance, giving it a traffic light of:
        - green: amount of electrons available for (bio-)degradation is higher than
                 amount needed for degrading present contaminant mass/concentration
            --> potential for natural attenuation
        - yellow: electron balance unknown because data is not sufficient
            --> more information needed
        - red: amount of electrons available for (bio-)degradation is lower than
                 amount needed for degrading present contaminant mass/concentration
            --> limited potential for natural attenuation

    Input
    -----
        data_frame: pd.DataFrame
            Ratio of electron availability
        include: bool, default False
            Whether to modify the DataFrame rather than creating a new one.
        verbose: Boolean
            verbose flag (default False)

    Output
    ------
        traffic : pd.Series
            Traffic light (decision) based on ratio of electron availability

    """
    if verbose:
        print('==============================================================')
        print(" Running function 'sample_NA_traffic()' on data")
        print('==============================================================')

    ### check on correct data input format and extracting column names as list
    data,cols= check_data_frame(data_frame,inplace = include)

    if names.name_e_balance in cols:
        e_balance = data[names.name_e_balance]
    else:
        e_balance = electron_balance(data,**kwargs)

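    e_bal = e_balance.values
    # NaN compares as False in 'e_bal<1', so missing values are labeled "green" first
    traffic = np.where(e_bal<1,"red","green")
    # relabel missing values; "y" is the short code used here for yellow
    # (the array holds 5-character strings, so "yellow" would be truncated)
    traffic[np.isnan(e_bal)] = "y"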

    NA_traffic = pd.Series(name =names.name_na_traffic_light,
                           data = traffic,
                           index = e_balance.index
                           )

    if include:
        data[names.name_na_traffic_light] = NA_traffic

    if verbose:
        print("Evaluation if natural attenuation (NA) is ongoing:")#" for {}".format(contaminant_group))
        print('--------------------------------------------------')
        print("Red light: Reduction is limited at {} out of {} locations".format(
            np.sum(traffic == "red"),len(e_bal)))
        print("Green light: Reduction is not limited at {} out of {} locations".format(
            np.sum(traffic == "green"),len(e_bal)))
        print("Yellow light: No decision possible at {} out of {} locations".format(
            np.sum(np.isnan(e_bal)),len(e_bal)))
        print('________________________________________________________________')

    return NA_traffic