mibiscreen.data API reference

mibiscreen module for data handling.

check_data

Functions for data handling and standardization.

@author: Alraune Zech

check_columns(data_frame, standardize=False, reduce=False, verbose=True)

Function checking names of columns of data frame.

Looks at the column names and links them to standard names. Optionally, it renames identified columns to the standard names of the model and drops columns that could not be identified. Both options modify the provided data frame in place.

Args:

data_frame: pd.DataFrame
    dataframe with the measurements
standardize: Boolean, default False
    Whether to standardize identified column names
reduce: Boolean, default False
    Whether to reduce data to known quantities
verbose: Boolean, default True
    verbosity flag

Returns:

tuple of three lists:
        list with identified quantities in data (but not standardized names)
        list with unknown quantities in data (not in list of standardized names)
        list with standard names of identified quantities
Raises:

None (yet).

Example:

Todo’s:
  • complete list of potential contaminants, environmental factors
  • add name check for metabolites?
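
The duplicate check that `check_columns` runs after name standardization can be illustrated with a small, self-contained sketch. The helper `find_duplicates` is a hypothetical stand-in for the internal `_check_duplicates_in_list` (its implementation is not shown on this page), and the column names are made up for illustration.

```python
from collections import defaultdict


def find_duplicates(items):
    """Map each value occurring more than once to the list of its positions."""
    positions = defaultdict(list)
    for index, item in enumerate(items):
        positions[item].append(index)
    return {item: idx for item, idx in positions.items() if len(idx) > 1}


# Hypothetical lookup result: two raw columns map to the same standard name.
column_names_known = ["Benzene", "benzeen", "pH"]
column_names_standard = ["benzene", "benzene", "pH"]

duplicates = find_duplicates(column_names_standard)
for name, indices in duplicates.items():
    raw_names = [column_names_known[i] for i in indices]
    print(f"'{name}' has been provided as: {raw_names}")
```

A non-empty result means that the same quantity appears under several column names, which should be resolved before renaming the columns.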

Source code in mibiscreen/data/check_data.py
def check_columns(data_frame,
                  standardize = False,
                  reduce = False,
                  verbose = True):
    """Function checking names of columns of data frame.

    Function that looks at the column names and links it to standard names.
    Optionally, it renames identified column names to the standard names of the model.

    Args:
    -------
        data_frame: pd.DataFrame
            dataframe with the measurements
        standardize: Boolean, default False
            Whether to standardize identified column names
        reduce: Boolean, default False
            Whether to reduce data to known quantities
        verbose: Boolean, default True
            verbosity flag

    Returns:
    -------
        tuple of three lists:
                list with identified quantities in data (but not standardized names)
                list with unknown quantities in data (not in list of standardized names)
                list with standard names of identified quantities

    Raises:
    -------
    None (yet).

    Example:
    -------
    Todo's:
        - complete list of potential contaminants, environmental factors
        - add name check for metabolites?
    """
    if verbose:
        print('==============================================================')
        print(" Running function 'check_columns()' on data")
        print('==============================================================')

    data,cols= check_data_frame(data_frame,
                                sample_name_to_index = False,
                                inplace = True)

    results = standard_names(cols,
                             standardize = False,
                             reduce = False,
                             verbose = False,
                             )

    column_names_standard = results[0]
    column_names_known = results[1]
    column_names_unknown = results[2]
    column_names_transform = results[3]

    ### check on duplicates in column names after name standardization
    ### (i.e. in case the same quantity has been provided with different non-standard names)
    duplicates_indices = _check_duplicates_in_list(column_names_standard)

    if duplicates_indices:
        # at least one standard name is shared by several columns
        print("WARNING: Duplicates found in list of standard names:")
        for name, indices in duplicates_indices.items():
            print(f"'{name}' has been provided as:")
            for i in indices:
                # print the original column name at index i
                print(f" - '{column_names_known[i]}' (index {i})")
        print('Remove or rename duplicate quantities in data.')
        print('________________________________________________________________')

    if standardize:
        data.columns = [column_names_transform.get(x, x) for x in data.columns]

    if reduce:
        data.drop(labels = column_names_unknown,axis = 1,inplace=True)

    if verbose:
        print("{} quantities identified in provided data.".format(len(column_names_known)))
        print("List of names with standard names:")
        print('----------------------------------')
        for i,name in enumerate(column_names_known):
            print(name," --> ",column_names_standard[i])
        print('---------------------------------------------------------')
        if standardize:
            print("Identified column names have been standardized")
        else:
            print("\nRenaming can be done by setting keyword 'standardize' to True.\n")
        print('________________________________________________________________')
        print("{} quantities have not been identified in provided data:".format(len(column_names_unknown)))
        print('---------------------------------------------------------')
        for i,name in enumerate(column_names_unknown):
            print(name)
        print('---------------------------------------------------------')
        if reduce:
            print("Not identified quantities have been removed from data frame")
        else:
            print("\nReduction to known quantities can be done by setting keyword 'reduce' to True.\n")
        print('================================================================')

    return (column_names_known,column_names_unknown,column_names_standard)

check_data_frame(data_frame, sample_name_to_index=False, inplace=False)

Checks that the data has the correct format.

Tests if provided data is a pandas data frame and provides column names. Optionally it sets the sample name as index.

Input
data_frame: pd.DataFrame
    quantities for data analysis given per sample
sample_name_to_index:  Boolean, default False
    Whether to set the sample name to the index of the DataFrame
inplace: Boolean, default False
    Whether to modify the DataFrame rather than creating a new one.
Output
data: pd.DataFrame
    the given dataframe (a copy unless inplace is True),
    with the sample name as index if sample_name_to_index is True
cols: list
    List of column names
Source code in mibiscreen/data/check_data.py
def check_data_frame(data_frame,
                     sample_name_to_index = False,
                     inplace = False,
                     ):
    """Checks that the data has the correct format.

    Tests if provided data is a pandas data frame and provides column names.
    Optionally it sets the sample name as index.

    Input
    -----
        data_frame: pd.DataFrame
            quantities for data analysis given per sample
        sample_name_to_index:  Boolean, default False
            Whether to set the sample name to the index of the DataFrame
        inplace: Boolean, default False
            Whether to modify the DataFrame rather than creating a new one.

    Output
    ------
        data: pd.DataFrame
            the given dataframe (a copy unless inplace is True),
            with the sample name as index if sample_name_to_index is True
        cols: list
            List of column names
    """
    if not isinstance(data_frame, pd.DataFrame):
        # only DataFrames pass this check; adjacent literals avoid stray whitespace in the message
        raise ValueError("Data has to be a pandas DataFrame "
                         "but is given as type {}".format(type(data_frame)))

    if inplace is False:
        data = data_frame.copy()
    else:
        data = data_frame

    if sample_name_to_index:
        if names.name_sample not in data.columns:
            print("Warning: No sample name provided for making index. Consider standardizing data first")
        else:
            data.set_index(names.name_sample,inplace = True)

    if isinstance(data, pd.Series):
        cols = [data.name]
    else:
        cols = data.columns.to_list()

    return data, cols

check_units(data, verbose=True)

Function to check the units of the measurements.

Args:

data: pandas.DataFrame
    dataframe with the measurements where first row contains
    the units or a dataframe with only the column names and units
verbose: Boolean
    verbose statement (default True)

Returns:

col_check_list: list
    quantities whose units need checking/correction
col_not_checked: list
    quantities not identified, thus not checked on units

Raises:

ValueError: if data is not a pandas DataFrame or if none of the known units is found in the first row.
Example:
To be added.
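
Until an official example is available, the comparison logic of `check_units` can be sketched in plain Python. The two lookup tables are hypothetical and heavily shortened; the real ones are built from mibiscreen's property dictionaries, and the quantity names and unit spellings below are assumptions made for illustration.

```python
# Hypothetical lookup tables (the real ones come from mibiscreen's properties).
STANDARD_UNIT = {"benzene": "ug/l", "sulfate": "mg/l", "pH": ""}
ACCEPTED_SPELLINGS = {"ug/l": {"ug/l", "microg/l"}, "mg/l": {"mg/l"}}


def check_unit_row(unit_row):
    """Return (quantities with unexpected unit, quantities not identified)."""
    to_check, not_checked = [], []
    for quantity, given_unit in unit_row.items():
        # names of the form '<isotope>-<molecule>' are matched on the part before '-'
        key = quantity if quantity in STANDARD_UNIT else quantity.split("-")[0]
        if key not in STANDARD_UNIT:
            not_checked.append(quantity)
            continue
        standard = STANDARD_UNIT[key]
        # unit-less quantities (empty standard unit) are not compared
        if standard and str(given_unit).lower() not in ACCEPTED_SPELLINGS[standard]:
            to_check.append(quantity)
    return to_check, not_checked


unit_row = {"benzene": "mg/L", "sulfate": "mg/L", "pH": " ", "my_tracer": "ug/L"}
col_check_list, col_not_checked = check_unit_row(unit_row)
print(col_check_list, col_not_checked)
```

As in the function itself, the comparison is case-insensitive, and quantities without a known standard unit are reported separately instead of raising an error.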
Source code in mibiscreen/data/check_data.py
def check_units(data,
                verbose = True):
    """Function to check the units of the measurements.

    Args:
    -------
        data: pandas.DataFrame
            dataframe with the measurements where first row contains
            the units or a dataframe with only the column names and units
        verbose: Boolean
            verbose statement (default True)

    Returns:
    -------
        col_check_list: list
            quantities whose units need checking/correction
        col_not_checked: list
            quantities not identified, thus not checked on units

    Raises:
    -------
        ValueError: if data is not a pandas DataFrame or if none of
            the known units is found in the first row.

    Example:
    -------
        To be added.
    """
    if verbose:
        print('================================================================')
        print(" Running function 'check_units()' on data")
        print('================================================================')

    if not isinstance(data, pd.DataFrame):
        raise ValueError("Provided data is not a data frame.")
    elif data.shape[0]>1:
        units = data.drop(labels = np.arange(1,data.shape[0]))
    else:
        units = data.copy()

    ### testing if provided data frame contains any units (at all)
    units_in_data = set(map(lambda x: str(x).lower(), units.iloc[0,:].values))
    test_unit = False
    for u in all_units:
        if u in units_in_data:
            test_unit = True
            break
    if not test_unit:
        # adjacent string literals avoid the stray whitespace of backslash continuations
        raise ValueError("Error: The second line in the dataframe is supposed "
                         "to specify the units. No units were detected in this "
                         "line, check https://mibipret.github.io/mibiscreen/ Data "
                         "documentation.")

    # standardize column names (as it might not have happened for data yet)
    check_columns(units,
                  standardize = True,
                  verbose = False)
    col_check_list= []
    col_not_checked  = []

    properties_all = {**properties_sample_settings,
                      **properties_geochemicals,
                      **properties_contaminants,
                      **properties_metabolites,
                      **properties_isotopes,
                      **contaminants_analysis,
    }

    ### run through all quantity columns and check their units
    for quantity in units.columns:
        ### identify standard unit for each column identified:
        if quantity in properties_all.keys():# test on standard column name
            standard_unit = properties_all[quantity]['standard_unit']
        elif quantity.split('-')[0] in properties_all.keys(): # test on isotopes
            standard_unit = properties_all[quantity.split('-')[0]]['standard_unit']
        else:
            col_not_checked.append(quantity)
            continue

        ### check on given unit (also considering alternative unit names)
        if standard_unit != names.unit_less:
            other_names_unit = properties_units[standard_unit]['other_names']
            if str(units[quantity][0]).lower() not in other_names_unit:
                col_check_list.append(quantity)
                if verbose:
                    print("Warning: Check unit of {}!\n Given in {}, but must be in {}."
                              .format(quantity,units[quantity][0],standard_unit))

    if verbose:
        print('________________________________________________________________')
        if len(col_check_list) == 0:
            print(" All identified quantities given in requested units.")
        else:
            print(" Quantities not in requested units:")
            print(*col_check_list, sep='\n')
        if len(col_not_checked) != 0:
            print('________________________________________________________________')
            print(" Quantities not identified (and thus not checked on units):")
            print(*col_not_checked, sep='\n')
        print('================================================================')

    return col_check_list,col_not_checked

check_values(data_frame, dl_factor=None, to_replace_list=['-', '--', '', ' ', '  ', np.inf, -np.inf], to_replace_value=np.nan, inplace=False, verbose=True)

Function that checks on values and cleans DataFrame.

Cleaning includes
  • fixing decimal commas
  • converting strings to floats
  • replacing empty strings with NaN
  • replacing ‘inf’ (infinity) with NaN
  • and optionally replacing detection limits.

Args:

data_frame: pandas.DataFrame
    dataframe with the measurements
dl_factor: float or None, default None
    factor between 0 and 1; if set, values with '<' are replaced by (value * dl_factor),
    otherwise they are replaced by NaN
to_replace_list: list, default ["-",'--','',' ','  ',np.inf,-np.inf]
    list of values to replace before cleaning
to_replace_value: float or np.nan, default: np.nan
    value to replace the entries of to_replace_list with
inplace: Boolean, default False
    if True modifies df in place, else returns a cleaned copy
verbose: Boolean
    verbose statement (default True)

Returns:

cleaned: pandas.DataFrame
    Cleaned dataframe without units
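
The per-value cleaning rules listed above can be sketched without pandas. This is a simplified stand-alone version of the nested `clean_value` helper from the source listing, with `dl_factor` passed as an explicit argument (an adaptation made for this example) and without the bookkeeping flags.

```python
import math


def clean_value(val, dl_factor=None):
    """Convert one raw cell to a float where possible."""
    if not isinstance(val, str):
        return val  # numbers pass through unchanged
    text = val.strip().replace(",", ".")  # decimal comma -> decimal point
    if text.startswith("<"):  # value reported as below the detection limit
        try:
            limit = float(text[1:])
        except ValueError:
            return text  # e.g. '<LOD' cannot be converted
        return math.nan if dl_factor is None else limit * dl_factor
    try:
        return float(text)
    except ValueError:
        return text  # left as a string for manual inspection


raw_column = ["0,35", " 12 ", "<0.08", "n.a."]
print([clean_value(v, dl_factor=0.5) for v in raw_column])
```

With `dl_factor=None` the entry `"<0.08"` becomes NaN instead of being scaled, which matches the default behaviour described for `dl_factor`.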
Source code in mibiscreen/data/check_data.py
def check_values(data_frame,
                 dl_factor=None,
                 to_replace_list = ["-",'--','',' ','  ',np.inf,-np.inf],
                 to_replace_value = np.nan,
                 inplace=False,
                 verbose = True,
                 ):
    """Function that checks on values and cleans DataFrame.

    Cleaning includes:
            - fixing decimal commas
            - converting strings to floats
            - replacing empty strings with NaN
            - replacing 'inf' (infinity) with NaN
            - and optionally replacing detection limits.

    Args:
    -------
        data_frame: pandas.DataFrame
            dataframe with the measurements
        dl_factor: float or None, default None
            factor between 0 and 1; if set, values with '<' are replaced
            by (value * dl_factor), otherwise they are replaced by NaN
        to_replace_list: list, default ["-",'--','',' ','  ',np.inf,-np.inf]
            list of values to replace before cleaning
        to_replace_value: float or np.nan, default: np.nan
            value to replace the entries of to_replace_list with
        inplace: Boolean, default False
            if True modifies df in place, else returns a cleaned copy
        verbose: Boolean
            verbose statement (default True)

    Returns:
    -------
        cleaned: pandas.DataFrame
            Cleaned dataframe without units

    """
    if verbose:
        print('================================================================')
        print(" Running function 'check_values()' on data")
        print('================================================================')

    df,cols= check_data_frame(data_frame, inplace = inplace)

    if dl_factor is not None and (dl_factor>1 or dl_factor<0):
        raise ValueError("Factor 'dl_factor' needs to be between 0 and 1 or None")

    ## testing if provided data frame contains first row with units
    for u in df.iloc[0].to_list():
        if u in all_units:
            print("WARNING: First row identified as units, has been removed for value check")
            print('________________________________________________________________')
            df.drop(labels = 0,inplace = True)
            break

    for sign in to_replace_list:
        df.iloc[:,:] = df.iloc[:,:].replace(to_replace=sign, value=to_replace_value)

    detection_limit_columns = []
    failed_conversion_columns = []

    def clean_quantity(quantity_series,quantity_name):
        """Cleans a pandas Series of quantity values.

        Cleaning is by parsing strings, handling detection limits,
        and converting to floats where possible.

        Parameters:
        ----------
        quantity_series : pandas.Series
            The input Series containing quantity values that may include strings, numbers, or
            special characters (e.g., '<' for values below detection limits).

        quantity_name : str
            Column name of quantity being cleaned.

        Returns:
        -------
        cleaned_series : pandas.Series
            The cleaned version of the input series, where:
                - Strings like '<0.05' are converted to floats and optionally adjusted by `dl_factor`
                - Commas are replaced with periods for decimal conversion
                - Non-convertible values are returned as-is
                - Flags are set for values that are below detection limits or failed conversion

        """
        flags = {'found_dl': False, 'found_failed': False}
        def clean_value(val):
            """Detecting and cleaning special values.

            Parameters
            ----------
            val : str
                values within a pandas.series

            Returns
            -------
            val: str or float
                either original string or float when conversion to number possible.

            """
            if isinstance(val, str):
                val = val.strip().replace(',', '.')
                if val.startswith('<'):
                    try:
                        number = float(val[1:])
                        flags['found_dl'] = True
                        if dl_factor is not None:
                            return number * dl_factor
                        else:
                            return np.nan
                    except ValueError:
                        flags['found_failed'] = True
                        return val
                try:
                    return float(val)
                except ValueError:
                    flags['found_failed'] = True
                    return val
            return val

        cleaned = quantity_series.apply(clean_value)

        if flags['found_dl']:
            detection_limit_columns.append(quantity_name)
        if flags['found_failed']:
            failed_conversion_columns.append(quantity_name)

        return cleaned

    for quantity in cols:
        df[quantity] = clean_quantity(df[quantity], quantity)

    if verbose:
        if detection_limit_columns:
            print("Quantities with values given by detection limits:")
            print('-----------------------------------------------------------')
            for name in detection_limit_columns:
                print(name)
            print('-----------------------------------------------------------')
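            if dl_factor is None:
                # without a factor, values at detection limit are set to NaN
                print("All values with detection limit, given by '<X', have been replaced by NaN.")
            else:
                print("All values with detection limit, given by '<X', have been replaced")
                print(" by the values multiplied with the specified factor: {} * X ".format(dl_factor))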
            print('================================================================')

        if failed_conversion_columns:
            print("Quantities where not all values could be transformed to numerical (int/float):")
            print('-----------------------------------------------------------')
            for name in failed_conversion_columns:
                print(name)
            print('================================================================')
    return df

standard_names(name_list, standardize=True, reduce=False, verbose=False)

Function transforming list of names to standard names.

Function that looks at the names (of e.g. environmental variables, contaminants, metabolites, isotopes, etc) and provides the corresponding standard names.


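Args:

name_list: string or list of strings
    names of quantities to be transformed to standard
standardize: Boolean, default True
    Whether to return a single list of standardized names
    (if False, the lists of standard, known and unknown names are returned separately)
reduce: Boolean, default False
    Whether to drop unknown names from the returned list
    (only has an effect if standardize is True)
verbose: Boolean, default False
    verbosity flag

Returns:

if standardize is True:
        list with standard names of identified quantities; unless reduce is True,
        the unknown names are appended unchanged at the end of the list
if standardize is False:
    tuple: three lists containing names of
        list with standard names of identified quantities
        list with identified quantities in data (but not standardized names)
        list with unknown quantities in data (not in list of standardized names)
       + one dictionary mapping identified quantities to their
           standard names for fast name transformation

Raises:

ValueError: if an entry in a provided list of names is not a string.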

Example:

Todo’s:
  • complete list of potential contaminants, environmental factors
  • add name check for metabolites?
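
The lookup performed by `standard_names` can be sketched as follows. The alias tables are hypothetical, shortened stand-ins for the dictionaries that the library generates with `_generate_dict_other_names`; all names in them are assumptions chosen for illustration.

```python
# Hypothetical alias tables (lower-case alias -> standard name).
ALIASES = {"benzene": "benzene", "benzeen": "benzene", "ph": "pH", "so4": "sulfate"}
ISOTOPE_ALIASES = {"delta_13c": "delta_13C", "d13c": "delta_13C"}
CONTAMINANT_ALIASES = {"benzene": "benzene", "benzeen": "benzene"}


def to_standard(name_list):
    """Return (standard names, known names, unknown names, name mapping)."""
    standard, known, unknown, transform = [], [], [], {}
    for raw in name_list:
        prefix, _, rest = raw.partition("-")
        isotope = ISOTOPE_ALIASES.get(prefix.lower())
        if isotope is not None:
            # isotope columns are named '<isotope>-<molecule>'
            molecule = CONTAMINANT_ALIASES.get(rest.lower())
            result = None if molecule is None else f"{isotope}-{molecule}"
        else:
            result = ALIASES.get(raw.lower())  # case-insensitive lookup
        if result is None:
            unknown.append(raw)
        else:
            known.append(raw)
            standard.append(result)
            transform[raw] = result
    return standard, known, unknown, transform


standard, known, unknown, transform = to_standard(
    ["Benzeen", "PH", "d13C-Benzene", "colour"])
print(transform)
print(unknown)
```

The returned dictionary plays the role of the name mapping that `check_columns` uses to rename data frame columns in one pass.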

Source code in mibiscreen/data/check_data.py
def standard_names(name_list,
                   standardize = True,
                   reduce = False,
                   verbose = False,
                   ):
    """Function transforming list of names to standard names.

    Function that looks at the names (of e.g. environmental variables, contaminants,
    metabolites, isotopes, etc) and provides the corresponding standard names.

    Args:
    -------
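        name_list: string or list of strings
            names of quantities to be transformed to standard
        standardize: Boolean, default True
            Whether to return a single list of standardized names
            (if False, the lists of standard, known and unknown names
            are returned separately)
        reduce: Boolean, default False
            Whether to drop unknown names from the returned list
            (only has an effect if standardize is True)
        verbose: Boolean, default False
            verbosity flag

    Returns:
    -------
        if standardize is True:
            list with standard names of identified quantities; unless reduce
            is True, the unknown names are appended unchanged at the end
        if standardize is False:
            tuple: three lists containing names of
                list with standard names of identified quantities
                list with identified quantities in data (but not standardized names)
                list with unknown quantities in data (not in list of standardized names)
               + one dictionary mapping identified quantities to their
                   standard names for fast name transformation

    Raises:
    -------
        ValueError: if an entry in a provided list of names is not a string.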

    Example:
    -------
    Todo's:
        - complete list of potential contaminants, environmental factors
        - add name check for metabolites?
    """
    names_standard = []
    names_known = []
    names_unknown = []
    names_transform = {}

    if isinstance(name_list, str):
        name_list = [name_list]
    elif isinstance(name_list, list):
        for name in name_list:
            if not isinstance(name, str):
                raise ValueError("Entry in provided list of names is not a string:", name)

    properties_all = {**properties_sample_settings,
                      **properties_geochemicals,
                      **properties_contaminants,
                      **properties_metabolites,
                      **properties_isotopes,
                      **contaminants_analysis,
    }
    dict_names=_generate_dict_other_names(properties_all)
    other_names_contaminants = _generate_dict_other_names(properties_contaminants)
    other_names_isotopes = _generate_dict_other_names(properties_isotopes)

    for x in name_list:
        y = dict_names.get(x, False)
        x_isotope = x.split('-')[0]
        y_isotopes = other_names_isotopes.get(x_isotope.lower(), False)
        if y_isotopes is not False:
            x_molecule = x.removeprefix(x_isotope+'-')
            y_molecule = other_names_contaminants.get(x_molecule.lower(), False)
            if y_molecule is False:
                names_unknown.append(x)
            else:
                y = y_isotopes+'-'+y_molecule
                names_known.append(x)
                names_standard.append(y)
                names_transform[x] = y
        else:
            y = dict_names.get(x.lower(), False)
            if y is False:
                names_unknown.append(x)
            else:
                names_known.append(x)
                names_standard.append(y)
                names_transform[x] = y

    if verbose:
        print('================================================================')
        print(" Running function 'standard_names()'")
        print('================================================================')
        print("{} of {} quantities identified in name list.".format(len(names_known),len(name_list)))
        print("List of names with standard names:")
        print('----------------------------------')
        for i,name in enumerate(names_known):
            print(name," --> ",names_standard[i])
        print('----------------------------------')
        if standardize:
            print("Identified column names have been standardized")
        else:
            print("\nRenaming can be done by setting keyword 'standardize' to True.\n")
        print('________________________________________________________________')
        print("{} quantities have not been identified in provided data:".format(len(names_unknown)))
        print("You can suggest missing quantities that could be added to the library here: <https://github.com/MiBiPreT/mibiscreen/issues/new/choose>")
        print('---------------------------------------------------------')
        for i,name in enumerate(names_unknown):
            print(name)
        print('---------------------------------------------------------')
        if reduce:
            print("Not identified quantities have been removed from data frame")
        else:
            print("\nReduction to known quantities can be done by setting keyword 'reduce' to True.\n")
        print('================================================================')

    if standardize:
        if reduce:
            return names_standard
        else:
            return names_standard + names_unknown
    else:
        return (names_standard, names_known, names_unknown, names_transform)

standardize(data_frame, reduce=True, store_csv=False, verbose=True, **kwargs)

Function providing condensed data frame with standardized names.

Checks the column names and renames them to the standard names of the model, condenses the data to the identified columns, checks the units, and cleans the values of the data frame.

Args:

data_frame: pandas.DataFrame
    dataframe with the measurements, where the first row contains the units
reduce: Boolean, default True
    whether to reduce data to known quantities (default True),
    otherwise full dataframe with renamed columns (for those identifiable) is returned
store_csv: string or False, default False
    file path for saving the dataframe in standard format to a csv-file;
    the file is only written if all identified quantities are given in the requested units
verbose: Boolean, default True
    verbose statement
**kwargs: Optional keyword arguments.
    dl_factor (float, optional): scaling factor for value given at detection limit.
       Default is None, so detection limit values are replaced by nan.

Returns:

data_numeric, units: pandas.DataFrames
    Tabular data with standardized column names, values in numerics etc
    and table with units for standardized column names
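
The input layout that `standardize` works on (a header with column names, one row with units, then one row per sample) can be illustrated with the standard library alone. The table below is made up, and its column names are assumptions; the split mirrors how the function separates the unit row from the measurement rows before cleaning the values.

```python
import csv
import io

# Made-up measurements; the column names are illustrative only.
RAW = '''sample,benzene,sulfate
,ug/L,mg/L
2000-001,"<0,5",23
2000-002,12,0
'''

rows = list(csv.DictReader(io.StringIO(RAW)))
units = rows[0]          # first data row: one unit per column
measurements = rows[1:]  # remaining rows: one sample per row

print(units)
print(measurements[0])
```

Values such as `"<0,5"` stay strings at this stage; converting them to numbers is the task of the value check.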
Source code in mibiscreen/data/check_data.py
def standardize(data_frame,
                reduce = True,
                store_csv = False,
                verbose=True,
                **kwargs,
                ):
    """Function providing condensed data frame with standardized names.

    Checks the column names and renames them to the standard names of
    the model, condenses the data to the identified columns, checks the
    units, and cleans the values of the data frame.

    Args:
    -------
        data_frame: pandas.DataFrame
            dataframe with the measurements, where the first row contains the units
        reduce: Boolean, default True
            whether to reduce data to known quantities (default True),
            otherwise full dataframe with renamed columns (for those identifiable) is returned
        store_csv: string or False, default False
            file path for saving the dataframe in standard format to a csv-file;
            the file is only written if all identified quantities are given
            in the requested units
        verbose: Boolean, default True
            verbose statement
        **kwargs: Optional keyword arguments.
            dl_factor (float, optional): scaling factor for value given at detection limit.
               Default is None, so detection limit values are replaced by nan.

    Returns:
    -------
        data_numeric, units: pandas.DataFrames
            Tabular data with standardized column names, values in numerics etc
            and table with units for standardized column names

    """
    if verbose:
        print('================================================================')
        print(" Running function 'standardize()' on data")
        print('================================================================')
        print(' Function performing check of data including:')
        print('  * check of column names and standardizing them.')
        print('  * check of units and outlining which to adapt.')
        print('  * check and cleaning of values')

    data,cols= check_data_frame(data_frame,
                                sample_name_to_index = False,
                                inplace = False)

    # general column check & standardize column names
    check_columns(data,
                  standardize = True,
                  reduce = reduce,
                  verbose = verbose)

    # general unit check
    units = data.drop(labels = np.arange(1,data.shape[0]))
    col_check_list,_ = check_units(units,
                                 verbose = verbose)

    # transform data to numeric values
    data_numeric = check_values(data.drop(labels = 0),
                                inplace = False,
                                verbose = verbose,
                                **kwargs,
                                )

    # store standard data to file
    if store_csv:
        if len(col_check_list) != 0:
            print('________________________________________________________________')
            print("Data not saved because not all identified \n quantities are given in requested units.")
        else:
            try:
                data.to_csv(store_csv,index=False)
                if verbose:
                    print('________________________________________________________________')
                    print("Save standardized dataframe to file:\n", store_csv)
            except OSError:
                print("WARNING: data could not be saved. Check provided file path and name: {}".format(store_csv))
    if verbose:
        print('================================================================')

    return data_numeric, units

example_data

mibiscreen module for example data.

example_data

Example data.

Measurements of quantities and parameters in groundwater samples used for biodegradation and bioremediation analysis.

@author: Alraune Zech

example_data(data_type='all', with_units=False)

Function providing test data for mibiscreen data analysis.

Args:

data_type: string, default 'all'
    Type of data to return:
        -- "all": all types of data available
        -- "set_env_cont": well setting, environmental and contaminants data
        -- "setting": well setting data only
        -- "environment": environmental data
        -- "contaminants": data on contaminants
        -- "metabolites": data on metabolites
        -- "isotopes": data on isotopes
        -- "hydro": data on hydrogeological conditions
with_units: Boolean, default False
    flag to provide first row with units
    if False (no units), values in columns will be numerical
    if True (with units), values in columns will be objects

Returns:

pandas.DataFrame: Tabular data with standard column names

Raises:

None
Example:
To be added!
Source code in mibiscreen/data/example_data/example_data.py
def example_data(data_type = 'all',
                 with_units = False,
                 ):
    """Function providing test data for mibiscreen data analysis.

    Args:
    -------
        data_type: string, default 'all'
            Type of data to return:
                -- "all": all types of data available
                -- "set_env_cont": well setting, environmental and contaminants data
                -- "setting": well setting data only
                -- "environment": environmental data
                -- "contaminants": data on contaminants
                -- "metabolites": data on metabolites
                -- "isotopes": data on isotopes
                -- "hydro": data on hydrogeological conditions
        with_units: Boolean, default False
            flag to provide first row with units
            if False (no units), values in columns will be numerical
            if True (with units), values in columns will be objects

    Returns:
    -------
        pandas.DataFrame: Tabular data with standard column names

    Raises:
    -------
        ValueError: If `data_type` is not one of the available types

    Example:
    -------
        To be added!
    """
    mgl = names.unit_mgperl
    microgl = names.unit_microgperl

    setting = [names.name_sample,names.name_observation_well,names.name_sample_depth]
    setting_units = [' ',' ',names.unit_meter]
    setting_s01 = ['2000-001', 'B-MLS1-3-12', -12.]
    setting_s02 = ['2000-002', 'B-MLS1-5-15', -15.5]
    setting_s03 = ['2000-003', 'B-MLS1-6-17', -17.]
    setting_s04 = ['2000-004', 'B-MLS1-7-19', -19.]

    environment = [names.name_pH,
                   names.name_EC,
                   names.name_redox,
                   names.name_oxygen,
                   names.name_nitrate,
                   names.name_nitrite,
                   names.name_sulfate,
                   names.name_ammonium,
                   names.name_sulfide,
                   names.name_methane,
                   names.name_iron2,
                   names.name_manganese2,
                   names.name_phosphate]

    environment_units = [' ',names.unit_microsimpercm,names.unit_millivolt,
                         mgl,mgl,mgl,mgl,mgl,mgl,mgl,mgl,mgl,mgl]
    environment_s01 = [7.23, 322., -208.,0.3,122.,0.58, 23., 5., 0., 748., 3., 1.,1.6]
    environment_s02 = [7.67, 405., -231.,0.9,5.,0.0, 0., 6., 0., 2022., 1., 0.,0]
    environment_s03 = [7.75, 223., -252.,0.1,3.,0.03, 1., 13., 0., 200., 1., 0.,0.8]
    environment_s04 = [7.53, 58., -317.,0., 180.,1., 9., 15., 6., 122., 0., 0.,0.1]

    contaminants = [names.name_benzene,
                    names.name_toluene,
                    names.name_ethylbenzene,
                    names.name_pm_xylene,
                    names.name_o_xylene,
                    names.name_indane,
                    names.name_indene,
                    names.name_naphthalene]

    contaminants_units = [microgl,microgl,microgl,microgl,
                          microgl,microgl,microgl,microgl]
    contaminants_s01 = [263., 2., 269., 14., 51., 1254., 41., 2207.]
    contaminants_s02 = [179., 7., 1690., 751., 253., 1352., 15., 5410.]
    contaminants_s03 = [853., 17., 1286., 528., 214., 1031., 31., 3879.]
    contaminants_s04 = [1254., 10., 1202., 79., 61., 814., 59., 1970.]

    metabolites = [names.name_phenol,
                   names.name_cinnamic_acid,
                   names.name_benzoic_acid]

    metabolites_units = [microgl,microgl,microgl]
    metabolites_s01 = [0.2, 0.4, 1.4]
    metabolites_s02 = [np.nan, 0.1, 0.]
#    metabolites_s03 = [0., 11.4, 5.4]
    metabolites_s03 = [11.4,np.inf, 5.4]
    metabolites_s04 = [0.3, 0.5, 0.7]

    # isotopes = ['delta_13C-benzene','delta_2H-benzene']
    isotopes = [names.name_13C+'-'+names.name_benzene,
                names.name_2H+'-'+names.name_benzene,
                ]

    isotopes_units = [names.unit_permil,names.unit_permil]
    isotopes_s01 = [-26.1,-106.]
    isotopes_s02 = [-25.8,-110.]
    isotopes_s03 = [-24.1,-118.]
    isotopes_s04 = [-24.1,-117.]

    if  data_type == 'setting':
        data = pd.DataFrame([setting_units,setting_s01,setting_s02,setting_s03,
                             setting_s04],columns = setting)

    elif  data_type == 'environment':
        units = setting_units+environment_units
        columns = setting+environment
        sample_01 = setting_s01+environment_s01
        sample_02 = setting_s02+environment_s02
        sample_03 = setting_s03+environment_s03
        sample_04 = setting_s04+environment_s04

        data = pd.DataFrame([units,sample_01,sample_02,sample_03,sample_04],
                            columns = columns)

    elif  data_type == 'contaminants':
        units = setting_units+contaminants_units
        columns = setting+contaminants
        sample_01 = setting_s01+contaminants_s01
        sample_02 = setting_s02+contaminants_s02
        sample_03 = setting_s03+contaminants_s03
        sample_04 = setting_s04+contaminants_s04

        data = pd.DataFrame([units,sample_01,sample_02,sample_03,sample_04],
                            columns = columns)

    elif  data_type == 'metabolites':

        units = setting_units+metabolites_units
        columns = setting+metabolites
        sample_01 = setting_s01+metabolites_s01
        sample_02 = setting_s02+metabolites_s02
        sample_03 = setting_s03+metabolites_s03
        sample_04 = setting_s04+metabolites_s04

        data = pd.DataFrame([units,sample_01,sample_02,sample_03,sample_04],
                            columns = columns)

    elif  data_type == 'isotopes':

        units = setting_units+isotopes_units
        columns = setting+isotopes
        sample_01 = setting_s01+isotopes_s01
        sample_02 = setting_s02+isotopes_s02
        sample_03 = setting_s03+isotopes_s03
        sample_04 = setting_s04+isotopes_s04

        data = pd.DataFrame([units,sample_01,sample_02,sample_03,sample_04],
                            columns = columns)

    elif data_type == "set_env_cont":

        units = setting_units+environment_units+contaminants_units
        columns = setting+environment+contaminants
        sample_01 = setting_s01+environment_s01+contaminants_s01
        sample_02 = setting_s02+environment_s02+contaminants_s02
        sample_03 = setting_s03+environment_s03+contaminants_s03
        sample_04 = setting_s04+environment_s04+contaminants_s04

        data = pd.DataFrame([units,sample_01,sample_02,sample_03,sample_04],
                            columns = columns)

    elif data_type == 'all':
        units = setting_units+environment_units+contaminants_units+metabolites_units + isotopes_units
        columns = setting+environment+contaminants+metabolites + isotopes
        sample_01 = setting_s01+environment_s01+contaminants_s01+metabolites_s01+isotopes_s01
        sample_02 = setting_s02+environment_s02+contaminants_s02+metabolites_s02+isotopes_s02
        sample_03 = setting_s03+environment_s03+contaminants_s03+metabolites_s03+isotopes_s03
        sample_04 = setting_s04+environment_s04+contaminants_s04+metabolites_s04+isotopes_s04

        data = pd.DataFrame([units,sample_01,sample_02,sample_03,sample_04],
                            columns = columns)

    else:
        raise ValueError("Specified data type '{}' not available".format(data_type))

    if not with_units:
        data.drop(0,inplace = True)
        for quantity in data.columns[2:]:
            data[quantity] = pd.to_numeric(data[quantity])

    return data

load_data

Functions for data I/O handling.

@author: Alraune Zech

load_csv(file_path=None, verbose=False, store_provenance=False)

Function to load data from csv file.


file_path: str
    Name of the path to the file
verbose: Boolean
    verbose flag
store_provenance: Boolean
    To add!

data: pd.DataFrame
    Tabular data
units: pd.DataFrame
    Tabular data on units

ValueError: If `file_path` is not specified
OSError: If `file_path` is not a valid file location
Example:

This function can be called with the file path of the example data as argument using:

>>> from mibiscreen.data import load_csv
>>> load_csv("example_data.csv")
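The source listed below first reads the file with the default comma separator and reads it again with `;` if the first cell of the second data row still contains a semicolon. A minimal sketch of that fallback, using the standard-library `csv` module instead of pandas, could look as follows. The helper name `read_with_delimiter_fallback` and the sample text are hypothetical.

```python
import csv
import io

def read_with_delimiter_fallback(text):
    """Parse CSV text with ',' first; re-parse with ';' if a cell still contains ';'."""
    rows = list(csv.reader(io.StringIO(text)))
    # rows[0] is the header, rows[1] the units row, rows[2] the first sample.
    if len(rows) > 2 and rows[2] and ";" in rows[2][0]:
        rows = list(csv.reader(io.StringIO(text), delimiter=";"))
    header, units, samples = rows[0], rows[1], rows[2:]
    return header, units, samples

# Hypothetical semicolon-separated content with a units row below the header.
text = "sample;depth;benzene\n ;m;ug/L\ns-001;-12.0;263.0\n"
header, units, samples = read_with_delimiter_fallback(text)
print(header, units, samples)
```

In the source shown for `load_csv`, the returned `data` still contains the units row, and `units` is a one-row DataFrame holding only that row.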
Source code in mibiscreen/data/load_data.py
def load_csv(
        file_path = None,
        verbose = False,
        store_provenance = False,
        ):
    """Function to load data from csv file.

    Args:
    -------
        file_path: str
            Name of the path to the file
        verbose: Boolean
            verbose flag
        store_provenance: Boolean
            To add!

    Returns:
    -------
        data: pd.DataFrame
            Tabular data
        units: pd.DataFrame
            Tabular data on units

    Raises:
    -------
        ValueError: If `file_path` is not specified
        OSError: If `file_path` is not a valid file location

    Example:
    -------
       This function can be called with the file path of the example data as
       argument using:

        >>> from mibiscreen.data import load_csv
        >>> load_csv("example_data.csv")

    """
    if verbose:
        print('==================================')
        print(" Running function 'load_csv()'")
        print('==================================')

    if file_path is None:
        raise ValueError('Specify file path and file name!')
    if not os.path.isfile(file_path):
        raise OSError('Cannot access file at: {}'.format(file_path))

    if verbose:
        print("Reading data from file: {}".format(file_path))
        print('------------------------------------------------------------------')

    data = pd.read_csv(file_path, encoding="unicode_escape")
    if ";" in data.iloc[1].iloc[0]:
        data = pd.read_csv(file_path, sep=";", encoding="unicode_escape")

    _check_duplicates_in_df(data)

    units = data.drop(labels = np.arange(1,data.shape[0]))

    if verbose:
        print("Units of quantities:")
        print('-------------------')
        print(units)
        print('________________________________________________________________')
        print("Loaded data as pandas DataFrame:")
        print('--------------------------------')
        print(data)
        print('================================================================')

    return data, units

load_excel(file_path=None, sheet_name=0, verbose=False, store_provenance=False, **kwargs)

Function to load data from excel file.


file_path: str
    Name of the path to the file
sheet_name: int
    Number of the sheet in the excel file to load
verbose: Boolean
    verbose flag
store_provenance: Boolean
    To add!
**kwargs: optional keyword arguments to pass to pandas' routine
    read_excel(), e.g. usecols or skiprows

data: pd.DataFrame
    Tabular data
units: pd.DataFrame
    Tabular data on units

ValueError: If `file_path` is not specified
OSError: If `file_path` is not a valid file location
Example:

This function can be called with the file path of the example data as argument using:

>>> from mibiscreen.data import load_excel
>>> load_excel("example_data.xlsx")
Source code in mibiscreen/data/load_data.py
def load_excel(
        file_path = None,
        sheet_name = 0,
        verbose = False,
        store_provenance = False,
        **kwargs,
        ):
    """Function to load data from excel file.

    Args:
    -------
        file_path: str
            Name of the path to the file
        sheet_name: int
            Number of the sheet in the excel file to load
        verbose: Boolean
            verbose flag
        store_provenance: Boolean
            To add!
        **kwargs: optional keyword arguments to pass to pandas' routine
            read_excel(), e.g. usecols or skiprows

    Returns:
    -------
        data: pd.DataFrame
            Tabular data
        units: pd.DataFrame
            Tabular data on units

    Raises:
    -------
        ValueError: If `file_path` is not specified
        OSError: If `file_path` is not a valid file location

    Example:
    -------
       This function can be called with the file path of the example data as
       argument using:

        >>> from mibiscreen.data import load_excel
        >>> load_excel("example_data.xlsx")

    """
    if verbose:
        print('===================================')
        print(" Running function 'load_excel()'")
        print('===================================')

    if file_path is None:
        raise ValueError('Specify file path and file name!')
    if not os.path.isfile(file_path):
        raise OSError('Cannot access file at: {}'.format(file_path))

    data = pd.read_excel(file_path,
                         sheet_name = sheet_name,
                         **kwargs)

    if verbose:
        print("Reading data from file: {}".format(file_path))
        print('------------------------------------------------------------------')

    _check_duplicates_in_df(data)

    units = data.drop(labels = np.arange(1,data.shape[0]))

    if verbose:
        print("Units of quantities:")
        print('-------------------')
        print(units)
        print('________________________________________________________________')
        print("Loaded data as pandas DataFrame:")
        print('--------------------------------')
        print(data)
        print('================================================================')

    return data, units

set_data

Functions for data extraction and merging in preparation of analysis and plotting.

@author: Alraune Zech

compare_lists(list1, list2, verbose=False)

Checking overlap of two given lists.

Input
list1: list of strings
    given extensive list (usually column names of a pd.DataFrame)
list2: list of strings
    list of names to extract/check overlap with strings in list 'list1'
verbose: Boolean, default False
    verbosity flag
Output
(intersection, remainder_list1, remainder_list2): tuple of lists
    * intersection: list of strings present in both lists 'list1' and 'list2'
    * remainder_list1: list of strings only present in 'list1'
    * remainder_list2: list of strings only present in 'list2'
Example:

list1 = ['test1','test2']
list2 = ['test1','test3']

compare_lists(list1,list2) returns (['test1'],['test2'],['test3'])
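According to the source listed below, the comparison is built on Python set operations, so duplicates are removed and the order of the returned lists is not guaranteed. The following sketch reproduces that logic under a hypothetical name, `compare_lists_sketch`, and adds sorting, which the package function does not do, to make the output reproducible.

```python
def compare_lists_sketch(list1, list2):
    """Set-based overlap check; results are sorted so the order is reproducible."""
    intersection = sorted(set(list1) & set(list2))
    only_in_list1 = sorted(set(list1) - set(list2))
    only_in_list2 = sorted(set(list2) - set(list1))
    return intersection, only_in_list1, only_in_list2

result = compare_lists_sketch(["test1", "test2"], ["test1", "test3"])
print(result)
```

If the original column order matters downstream, a list comprehension over `list1` would preserve it; that is a design alternative, not what the source does.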

Source code in mibiscreen/data/set_data.py
def compare_lists(list1,
                  list2,
                  verbose = False,
                  ):
    """Checking overlap of two given lists.

    Input
    -----
        list1: list of strings
            given extensive list (usually column names of a pd.DataFrame)
        list2: list of strings
            list of names to extract/check overlap with strings in list 'list1'
        verbose: Boolean, default False
            verbosity flag

    Output
    ------
        (intersection, remainder_list1, remainder_list2): tuple of lists
            * intersection: list of strings present in both lists 'list1' and 'list2'
            * remainder_list1: list of strings only present in 'list1'
            * remainder_list2: list of strings only present in 'list2'

    Example:
    -------
    list1 = ['test1','test2']
    list2 = ['test1','test3']

    compare_lists(list1,list2) returns (['test1'],['test2'],['test3'])

    """
    intersection = list(set(list1) & set(list2))
    remainder_list1 = list(set(list1) - set(list2))
    remainder_list2 = list(set(list2) - set(list1))

    if verbose:
        print('================================================================')
        print(" Running function 'compare_lists()'")
        print('================================================================')
        print("strings present in both lists:", intersection)
        print("strings only present in either of the lists:", remainder_list1 +  remainder_list2)

    return (intersection,remainder_list1,remainder_list2)

determine_quantities(cols, name_list='all', verbose=False)

Select a subset of column names (from DataFrame).

Input
cols: list
    Names of quantities (column names) from pd.DataFrame
name_list: str or list of str, default is 'all'
    quantities to extract from column names.

    If a list of strings is provided, these will be selected from the list of column names (cols)
    If a string is provided, this is a short name for a specific group of quantities:
        - 'all' (all quantities given in data frame except settings)
        - short name for group of contaminants:
            - 'BTEX' (for benzene, toluene, ethylbenzene, xylene)
            - 'BTEXIIN' (for benzene, toluene, ethylbenzene, xylene,
                          indene, indane and naphthalene)
            - 'all_cont' (for all contaminants in name list)
        - short name for group of environmental parameters/geochemicals:
            - 'environmental_conditions'
            - 'geochemicals'
            - 'ONS':  non reduced electron acceptors (oxygen, nitrate, sulfate)
            - 'ONSFe': selected electron acceptors  (oxygen, nitrate, sulfate + iron II)
            - 'all_ea': all potential electron acceptors (non reduced & reduced)
            - 'NP': nutrients (nitrate, nitrite, phosphate)
        See also file mibiscreen/data/name_data for lists of quantities
verbose: Boolean
    verbose flag (default False)
Output
quantities: list
    list of strings with names of selected quantities present in dataframe
remainder: list
    list of strings with names of selected quantities not present in dataframe
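The selection rules above can be sketched in plain Python. The group definition, the setting names, and the function name `resolve_names` below are hypothetical placeholders; the package keeps its own lists in its settings modules. The sketch also omits the printed warnings, the xylene isomer handling, and the error raised when nothing matches.

```python
# Hypothetical group and setting names, for illustration only.
contaminant_groups = {"BTEX": ["benzene", "toluene", "ethylbenzene", "xylene"]}
sample_settings = ["sample_nr", "obs_well", "depth"]

def resolve_names(cols, name_list="all"):
    """Return (names found in cols, requested names missing from cols)."""
    if isinstance(name_list, str) and name_list == "all":
        # Everything except the setting columns.
        wanted = [c for c in cols if c not in sample_settings]
    elif isinstance(name_list, str):
        # Either a group short name or a single quantity.
        wanted = contaminant_groups.get(name_list, [name_list])
    elif isinstance(name_list, list) and all(isinstance(n, str) for n in name_list):
        wanted = name_list
    else:
        raise ValueError("name_list needs to be a string or a list of strings.")
    quantities = [n for n in wanted if n in cols]
    remainder = [n for n in wanted if n not in cols]
    return quantities, remainder

cols = ["sample_nr", "depth", "benzene", "toluene", "sulfate"]
quantities, remainder = resolve_names(cols, "BTEX")
print(quantities, remainder)
```

A non-empty `remainder` usually indicates that the column names have not been standardized yet, which is what the warning in the source points to.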
Source code in mibiscreen/data/set_data.py
def determine_quantities(cols,
         name_list = 'all',
         verbose = False,
         ):
    """Select a subset of column names (from DataFrame).

    Input
    -----
        cols: list
            Names of quantities (column names) from pd.DataFrame
        name_list: str or list of str, default is 'all'
            quantities to extract from column names.

            If a list of strings is provided, these will be selected from the list of column names (cols)
            If a string is provided, this is a short name for a specific group of quantities:
                - 'all' (all quantities given in data frame except settings)
                - short name for group of contaminants:
                    - 'BTEX' (for benzene, toluene, ethylbenzene, xylene)
                    - 'BTEXIIN' (for benzene, toluene, ethylbenzene, xylene,
                                  indene, indane and naphthalene)
                    - 'all_cont' (for all contaminants in name list)
                - short name for group of environmental parameters/geochemicals:
                    - 'environmental_conditions'
                    - 'geochemicals'
                    - 'ONS':  non reduced electron acceptors (oxygen, nitrate, sulfate)
                    - 'ONSFe': selected electron acceptors  (oxygen, nitrate, sulfate + iron II)
                    - 'all_ea': all potential electron acceptors (non reduced & reduced)
                    - 'NP': nutrients (nitrate, nitrite, phosphate)
                See also file mibiscreen/data/name_data for lists of quantities
        verbose: Boolean
            verbose flag (default False)

    Output
    ------
        quantities: list
            list of strings with names of selected quantities present in dataframe
        remainder: list
            list of strings with names of selected quantities not present in dataframe

    """
    if name_list == 'all':
        ### choosing all column names except those of settings
        list_names = list(set(cols) - set(sample_settings))
        if verbose:
            print("Selecting all data columns except for those with settings.")

    elif isinstance(name_list, str):
        if name_list in contaminant_groups.keys():
            verbose_text = "Selecting specific group of contaminants:"
            list_names = contaminant_groups[name_list].copy()
            if (names.name_o_xylene in cols) and (names.name_pm_xylene in cols):
                list_names.remove(names.name_xylene) # handling of xylene isomers

        elif name_list in environment_groups.keys():
            verbose_text = "Selecting specific group of geochemicals:"
            list_names = environment_groups[name_list].copy()

        else:
            verbose_text = "Selecting single quantity:"
            list_names = [name_list]

        if verbose:
            print(verbose_text,name_list)
            print('_____________________________________________________________')

    elif isinstance(name_list, list): # choosing specific list of column names except those of settings
        if not all(isinstance(item, str) for item in name_list):
            raise ValueError("Keyword 'name_list' needs to be a string or a list of strings.")
        list_names = name_list
        if verbose:
            print("Selecting all names from provided list.")

    else:
        raise ValueError("Keyword 'name_list' needs to be a string or a list of strings.")

    quantities,_,remainder_list2 = compare_lists(cols,list_names)

    if not quantities:
        raise ValueError("No quantities from name list '{}' provided in data. "
                         "Presumably data not in standardized format. "
                         "Run 'standardize()' first.".format(name_list))

    if verbose:
        print("Selected set of quantities: \n---------------------------")
        print(*quantities,sep='\n')
        print('_____________________________________________________________')

    if remainder_list2:
        print("WARNING: There are quantities from name list not in data")
        if verbose:
            print(*remainder_list2,sep='\n')
        print("Maybe data not in standardized format. Run 'standardize()' first.")
        print("_________________________________________________________________")


    return quantities,remainder_list2

extract_data(data_frame, name_list, keep_setting_data=True, verbose=False)

Extracting data of specified variables from dataframe.


data_frame: pandas.DataFrame
    dataframe with the measurements
name_list: list of strings
    list of column names to extract from dataframe
keep_setting_data: bool, default True
    Whether to keep setting data in the DataFrame.
verbose: Boolean
    verbose flag (default False)

data: pd.DataFrame
    dataframe with the measurements
Raises:

None (yet).

Example:

To be added.

Source code in mibiscreen/data/set_data.py
def extract_data(data_frame,
                 name_list,
                 keep_setting_data = True,
                 verbose = False,
                 ):
    """Extracting data of specified variables from dataframe.

    Args:
    -------
        data_frame: pandas.DataFrame
            dataframe with the measurements
        name_list: list of strings
            list of column names to extract from dataframe
        keep_setting_data: bool, default True
            Whether to keep setting data in the DataFrame.
        verbose: Boolean
            verbose flag (default False)

    Returns:
    -------
        data: pd.DataFrame
            dataframe with the measurements

    Raises:
    -------
    None (yet).

    Example:
    -------
    To be added.

    """
    ### check on correct data input format and extracting column names as list
    data,cols= check_data_frame(data_frame,inplace = False)

    quantities, _ = determine_quantities(cols,
                                      name_list = name_list,
                                      verbose = verbose)

    if keep_setting_data:
        settings,_,_ = compare_lists(cols,sample_settings)
        i1,quantities_without_settings,_ = compare_lists(quantities,settings)
        columns_names = settings + quantities_without_settings

    else:
        columns_names = quantities

    return data[columns_names]

extract_settings(data_frame, verbose=False)

Extracting setting data from dataframe.


data_frame: pandas.DataFrame
    dataframe with the measurements
verbose: Boolean
    verbose flag (default False)

data: pd.DataFrame
    dataframe with settings
Raises:

None (yet).

Example:

To be added.

Source code in mibiscreen/data/set_data.py
def extract_settings(data_frame,
                     verbose = False,
                     ):
    """Extracting setting data from dataframe.

    Args:
    -------
        data_frame: pandas.DataFrame
            dataframe with the measurements
        verbose: Boolean
            verbose flag (default False)

    Returns:
    -------
        data: pd.DataFrame
            dataframe with settings

    Raises:
    -------
    None (yet).

    Example:
    -------
    To be added.

    """
    ### check on correct data input format and extracting column names as list
    data,cols= check_data_frame(data_frame,inplace = False)

    settings,r1,r2 = compare_lists(cols,sample_settings)

    if verbose:
        print("Settings available in data: ", settings)

    return data[settings]

merge_data(data_frames_list, how='outer', on=[names.name_sample], clean=True, **kwargs)

Merging dataframes along columns on similar sample name.


data_frames_list: list of pd.DataFrame
    list of dataframes with the measurements
how: str, default 'outer'
    Type of merge to be performed.
    corresponds to keyword in pd.merge()
    {'left', 'right', 'outer', 'inner', 'cross'}, default 'outer'
on: list, default [names.name_sample]
    Column name(s) to join on.
    corresponds to keyword in pd.merge()
clean: Boolean, default True
    Whether to drop columns which are in all provided data_frames
    (on which not to merge, potentially other settings than sample_name)
**kwargs: dict
    optional keyword arguments to be passed to pd.merge()

data: pd.DataFrame
    dataframe with the measurements
Raises:

ValueError: If fewer than two dataframes are provided.

Example:

To be added.
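Until an official example is added, the merge step can be sketched directly with `pd.merge`. The column names and values below are hypothetical. The sketch mirrors what the source does when `clean` is True: columns shared by both frames that are not merge keys are dropped from the frame being added before merging.

```python
import pandas as pd

# Hypothetical column names and values, for illustration only.
env = pd.DataFrame({"sample_nr": ["s-001", "s-002"],
                    "depth": [-12.0, -15.5],
                    "sulfate": [23.0, 0.0]})
cont = pd.DataFrame({"sample_nr": ["s-002", "s-003"],
                     "depth": [-15.5, -17.0],
                     "benzene": [179.0, 853.0]})

on = ["sample_nr"]

# Columns present in both frames but not used as merge keys ("depth" here).
shared = set(env.columns) & set(cont.columns)
duplicates = sorted(shared - set(on))

# Drop them from the frame being added, then merge on the sample name.
merged = pd.merge(env, cont.drop(columns=duplicates), how="outer", on=on)
print(merged)
```

With an outer merge, samples present in only one frame are kept and their missing values are NaN. Because the shared `depth` column is taken from the first frame only, the sample that appears only in the second frame ends up without a depth.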

Source code in mibiscreen/data/set_data.py
def merge_data(data_frames_list,
               how='outer',
               on=[names.name_sample],
               clean = True,
               **kwargs,
               ):
    """Merging dataframes along columns on similar sample name.

    Args:
    -------
        data_frames_list: list of pd.DataFrame
            list of dataframes with the measurements
        how: str, default 'outer'
            Type of merge to be performed.
            corresponds to keyword in pd.merge()
            {'left', 'right', 'outer', 'inner', 'cross'}, default 'outer'
        on: list, default [names.name_sample]
            Column name(s) to join on.
            corresponds to keyword in pd.merge()
        clean: Boolean, default True
            Whether to drop columns which are in all provided data_frames
            (on which not to merge, potentially other settings than sample_name)
        **kwargs: dict
            optional keyword arguments to be passed to pd.merge()

    Returns:
    -------
        data: pd.DataFrame
            dataframe with the measurements

    Raises:
    -------
    ValueError: If fewer than two dataframes are provided.

    Example:
    -------
    To be added.

    """
    if len(data_frames_list)<2:
        raise ValueError('Provide List of DataFrames.')


    data_merge = data_frames_list[0]
    for data_add in data_frames_list[1:]:
        if clean:
            intersection,remainder_list1,remainder_list2 = compare_lists(
                data_merge.columns.to_list(),data_add.columns.to_list())
            intersection,remainder_list1,remainder_list2 = compare_lists(intersection,on)
            data_add = data_add.drop(labels = remainder_list1+remainder_list2,axis = 1)
        data_merge = pd.merge(data_merge,data_add, how=how, on=on,**kwargs)
        # complete data set, where values of porosity are added (otherwise nan)

    return data_merge

settings

mibiscreen module for settings of data.

contaminants

Specifications of petroleum hydrocarbon related contaminants.

List of (PAH) contaminants measured in groundwater samples useful for biodegradation and bioremediation analysis

@author: A. Zech

environment

Specifications of geochemicals.

List of geochemicals measured in groundwater samples useful for biodegradation and bioremediation analysis

@author: A. Zech

isotopes

Specifications of isotopes.

List of basic isotopes measured in groundwater samples useful for biodegradation and bioremediation analysis

@author: A. Zech

metabolites

Specifications of metabolites.

List of basic metabolites measured in groundwater samples useful for biodegradation and bioremediation analysis

@author: A. Zech

sample_settings

Specifications of sample settings.

List of all quantities and parameters characterizing sample specifics for measurements in groundwater samples useful for biodegradation and bioremediation analysis

@author: A. Zech

standard_names

Name specifications of data!

Listed are all standard names of quantities and parameters measured in groundwater samples useful for biodegradation and bioremediation analysis

@author: A. Zech

unit_settings

Unit specifications of data!

File containing unit specifications of quantities and parameters measured in groundwater samples useful for biodegradation and bioremediation analysis.

@author: Alraune Zech