mibiscreen.data API reference

mibiscreen module for data handling.

check_data

Functions for data handling and standardization.

@author: Alraune Zech

check_columns(data_frame, standardize=False, reduce=False, verbose=True)

Function checking names of columns of data frame.

Function that looks at the column names and links them to standard names. Optionally, it renames identified columns to the standard names of the model and drops columns that could not be identified.


Args:

data_frame: pd.DataFrame
    dataframe with the measurements
standardize: Boolean, default False
    Whether to rename identified columns to their standard names (modifies data_frame in place)
reduce: Boolean, default False
    Whether to reduce data to known quantities (drops unknown columns from data_frame in place)
verbose: Boolean, default True
    verbosity flag

Returns:

tuple: three lists containing
        list with identified quantities in data (original, not standardized names)
        list with unknown quantities in data (not in list of standardized names)
        list with standard names of identified quantities
Raises:

None (yet).

Example:
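
The renaming and reduction steps of check_columns() can be sketched with plain Python lists standing in for the data frame columns. The mapping `transform` and all column names below are hypothetical; in mibiscreen the mapping is produced by standard_names().

```python
# Hypothetical mapping from identified column names to standard names.
# In check_columns() this dictionary is provided by standard_names().
transform = {"Benzeen": "benzene", "PH": "pH"}

# Hypothetical column names of a measurement table.
columns = ["PH", "Benzeen", "lab_comment"]

# Split the columns into identified and unknown names.
known = [c for c in columns if c in transform]
unknown = [c for c in columns if c not in transform]
standard = [transform[c] for c in known]

# standardize=True: rename identified columns, keep unknown names unchanged.
renamed = [transform.get(c, c) for c in columns]

# reduce=True: drop the columns that were not identified.
reduced = [c for c in renamed if c not in unknown]

print(known, unknown, standard)
print(renamed)
print(reduced)
```

The three lists `known`, `unknown` and `standard` correspond to the tuple returned by check_columns(), in that order.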

Todo’s:
    - complete list of potential contaminants, environmental factors
    - add name check for metabolites?

Source code in mibiscreen/data/check_data.py
def check_columns(data_frame,
                  standardize = False,
                  reduce = False,
                  verbose = True):
    """Function checking names of columns of data frame.

    Function that looks at the column names and links them to standard names.
    Optionally, it renames identified columns to the standard names of the model.

    Args:
    -------
        data_frame: pd.DataFrame
            dataframe with the measurements
        standardize: Boolean, default False
            Whether to standardize identified column names
        reduce: Boolean, default False
            Whether to reduce data to known quantities
        verbose: Boolean, default True
            verbosity flag

    Returns:
    -------
        tuple: three lists containing
                list with identified quantities in data (original, not standardized names)
                list with unknown quantities in data (not in list of standardized names)
                list with standard names of identified quantities

    Raises:
    -------
    None (yet).

    Example:
    -------
    Todo's:
        - complete list of potential contaminants, environmental factors
        - add name check for metabolites?
    """
    if verbose:
        print('==============================================================')
        print(" Running function 'check_columns()' on data")
        print('==============================================================')

    data,cols= check_data_frame(data_frame,
                                sample_name_to_index = False,
                                inplace = True)

    results = standard_names(cols,
                             standardize = False,
                             reduce = False,
                             verbose = False,
                             )

    column_names_standard = results[0]
    column_names_known = results[1]
    column_names_unknown = results[2]
    column_names_transform = results[3]

    if standardize:
        data.columns = [column_names_transform.get(x, x) for x in data.columns]

    if reduce:
        data.drop(labels = column_names_unknown,axis = 1,inplace=True)

    if verbose:
        print("{} quantities identified in provided data.".format(len(column_names_known)))
        print("List of names with standard names:")
        print('----------------------------------')
        for i,name in enumerate(column_names_known):
            print(name," --> ",column_names_standard[i])
        print('----------------------------------')
        if standardize:
            print("Identified column names have been standardized")
        else:
            print("\nRenaming can be done by setting keyword 'standardize' to True.\n")
        print('________________________________________________________________')
        print("{} quantities have not been identified in provided data:".format(len(column_names_unknown)))
        print('---------------------------------------------------------')
        for i,name in enumerate(column_names_unknown):
            print(name)
        print('---------------------------------------------------------')
        if reduce:
            print("Not identified quantities have been removed from data frame")
        else:
            print("\nReduction to known quantities can be done by setting keyword 'reduce' to True.\n")
        print('================================================================')

    return (column_names_known,column_names_unknown,column_names_standard)

check_data_frame(data_frame, sample_name_to_index=False, inplace=False)

Check that the data has the correct format.

Tests if provided data is a pandas data frame and returns its column names. Optionally it sets the sample name as index.

Input
data_frame: pd.DataFrame
    quantities for data analysis given per sample
sample_name_to_index:  Boolean, default False
    Whether to set the sample name to the index of the DataFrame
inplace: Boolean, default False
    Whether to modify the DataFrame rather than creating a new one.
Output
data: pd.DataFrame
    given dataframe (a copy unless inplace is True), with the index set to the sample name if requested
cols: list
    List of column names
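
The copy-versus-inplace behaviour described above can be sketched without pandas. In this sketch a plain dict of lists stands in for the DataFrame, and the function name `check_table` and the column names are hypothetical.

```python
import copy


def check_table(table, inplace=False):
    """Validate the container type and return it together with its column names."""
    if not isinstance(table, dict):
        raise ValueError("Data has to be a dict but is given as type {}".format(type(table)))
    # inplace=False works on a deep copy, so the caller's object stays untouched
    data = table if inplace else copy.deepcopy(table)
    return data, list(data.keys())


original = {"sample_nr": ["2000-001"], "depth": [-12.0]}
data_copy, cols = check_table(original)
data_same, _ = check_table(original, inplace=True)

print(cols)
print(data_copy is original, data_same is original)
```

With `inplace=False` later changes to the returned object do not affect the input, which is why standardize() calls check_data_frame() that way.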
Source code in mibiscreen/data/check_data.py
def check_data_frame(data_frame,
                     sample_name_to_index = False,
                     inplace = False,
                     ):
    """Checking data on correct format.

    Tests if provided data is a pandas data frame and provides column names.
    Optionally it sets the sample name as index.

    Input
    -----
        data_frame: pd.DataFrame
            quantities for data analysis given per sample
        sample_name_to_index:  Boolean, default False
            Whether to set the sample name to the index of the DataFrame
        inplace: Boolean, default False
            Whether to modify the DataFrame rather than creating a new one.

    Output
    ------
        data: pd.DataFrame
            copy of given dataframe with index set to sample name
        cols: list
            List of column names
    """
    if not isinstance(data_frame, pd.DataFrame):
        # only DataFrames pass this check; adjacent literals avoid stray whitespace in the message
        raise ValueError("Data has to be a pandas DataFrame "
                         "but is given as type {}".format(type(data_frame)))

    if inplace is False:
        data = data_frame.copy()
    else:
        data = data_frame

    if sample_name_to_index:
        if names.name_sample not in data.columns:
            print("Warning: No sample name provided for making index. Consider standardizing data first")
        else:
            data.set_index(names.name_sample,inplace = True)

    if isinstance(data, pd.Series):
        cols = [data.name]
    else:
        cols = data.columns.to_list()

    return data, cols

check_units(data, verbose=True)

Function to check the units of the measurements.


Args:

data: pandas.DataFrame
    dataframe with the measurements where first row contains
    the units or a dataframe with only the column names and units
verbose: Boolean
    verbose statement (default True)

Returns:

col_check_list: list
    quantities whose units need checking/correction

Raises:

ValueError: If `data` is not a data frame or if no known unit is found in its first row
Example:
To be added.
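
The unit comparison can be sketched as a lookup of each given unit among the accepted spellings of the standard unit. The lookup tables, unit spellings and quantity names below are hypothetical stand-ins for the property dictionaries that mibiscreen defines itself.

```python
# Hypothetical lookup tables; mibiscreen keeps its own property dictionaries.
standard_unit = {"benzene": "ug/l", "pH": None}  # None marks a unit-less quantity
accepted_spellings = {"ug/l": ["ug/l", "µg/l", "microg/l"]}


def units_to_check(units_row):
    """Return the quantities whose given unit is not an accepted spelling."""
    flagged = []
    for quantity, unit in units_row.items():
        if quantity not in standard_unit:
            continue  # unknown quantity: its unit is not checked
        target = standard_unit[quantity]
        if target is None:
            continue  # unit-less quantity: nothing to compare
        # comparison is case-insensitive, as in check_units()
        if str(unit).lower() not in accepted_spellings[target]:
            flagged.append(quantity)
    return flagged


result = units_to_check({"benzene": "mg/L", "pH": " ", "lab_comment": "-"})
print(result)
```

Quantities that are not recognized are skipped rather than flagged, which mirrors the separate list of unchecked columns that check_units() prints.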
Source code in mibiscreen/data/check_data.py
def check_units(data,
                verbose = True):
    """Function to check the units of the measurements.

    Args:
    -------
        data: pandas.DataFrame
            dataframe with the measurements where first row contains
            the units or a dataframe with only the column names and units
        verbose: Boolean
            verbose statement (default True)

    Returns:
    -------
        col_check_list: list
            quantities whose units need checking/correction

    Raises:
    -------
        None (yet).

    Example:
    -------
        To be added.
    """
    if verbose:
        print('================================================================')
        print(" Running function 'check_units()' on data")
        print('================================================================')

    if not isinstance(data, pd.DataFrame):
        raise ValueError("Provided data is not a data frame.")
    elif data.shape[0]>1:
        units = data.drop(labels = np.arange(1,data.shape[0]))
    else:
        units = data.copy()

    ### testing if provided data frame contains any units (at all)
    units_in_data = set(map(lambda x: str(x).lower(), units.iloc[0,:].values))
    test_unit = False
    for u in all_units:
        if u in units_in_data:
            test_unit = True
            break
    if not test_unit:
        raise ValueError("Error: The second line in the dataframe is supposed "
                         "to specify the units. No units were detected in this "
                         "line, check https://mibipret.github.io/mibiscreen/ Data "
                         "documentation.")

    # standardize column names (as it might not have happened for data yet)
    check_columns(units,standardize = True, verbose = False)
    col_check_list= []
    col_not_checked  = []


    properties_all = {**properties_sample_settings,
                      **properties_geochemicals,
                      **properties_contaminants,
                      **properties_metabolites,
                      **properties_isotopes,
    }

    ### run through all quantity columns and check their units
    for quantity in units.columns:
        if quantity in properties_all.keys():
            standard_unit = properties_all[quantity]['standard_unit']
        elif quantity.split('-')[0] in properties_all.keys(): # test on isotope
            standard_unit = properties_all[quantity.split('-')[0]]['standard_unit']
        else:
            col_not_checked.append(quantity)
            continue

        if standard_unit != names.unit_less:
            other_names_unit = properties_units[standard_unit]['other_names']
            if str(units[quantity][0]).lower() not in other_names_unit:
                col_check_list.append(quantity)
                if verbose:
                    print("Warning: Check unit of {}!\n Given in {}, but must be in {}."
                              .format(quantity,units[quantity][0],standard_unit))

    if verbose:
        print('________________________________________________________________')
        if len(col_check_list) == 0:
            print(" All identified quantities given in requested units.")
        else:
            print(" All other identified quantities given in requested units.")
        print(" Quantities not identified (and thus not checked on units):", col_not_checked)
        print('================================================================')

    return col_check_list

check_values(data_frame, inplace=False, verbose=True)

Function that checks on value types and replaces non-measured values.


Args:

data_frame: pandas.DataFrame
    dataframe with the measurements (without first row of units)
inplace: Boolean, default False
    Whether to modify the DataFrame rather than creating a new one.
verbose: Boolean
    verbose statement (default True)

Returns:

data_pure: pandas.DataFrame
    Tabular data without units, with columns converted to numeric values where possible

Raises:

None (yet).
Example:
To be added.
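
The replacement of non-measured values and the numeric conversion can be sketched per column with the standard library. The marker list `to_replace` is hypothetical (mibiscreen keeps its own list), and the all-or-nothing conversion only approximates how `pd.to_numeric` fails for a whole column.

```python
import math

# Hypothetical markers for non-measured values; mibiscreen keeps its own list.
to_replace = ["-", "--", "", "nd"]


def column_to_numeric(values):
    """Convert a whole column to floats, or return it unchanged if any entry fails."""
    cleaned = [math.nan if v in to_replace else v for v in values]
    try:
        return [float(v) for v in cleaned], True
    except (TypeError, ValueError):
        # at least one entry is not numeric: keep the column as it was
        return values, False


benzene, benzene_ok = column_to_numeric(["263", "-", "853"])
wells, wells_ok = column_to_numeric(["B-MLS1-3-12", "B-MLS1-5-15"])

print(benzene, benzene_ok)
print(wells, wells_ok)
```

Columns such as well names stay untouched, which corresponds to the warning that check_values() prints when a column cannot be transformed.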
Source code in mibiscreen/data/check_data.py
def check_values(data_frame,
                 inplace = False,
                 verbose = True,
                 ):
    """Function that checks on value types and replaces non-measured values.

    Args:
    -------
        data_frame: pandas.DataFrame
            dataframe with the measurements (without first row of units)
        inplace: Boolean, default False
            Whether to modify the DataFrame rather than creating a new one.
        verbose: Boolean
            verbose statement (default True)

    Returns:
    -------
        data_pure: pandas.DataFrame
            Tabular data with standard column names and without units

    Raises:
    -------
        None (yet).

    Example:
    -------
        To be added.
    """
    if verbose:
        print('================================================================')
        print(" Running function 'check_values()' on data")
        print('================================================================')

    data,cols= check_data_frame(data_frame, inplace = inplace)

    ### testing if provided data frame contains first row with units
    for u in data.iloc[0].to_list():
        if u in all_units:
            print("WARNING: First row identified as units, has been removed for value check")
            print('________________________________________________________________')
            data.drop(labels = 0,inplace = True)
            break

    for sign in to_replace_list:
        data.iloc[:,:] = data.iloc[:,:].replace(to_replace=sign, value=to_replace_value)

    # standardize column names (as it might not have happened for data yet)
    # check_columns(data,
    #               standardize = True,
    #               check_metabolites=True,
    #               verbose = False)

    # transform data to numeric values
    quantities_transformed = []
    for quantity in cols: #data.columns:
        try:
            # data_pure.loc[:,quantity] = pd.to_numeric(data_pure.loc[:,quantity])
            data[quantity] = pd.to_numeric(data[quantity])
            quantities_transformed.append(quantity)
        except ValueError:
            print("WARNING: Could not transform '{}' to numerical values".format(quantity))
            print('________________________________________________________________')
    if verbose:
        print("Quantities with values transformed to numerical (int/float):")
        print('-----------------------------------------------------------')
        for name in quantities_transformed:
            print(name)
        print('================================================================')

    return data

standard_names(name_list, standardize=True, reduce=False, verbose=False)

Function transforming list of names to standard names.

Function that looks at the names (of e.g. environmental variables, contaminants, metabolites, isotopes, etc) and provides the corresponding standard names.


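Args:

name_list: string or list of strings
    names of quantities to be transformed to standard
standardize: Boolean, default True
    Whether to return the names in standardized form (a single list) or the full lookup result (a tuple)
reduce: Boolean, default False
    Whether to drop unknown names from the returned list (only used if standardize is True)
verbose: Boolean, default False
    verbosity flag

Returns:

If standardize is True:
    list with standard names of identified quantities, followed by the unknown names unless reduce is True
If standardize is False:
    tuple of four elements:
        list with standard names of identified quantities
        list with identified quantities in data (original, not standardized names)
        list with unknown quantities in data (not in list of standardized names)
        dictionary mapping identified names to their standard names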
Raises:

ValueError: If an entry in the provided list of names is not a string

Example:
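
The lookup logic can be sketched with small alias dictionaries. The alias tables below are hypothetical; mibiscreen generates its own from its property definitions. Names are matched case-insensitively, and isotope columns are treated as `<isotope>-<molecule>` pairs.

```python
# Hypothetical alias tables (lower-case alias -> standard name).
aliases_contaminants = {"benzene": "benzene", "benzeen": "benzene"}
aliases_isotopes = {"delta_13c": "delta_13C", "d13c": "delta_13C"}
aliases_all = {**aliases_contaminants, **aliases_isotopes, "ph": "pH"}


def lookup(name):
    """Return the standard name for `name`, or None if it is not recognized."""
    prefix = name.split("-")[0]
    isotope = aliases_isotopes.get(prefix.lower())
    if isotope is not None:
        # isotope columns are expected as '<isotope>-<molecule>'
        molecule = aliases_contaminants.get(name[len(prefix) + 1:].lower())
        return None if molecule is None else isotope + "-" + molecule
    return aliases_all.get(name.lower())


for candidate in ["PH", "d13C-Benzeen", "d13C-unknown", "lab_comment"]:
    print(candidate, "-->", lookup(candidate))
```

A name whose isotope part is recognized but whose molecule part is not ends up unknown, as in the source code of standard_names().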

Todo’s:
    - complete list of potential contaminants, environmental factors
    - add name check for metabolites?

Source code in mibiscreen/data/check_data.py
def standard_names(name_list,
                   standardize = True,
                   reduce = False,
                   verbose = False,
                   ):
    """Function transforming list of names to standard names.

    Function that looks at the names (of e.g. environmental variables, contaminants,
    metabolites, isotopes, etc) and provides the corresponding standard names.

    Args:
    -------
        name_list: string or list of strings
            names of quantities to be transformed to standard
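        standardize: Boolean, default True
            Whether to return the names in standardized form (a single list)
            or the full lookup result (a tuple)
        reduce: Boolean, default False
            Whether to drop unknown names from the returned list
            (only used if standardize is True)
        verbose: Boolean, default False
            verbosity flag

    Returns:
    -------
        If standardize is True:
            list with standard names of identified quantities,
            followed by the unknown names unless reduce is True
        If standardize is False:
            tuple of four elements:
                list with standard names of identified quantities
                list with identified quantities in data (original, not standardized names)
                list with unknown quantities in data (not in list of standardized names)
                dictionary mapping identified names to their standard names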

    Raises:
    -------
    None (yet).

    Example:
    -------
    Todo's:
        - complete list of potential contaminants, environmental factors
        - add name check for metabolites?
    """
    names_standard = []
    names_known = []
    names_unknown = []
    names_transform = {}


    if isinstance(name_list, str):
        name_list = [name_list]
    elif isinstance(name_list, list):
        for name in name_list:
            if not isinstance(name, str):
                raise ValueError("Entry in provided list of names is not a string:", name)

    properties_all = {**properties_sample_settings,
                      **properties_geochemicals,
                      **properties_contaminants,
                      **properties_metabolites,
                      **properties_isotopes,
                      **contaminants_analysis,
    }
    dict_names=_generate_dict_other_names(properties_all)

    other_names_contaminants = _generate_dict_other_names(properties_contaminants)
    other_names_isotopes = _generate_dict_other_names(properties_isotopes)

     # dict_names= other_names_all.copy()

    for x in name_list:
        y = dict_names.get(x, False)
        x_isotope = x.split('-')[0]
        y_isotopes = other_names_isotopes.get(x_isotope.lower(), False)

        if y_isotopes is not False:
            x_molecule = x.removeprefix(x_isotope+'-')
            y_molecule = other_names_contaminants.get(x_molecule.lower(), False)
            if y_molecule is False:
                names_unknown.append(x)
            else:
                y = y_isotopes+'-'+y_molecule
                names_known.append(x)
                names_standard.append(y)
                names_transform[x] = y
        else:
            y = dict_names.get(x.lower(), False)
            if y is False:
                names_unknown.append(x)
            else:
                names_known.append(x)
                names_standard.append(y)
                names_transform[x] = y

    if verbose:
        print('================================================================')
        print(" Running function 'standard_names()'")
        print('================================================================')
        print("{} of {} quantities identified in name list.".format(len(names_known),len(name_list)))
        print("List of names with standard names:")
        print('----------------------------------')
        for i,name in enumerate(names_known):
            print(name," --> ",names_standard[i])
        print('----------------------------------')
        if standardize:
            print("Identified column names have been standardized")
        else:
            print("\nRenaming can be done by setting keyword 'standardize' to True.\n")
        print('________________________________________________________________')
        print("{} quantities have not been identified in provided data:".format(len(names_unknown)))
        print("You can suggest missing quantities that could be added to the library here: <https://github.com/MiBiPreT/mibiscreen/issues/new/choose>")
        print('---------------------------------------------------------')
        for i,name in enumerate(names_unknown):
            print(name)
        print('---------------------------------------------------------')
        if reduce:
            print("Not identified quantities have been removed from data frame")
        else:
            print("\nReduction to known quantities can be done by setting keyword 'reduce' to True.\n")
        print('================================================================')

    if standardize:
        if reduce:
            return names_standard
        else:
            return names_standard + names_unknown
    else:
        return (names_standard, names_known, names_unknown, names_transform)

standardize(data_frame, reduce=True, store_csv=False, verbose=True)

Function providing condensed data frame with standardized names.

Function checks the names of columns and renames them, condenses data to identified column names, and checks units and values of the data frame.

Function that looks at the column names and renames the columns to the standard names of the model.


Args:

data_frame: pandas.DataFrame
    dataframe with the measurements
reduce: Boolean, default True
    whether to reduce data to known quantities (default True),
    otherwise full dataframe with renamed columns (for those identifiable) is returned
store_csv: str or False, default False
    file path for saving the dataframe in standard format to a csv-file;
    nothing is saved if False
verbose: Boolean, default True
    verbose statement

Returns:

data_numeric, units: pandas.DataFrames
    Tabular data with standardized column names, values in numerics etc
    and table with units for standardized column names

Raises:

None (yet).
Example:
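
The separation into a units table and a numeric data table can be sketched with a list of row dictionaries in place of the data frame. The function name `split_units`, the column names and the values are hypothetical.

```python
def split_units(rows):
    """Split a table (list of row dicts) into its units row and its sample rows."""
    if not rows:
        raise ValueError("table is empty")
    return rows[0], rows[1:]


table = [
    {"sample_nr": " ", "benzene": "ug/l"},        # units row
    {"sample_nr": "2000-001", "benzene": "263"},  # sample rows
    {"sample_nr": "2000-002", "benzene": "179"},
]

units, samples = split_units(table)
# only the sample rows are converted to numeric values
benzene = [float(row["benzene"]) for row in samples]

print(units)
print(benzene)
```

standardize() returns the analogous pair: the numeric data without the units row, and the units row on its own.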

Todo’s:
    - complete list of potential contaminants, environmental factors
    - add name check for metabolites?
    - add key-word to specify which data to extract (i.e. data columns to return)

Source code in mibiscreen/data/check_data.py
def standardize(data_frame,
                reduce = True,
                store_csv = False,
                verbose=True,
                ):
    """Function providing condensed data frame with standardized names.

    Function checks the names of columns and renames them,
    condenses data to identified column names, and checks units and values
    of the data frame.

    Function that looks at the column names and renames the columns to
    the standard names of the model.

    Args:
    -------
        data_frame: pandas.DataFrame
            dataframe with the measurements
        reduce: Boolean, default True
            whether to reduce data to known quantities (default True),
            otherwise full dataframe with renamed columns (for those identifiable) is returned
        store_csv: str or False, default False
            file path for saving the dataframe in standard format to a csv-file;
            nothing is saved if False
        verbose: Boolean, default True
            verbose statement

    Returns:
    -------
        data_numeric, units: pandas.DataFrames
            Tabular data with standardized column names, values in numerics etc
            and table with units for standardized column names

    Raises:
    -------
        None (yet).

    Example:
    -------
    Todo's:
        - complete list of potential contaminants, environmental factors
        - add name check for metabolites?
        - add key-word to specify which data to extract
            (i.e. data columns to return)

    """
    if verbose:
        print('================================================================')
        print(" Running function 'standardize()' on data")
        print('================================================================')
        print(' Function performing check of data including:')
        print('  * check of column names and standardizing them.')
        print('  * check of units and outlining which to adapt.')
        print('  * check of values, replacing empty values by nan \n    and making them numeric')

    data,cols= check_data_frame(data_frame,
                                sample_name_to_index = False,
                                inplace = False)

    # general column check & standardize column names
    check_columns(data,
                  standardize = True,
                  reduce = reduce,
                  verbose = verbose)

    # general unit check
    units = data.drop(labels = np.arange(1,data.shape[0]))
    col_check_list = check_units(units,
                                 verbose = verbose)

    # transform data to numeric values
    data_numeric = check_values(data.drop(labels = 0),
                                inplace = False,
                                verbose = verbose)

    # store standard data to file
    if store_csv:
        if len(col_check_list) != 0:
            print('________________________________________________________________')
            print("Data could not be saved because not all identified \n quantities are given in requested units.")
        else:
            try:
                data.to_csv(store_csv,index=False)
                if verbose:
                    print('________________________________________________________________')
                    print("Save standardized dataframe to file:\n", store_csv)
            except OSError:
                print("WARNING: data could not be saved. Check provided file path and name: {}".format(store_csv))
    if verbose:
        print('================================================================')

    return data_numeric, units

example_data

mibiscreen module for example data.

example_data

Example data.

Measurements on quantities and parameters in groundwater samples used for biodegradation and bioremediation analysis.

@author: Alraune Zech

example_data(data_type='all', with_units=False)

Function providing test data for mibiscreen data analysis.


Args:

data_type: string
    Type of data to return:
        -- "all": all types of data available
        -- "set_env_cont": well setting, environmental and contaminants data
        -- "setting": well setting data only
        -- "environment": data on environmental conditions
        -- "contaminants": data on contaminants
        -- "metabolites": data on metabolites
        -- "isotopes": data on isotopes
        -- "hydro": data on hydrogeological conditions (listed, but not handled by the source code shown here)
with_units: Boolean, default False
    flag to provide first row with units
    if False (no units), values in columns will be numerical
    if True (with units), values in columns will be objects

Returns:

pandas.DataFrame: Tabular data with standard column names

Raises:

ValueError: If `data_type` is not one of the available types
Example:
To be added!
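
How the table is assembled from column groups can be sketched with plain lists. The group contents, the `selections` table and the function name `build_table` are hypothetical and much smaller than the real example data. The well setting columns are always placed first, and the units row is only kept when `with_units` is True.

```python
# Hypothetical column groups: (column names, units row, one sample row).
groups = {
    "setting": (["sample_nr", "depth"], [" ", "m"], ["2000-001", -12.0]),
    "contaminants": (["benzene"], ["ug/l"], [263.0]),
}
# Which groups are combined for each data type; setting columns always come first.
selections = {"setting": ["setting"], "contaminants": ["setting", "contaminants"]}


def build_table(data_type, with_units=False):
    """Return column names and rows for the requested data type."""
    if data_type not in selections:
        raise ValueError("Specified data type '{}' not available".format(data_type))
    columns, units, sample = [], [], []
    for group in selections[data_type]:
        cols, unit_row, sample_row = groups[group]
        columns += cols
        units += unit_row
        sample += sample_row
    # the units row is the first row of the table, if requested
    rows = [units, sample] if with_units else [sample]
    return columns, rows


columns, rows = build_table("contaminants", with_units=True)
print(columns)
print(rows)
```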
Source code in mibiscreen/data/example_data/example_data.py
def example_data(data_type = 'all',
                 with_units = False,
                 ):
    """Function providing test data for mibiscreen data analysis.

    Args:
    -------
        data_type: string
            Type of data to return:
                -- "all": all types of data available
                -- "set_env_cont": well setting, environmental and contaminants data
                -- "setting": well setting data only
                -- "environment": data on environmental conditions
                -- "contaminants": data on contaminants
                -- "metabolites": data on metabolites
                -- "isotopes": data on isotopes
                -- "hydro": data on hydrogeological conditions
        with_units: Boolean, default False
            flag to provide first row with units
            if False (no units), values in columns will be numerical
            if True (with units), values in columns will be objects

    Returns:
    -------
        pandas.DataFrame: Tabular data with standard column names

    Raises:
    -------
        None

    Example:
    -------
        To be added!
    """
    mgl = names.unit_mgperl
    microgl = names.unit_microgperl

    setting = [names.name_sample,names.name_observation_well,names.name_sample_depth]
    setting_units = [' ',' ',names.unit_meter]
    setting_s01 = ['2000-001', 'B-MLS1-3-12', -12.]
    setting_s02 = ['2000-002', 'B-MLS1-5-15', -15.5]
    setting_s03 = ['2000-003', 'B-MLS1-6-17', -17.]
    setting_s04 = ['2000-004', 'B-MLS1-7-19', -19.]

    environment = [names.name_pH,
                   names.name_EC,
                   names.name_redox,
                   names.name_oxygen,
                   names.name_nitrate,
                   names.name_nitrite,
                   names.name_sulfate,
                   names.name_ammonium,
                   names.name_sulfide,
                   names.name_methane,
                   names.name_iron2,
                   names.name_manganese2,
                   names.name_phosphate]

    environment_units = [' ',names.unit_microsimpercm,names.unit_millivolt,
                         mgl,mgl,mgl,mgl,mgl,mgl,mgl,mgl,mgl,mgl]
    environment_s01 = [7.23, 322., -208.,0.3,122.,0.58, 23., 5., 0., 748., 3., 1.,1.6]
    environment_s02 = [7.67, 405., -231.,0.9,5.,0.0, 0., 6., 0., 2022., 1., 0.,0]
    environment_s03 = [7.75, 223., -252.,0.1,3.,0.03, 1., 13., 0., 200., 1., 0.,0.8]
    environment_s04 = [7.53, 58., -317.,0., 180.,1., 9., 15., 6., 122., 0., 0.,0.1]

    contaminants = [names.name_benzene,
                    names.name_toluene,
                    names.name_ethylbenzene,
                    names.name_pm_xylene,
                    names.name_o_xylene,
                    names.name_indane,
                    names.name_indene,
                    names.name_naphthalene]

    contaminants_units = [microgl,microgl,microgl,microgl,
                          microgl,microgl,microgl,microgl]
    contaminants_s01 = [263., 2., 269., 14., 51., 1254., 41., 2207.]
    contaminants_s02 = [179., 7., 1690., 751., 253., 1352., 15., 5410.]
    contaminants_s03 = [853., 17., 1286., 528., 214., 1031., 31., 3879.]
    contaminants_s04 = [1254., 10., 1202., 79., 61., 814., 59., 1970.]

    metabolites = [names.name_phenol,
                   names.name_cinnamic_acid,
                   names.name_benzoic_acid]

    metabolites_units = [microgl,microgl,microgl]
    metabolites_s01 = [0.2, 0.4, 1.4]
    metabolites_s02 = [np.nan, 0.1, 0.]
    metabolites_s03 = [0., 11.4, 5.4]
    metabolites_s04 = [0.3, 0.5, 0.7]

    # isotopes = ['delta_13C-benzene','delta_2H-benzene']
    isotopes = [names.name_13C+'-'+names.name_benzene,
                names.name_2H+'-'+names.name_benzene,
                ]

    isotopes_units = [names.unit_permil,names.unit_permil]
    isotopes_s01 = [-26.1,-106.]
    isotopes_s02 = [-25.8,-110.]
    isotopes_s03 = [-24.1,-118.]
    isotopes_s04 = [-24.1,-117.]

    if  data_type == 'setting':
        data = pd.DataFrame([setting_units,setting_s01,setting_s02,setting_s03,
                             setting_s04],columns = setting)

    elif  data_type == 'environment':
        units = setting_units+environment_units
        columns = setting+environment
        sample_01 = setting_s01+environment_s01
        sample_02 = setting_s02+environment_s02
        sample_03 = setting_s03+environment_s03
        sample_04 = setting_s04+environment_s04

        data = pd.DataFrame([units,sample_01,sample_02,sample_03,sample_04],
                            columns = columns)

    elif  data_type == 'contaminants':
        units = setting_units+contaminants_units
        columns = setting+contaminants
        sample_01 = setting_s01+contaminants_s01
        sample_02 = setting_s02+contaminants_s02
        sample_03 = setting_s03+contaminants_s03
        sample_04 = setting_s04+contaminants_s04

        data = pd.DataFrame([units,sample_01,sample_02,sample_03,sample_04],
                            columns = columns)

    elif  data_type == 'metabolites':

        units = setting_units+metabolites_units
        columns = setting+metabolites
        sample_01 = setting_s01+metabolites_s01
        sample_02 = setting_s02+metabolites_s02
        sample_03 = setting_s03+metabolites_s03
        sample_04 = setting_s04+metabolites_s04

        data = pd.DataFrame([units,sample_01,sample_02,sample_03,sample_04],
                            columns = columns)

    elif  data_type == 'isotopes':

        units = setting_units+isotopes_units
        columns = setting+isotopes
        sample_01 = setting_s01+isotopes_s01
        sample_02 = setting_s02+isotopes_s02
        sample_03 = setting_s03+isotopes_s03
        sample_04 = setting_s04+isotopes_s04

        data = pd.DataFrame([units,sample_01,sample_02,sample_03,sample_04],
                            columns = columns)

    elif data_type == "set_env_cont":

        units = setting_units+environment_units+contaminants_units
        columns = setting+environment+contaminants
        sample_01 = setting_s01+environment_s01+contaminants_s01
        sample_02 = setting_s02+environment_s02+contaminants_s02
        sample_03 = setting_s03+environment_s03+contaminants_s03
        sample_04 = setting_s04+environment_s04+contaminants_s04

        data = pd.DataFrame([units,sample_01,sample_02,sample_03,sample_04],
                            columns = columns)

    elif data_type == 'all':
        units = setting_units+environment_units+contaminants_units+metabolites_units + isotopes_units
        columns = setting+environment+contaminants+metabolites + isotopes
        sample_01 = setting_s01+environment_s01+contaminants_s01+metabolites_s01+isotopes_s01
        sample_02 = setting_s02+environment_s02+contaminants_s02+metabolites_s02+isotopes_s02
        sample_03 = setting_s03+environment_s03+contaminants_s03+metabolites_s03+isotopes_s03
        sample_04 = setting_s04+environment_s04+contaminants_s04+metabolites_s04+isotopes_s04

        data = pd.DataFrame([units,sample_01,sample_02,sample_03,sample_04],
                            columns = columns)

    else:
        raise ValueError("Specified data type '{}' not available".format(data_type))

    if not with_units:
        data.drop(0,inplace = True)
        for quantity in data.columns[2:]:
            data[quantity] = pd.to_numeric(data[quantity])

    return data
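
The effect of the `with_units` switch can be illustrated on a small frame. This is a minimal sketch with made-up column names and values (not the library's actual example data): the first row holds units, so dropping it and converting the measurement columns yields numeric dtypes.

```python
import pandas as pd

# Hypothetical mini data set: row 0 holds units, later rows hold samples.
columns = ["sample_nr", "obs_well", "depth", "benzene"]
rows = [
    ["-", "-", "m", "ug/L"],          # units row
    ["s01", "well_A", -6.0, 263.0],   # illustrative values only
    ["s02", "well_B", -7.0, 2.0],
]
data = pd.DataFrame(rows, columns=columns)

# Same steps as the `with_units = False` branch: drop the units row,
# then convert every column after the first two to numeric.
data = data.drop(0)
for quantity in data.columns[2:]:
    data[quantity] = pd.to_numeric(data[quantity])

print(data.dtypes)
```

After the drop, the index still starts at 1; `reset_index(drop=True)` gives a zero-based index if one is needed.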

load_data

Functions for data I/O handling.

@author: Alraune Zech

load_csv(file_path=None, verbose=False, store_provenance=False)

Function to load data from csv file.


file_path: str
    Name of the path to the file
verbose: Boolean
    verbose flag
store_provenance: Boolean
    To add!

data: pd.DataFrame
    Tabular data
units: pd.DataFrame
    Tabular data on units

ValueError: If `file_path` is not specified
OSError: If `file_path` is not a valid file location
Example:

This function can be called with the file path of the example data as argument using:

>>> from mibiscreen.data import load_csv
>>> data, units = load_csv("example_data.csv")
Source code in mibiscreen/data/load_data.py
def load_csv(
        file_path = None,
        verbose = False,
        store_provenance = False,
        ):
    """Function to load data from csv file.

    Args:
    -------
        file_path: str
            Name of the path to the file
        verbose: Boolean
            verbose flag
        store_provenance: Boolean
            To add!

    Returns:
    -------
        data: pd.DataFrame
            Tabular data
        units: pd.DataFrame
            Tabular data on units

    Raises:
    -------
        ValueError: If `file_path` is not specified
        OSError: If `file_path` is not a valid file location

    Example:
    -------
       This function can be called with the file path of the example data as
       argument using:

        >>> from mibiscreen.data import load_csv
        >>> data, units = load_csv("example_data.csv")

    """
    if file_path is None:
        raise ValueError('Specify file path and file name!')
    if not os.path.isfile(file_path):
        raise OSError('Cannot access file at : ',file_path)

    data = pd.read_csv(file_path, encoding="unicode_escape")
    if ";" in data.iloc[1].iloc[0]:
        data = pd.read_csv(file_path, sep=";", encoding="unicode_escape")
    units = data.drop(labels = np.arange(1,data.shape[0]))

    if verbose:
        print('================================================================')
        print(" Running function 'load_csv()' on data file ", file_path)
        print('================================================================')
        print("Units of quantities:")
        print('-------------------')
        print(units)
        print('________________________________________________________________')
        print("Loaded data as pandas DataFrame:")
        print('--------------------------------')
        print(data)
        print('================================================================')

    return data, units
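
The separator fallback and the units split in `load_csv` can be tried without a file on disk. This is a minimal sketch, assuming made-up file content read from an in-memory buffer; the `str()` wrapper around the tested cell is an addition for robustness and is not part of the listing above.

```python
import io
import pandas as pd

# Illustrative semicolon-separated content (values are made up).
text = "sample_nr;depth;benzene\n-;m;ug/L\ns01;-6;263\ns02;-7;2\n"

# First attempt with the default comma separator: every line ends up
# in a single column because no commas are present.
data = pd.read_csv(io.StringIO(text))
if ";" in str(data.iloc[1].iloc[0]):
    # Re-read with the semicolon separator, as load_csv does.
    data = pd.read_csv(io.StringIO(text), sep=";")

# The first row holds the units; keep only that row.
units = data.iloc[[0]]

print(units)
```

As in `load_csv`, the returned `data` still contains the units row, so the columns are not numeric until that row is removed.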

load_excel(file_path=None, sheet_name=0, verbose=False, store_provenance=False, **kwargs)

Function to load data from excel file.


file_path: str
    Name of the path to the file
sheet_name: int
    Number of the sheet in the excel file to load
verbose: Boolean
    verbose flag
store_provenance: Boolean
    To add!
**kwargs: optional keyword arguments to pass to pandas' routine
    read_excel()

data: pd.DataFrame
    Tabular data
units: pd.DataFrame
    Tabular data on units

ValueError: If `file_path` is not specified
OSError: If `file_path` is not a valid file location
Example:

This function can be called with the file path of the example data as argument using:

>>> from mibiscreen.data import load_excel
>>> data, units = load_excel("example_data.xlsx")
Source code in mibiscreen/data/load_data.py
def load_excel(
        file_path = None,
        sheet_name = 0,
        verbose = False,
        store_provenance = False,
        **kwargs,
        ):
    """Function to load data from excel file.

    Args:
    -------
        file_path: str
            Name of the path to the file
        sheet_name: int
            Number of the sheet in the excel file to load
        verbose: Boolean
            verbose flag
        store_provenance: Boolean
            To add!
        **kwargs: optional keyword arguments to pass to pandas' routine
            read_excel()

    Returns:
    -------
        data: pd.DataFrame
            Tabular data
        units: pd.DataFrame
            Tabular data on units

    Raises:
    -------
        ValueError: If `file_path` is not specified
        OSError: If `file_path` is not a valid file location

    Example:
    -------
       This function can be called with the file path of the example data as
       argument using:

        >>> from mibiscreen.data import load_excel
        >>> data, units = load_excel("example_data.xlsx")

    """
    if file_path is None:
        raise ValueError('Specify file path and file name!')
    if not os.path.isfile(file_path):
        raise OSError('Cannot access file at : ',file_path)

    data = pd.read_excel(file_path,
                         sheet_name = sheet_name,
                         **kwargs)
    if ";" in data.iloc[1].iloc[0]:
        data = pd.read_excel(file_path,
                             sep=";",
                             sheet_name = sheet_name,
                             **kwargs)

    units = data.drop(labels = np.arange(1,data.shape[0]))

    if verbose:
        print('==============================================================')
        print(" Running function 'load_excel()' on data file ", file_path)
        print('==============================================================')
        print("Unit of quantities:")
        print('-------------------')
        print(units)
        print('________________________________________________________________')
        print("Loaded data as pandas DataFrame:")
        print('--------------------------------')
        print(data)
        print('================================================================')

    return data, units

set_data

Functions for data extraction and merging in preparation of analysis and plotting.

@author: Alraune Zech

compare_lists(list1, list2, verbose=False)

Checking overlap of two given lists.

Input
list1: list of strings
    given extensive list (usually column names of a pd.DataFrame)
list2: list of strings
    list of names to extract/check overlap with strings in list 'list1'
verbose: Boolean, default False
    verbosity flag
Output
(intersection, remainder_list1, remainder_list2): tuple of lists
    * intersection: list of strings present in both lists 'list1' and 'list2'
    * remainder_list1: list of strings only present in 'list1'
    * remainder_list2: list of strings only present in 'list2'
Example:

list1 = ['test1','test2']
list2 = ['test1','test3']

(['test1'],['test2'],['test3']) = compare_lists(list1,list2)

Source code in mibiscreen/data/set_data.py
def compare_lists(list1,
                  list2,
                  verbose = False,
                  ):
    """Checking overlap of two given lists.

    Input
    -----
        list1: list of strings
            given extensive list (usually column names of a pd.DataFrame)
        list2: list of strings
            list of names to extract/check overlap with strings in list 'list1'
        verbose: Boolean, default False
            verbosity flag

    Output
    ------
        (intersection, remainder_list1, remainder_list2): tuple of lists
            * intersection: list of strings present in both lists 'list1' and 'list2'
            * remainder_list1: list of strings only present in 'list1'
            * remainder_list2: list of strings only present in 'list2'

    Example:
    -------
    list1 = ['test1','test2']
    list2 = ['test1','test3']

    (['test1'],['test2'],['test3']) = compare_lists(list1,list2)

    """
    intersection = list(set(list1) & set(list2))
    remainder_list1 = list(set(list1) - set(list2))
    remainder_list2 = list(set(list2) - set(list1))

    if verbose:
        print('================================================================')
        print(" Running function 'compare_lists()'")
        print('================================================================')
        print("strings present in both lists:", intersection)
        print("strings only present in either of the lists:", remainder_list1 +  remainder_list2)

    return (intersection,remainder_list1,remainder_list2)
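
The set arithmetic behind `compare_lists` can be reproduced in a few lines. This is a minimal sketch; the order-preserving variant at the end is an assumption added for illustration and is not what the function itself does.

```python
list1 = ["test1", "test2"]
list2 = ["test1", "test3"]

# Same set operations as in compare_lists().
intersection = list(set(list1) & set(list2))
remainder_list1 = list(set(list1) - set(list2))
remainder_list2 = list(set(list2) - set(list1))

print(intersection, remainder_list1, remainder_list2)
# ['test1'] ['test2'] ['test3']

# Sets do not preserve order. If the original order of list1 matters,
# a list comprehension keeps it (illustrative variant only).
lookup = set(list2)
ordered_intersection = [name for name in list1 if name in lookup]
```

Because the results are built from sets, lists with several elements may come back in a different order than they appear in the inputs, and duplicates are collapsed.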

determine_quantities(cols, name_list='all', verbose=False)

Select a subset of column names (from DataFrame).

Input
cols: list
    Names of quantities (column names) from pd.DataFrame
name_list: str or list of str, default is 'all'
    quantities to extract from column names.

    If a list of strings is provided, these will be selected from the list of column names (cols)
    If a string is provided, this is a short name for a specific group of quantities:
        - 'all' (all quantities given in data frame except settings)
        - short name for group of contaminants:
            - 'BTEX' (for benzene, toluene, ethylbenzene, xylene)
            - 'BTEXIIN' (for benzene, toluene, ethylbenzene, xylene,
                          indene, indane and naphthalene)
            - 'all_cont' (for all contaminants in name list)
        - short name for group of environmental parameters/geochemicals:
            - 'environmental_conditions'
            - 'geochemicals'
            - 'ONS':  non reduced electron acceptors (oxygen, nitrate, sulfate)
            - 'ONSFe': selected electron acceptors  (oxygen, nitrate, sulfate + iron II)
            - 'all_ea': all potential electron acceptors (non reduced & reduced)
            - 'NP': nutrients (nitrate, nitrite, phosphate)
        See also file mibiscreen/data/name_data for lists of quantities
verbose: Boolean
    verbose flag (default False)
Output
quantities: list
    list of strings with names of selected quantities present in dataframe
remainder: list
    list of strings with names of selected quantities not present in dataframe
Source code in mibiscreen/data/set_data.py
def determine_quantities(cols,
         name_list = 'all',
         verbose = False,
         ):
    """Select a subset of column names (from DataFrame).

    Input
    -----
        cols: list
            Names of quantities (column names) from pd.DataFrame
        name_list: str or list of str, default is 'all'
            quantities to extract from column names.

            If a list of strings is provided, these will be selected from the list of column names (cols)
            If a string is provided, this is a short name for a specific group of quantities:
                - 'all' (all quantities given in data frame except settings)
                - short name for group of contaminants:
                    - 'BTEX' (for benzene, toluene, ethylbenzene, xylene)
                    - 'BTEXIIN' (for benzene, toluene, ethylbenzene, xylene,
                                  indene, indane and naphthalene)
                    - 'all_cont' (for all contaminants in name list)
                - short name for group of environmental parameters/geochemicals:
                    - 'environmental_conditions'
                    - 'geochemicals'
                    - 'ONS':  non reduced electron acceptors (oxygen, nitrate, sulfate)
                    - 'ONSFe': selected electron acceptors  (oxygen, nitrate, sulfate + iron II)
                    - 'all_ea': all potential electron acceptors (non reduced & reduced)
                    - 'NP': nutrients (nitrate, nitrite, phosphate)
                See also file mibiscreen/data/name_data for lists of quantities
        verbose: Boolean
            verbose flag (default False)

    Output
    ------
        quantities: list
            list of strings with names of selected quantities present in dataframe
        remainder: list
            list of strings with names of selected quantities not present in dataframe

    """
    if name_list == 'all':
        ### choosing all column names except those of settings
        list_names = list(set(cols) - set(sample_settings))
        if verbose:
            print("Selecting all data columns except for those with settings.")

    elif isinstance(name_list, str):
        if name_list in contaminant_groups.keys():
            verbose_text = "Selecting specific group of contaminants:"
            list_names = contaminant_groups[name_list].copy()
            if (names.name_o_xylene in cols) and (names.name_pm_xylene in cols):
                list_names.remove(names.name_xylene) # handling of xylene isomers

        elif name_list in environment_groups.keys():
            verbose_text = "Selecting specific group of geochemicals:"
            list_names = environment_groups[name_list].copy()

        else:
            verbose_text = "Selecting single quantity:"
            list_names = [name_list]

        if verbose:
            print(verbose_text,*list_names,sep='\n')

    elif isinstance(name_list, list): # choosing specific list of column names except those of settings
        if not all(isinstance(item, str) for item in name_list):
            raise ValueError("Keyword 'name_list' needs to be a string or a list of strings.")
        list_names = name_list
        if verbose:
            print("Selecting all names from provided list.")

    else:
        raise ValueError("Keyword 'name_list' needs to be a string or a list of strings.")

    quantities,_,remainder_list2 = compare_lists(cols,list_names)

    if not quantities:
        raise ValueError("No quantities from name list '{}' provided in data.\
                         Presumably data not in standardized format. \
                         Run 'standardize()' first.".format(name_list))

    if verbose:
        print("Selected set of quantities: ", *quantities,sep='\n')

    if remainder_list2:
        print("WARNING: quantities from name list not in data:", *remainder_list2,sep='\n')
        print("Maybe data not in standardized format. Run 'standardize()' first.")
        print("_________________________________________________________________")


    return quantities,remainder_list2
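
How a `name_list` argument is resolved into present and missing column names can be sketched in plain Python. The group table, the settings list and the helper name `resolve_names` below are hypothetical stand-ins; the real lists live in mibiscreen's settings modules and may differ, and the xylene isomer handling is left out.

```python
# Hypothetical group table and settings list (assumptions for illustration).
contaminant_groups = {"BTEX": ["benzene", "toluene", "ethylbenzene", "xylene"]}
sample_settings = ["sample_nr", "obs_well", "depth"]


def resolve_names(cols, name_list="all"):
    """Return (present, missing) names, loosely mirroring determine_quantities()."""
    if name_list == "all":
        # Everything except the setting columns.
        wanted = [c for c in cols if c not in sample_settings]
    elif isinstance(name_list, str):
        # Known group name, otherwise treat the string as a single quantity.
        wanted = list(contaminant_groups.get(name_list, [name_list]))
    elif isinstance(name_list, list) and all(isinstance(n, str) for n in name_list):
        wanted = name_list
    else:
        raise ValueError("name_list must be a string or a list of strings")

    present = [n for n in wanted if n in cols]
    missing = [n for n in wanted if n not in cols]
    if not present:
        raise ValueError("No quantities from name list found in data")
    return present, missing


cols = ["sample_nr", "depth", "benzene", "toluene", "sulfate"]
print(resolve_names(cols, "BTEX"))
# (['benzene', 'toluene'], ['ethylbenzene', 'xylene'])
```

Unlike the set-based original, this sketch keeps the order of the requested names, which makes the output reproducible.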

extract_data(data_frame, name_list, keep_setting_data=True, verbose=False)

Extracting data of specified variables from dataframe.


data_frame: pd.DataFrame
    dataframe with the measurements
name_list: list of strings
    list of column names to extract from dataframe
keep_setting_data: bool, default True
    Whether to keep setting data in the DataFrame.
verbose: Boolean
    verbose flag (default False)

data: pd.DataFrame
    dataframe with the measurements
Raises:

None (yet).

Example:

To be added.

Source code in mibiscreen/data/set_data.py
def extract_data(data_frame,
                 name_list,
                 keep_setting_data = True,
                 verbose = False,
                 ):
    """Extracting data of specified variables from dataframe.

    Args:
    -------
        data_frame: pd.DataFrame
            dataframe with the measurements
        name_list: list of strings
            list of column names to extract from dataframe
        keep_setting_data: bool, default True
            Whether to keep setting data in the DataFrame.
        verbose: Boolean
            verbose flag (default False)

    Returns:
    -------
        data: pd.DataFrame
            dataframe with the measurements

    Raises:
    -------
    None (yet).

    Example:
    -------
    To be added.

    """
    ### check on correct data input format and extracting column names as list
    data,cols= check_data_frame(data_frame,inplace = False)

    quantities, _ = determine_quantities(cols,
                                      name_list = name_list,
                                      verbose = verbose)

    if keep_setting_data:
        settings,_,_ = compare_lists(cols,sample_settings)
        i1,quantities_without_settings,_ = compare_lists(quantities,settings)
        columns_names = settings + quantities_without_settings

    else:
        columns_names = quantities

    return data[columns_names]
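
The column selection performed with `keep_setting_data=True` can be sketched as follows. The settings list and the data values are hypothetical, and the list comprehensions preserve column order, whereas the listing above relies on set operations.

```python
import pandas as pd

# Hypothetical list of setting columns (assumption for illustration).
sample_settings = ["sample_nr", "obs_well", "depth"]

data = pd.DataFrame({
    "sample_nr": ["s01", "s02"],
    "depth": [-6.0, -7.0],
    "benzene": [263.0, 2.0],
    "sulfate": [23.0, 64.0],
})
quantities = ["benzene"]  # names selected for extraction

# Setting columns present in the data come first,
# followed by the requested quantities that are not settings.
settings = [c for c in data.columns if c in sample_settings]
columns_names = settings + [q for q in quantities if q not in settings]
subset = data[columns_names]

print(subset)
```

With `keep_setting_data=False`, only the columns in `quantities` would be returned.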

extract_settings(data_frame, verbose=False)

Extracting settings data from dataframe.


data_frame: pd.DataFrame
    dataframe with the measurements
verbose: Boolean
    verbose flag (default False)

data: pd.DataFrame
    dataframe with settings
Raises:

None (yet).

Example:

To be added.

Source code in mibiscreen/data/set_data.py
def extract_settings(data_frame,
                     verbose = False,
                     ):
    """Extracting settings data from dataframe.

    Args:
    -------
        data_frame: pd.DataFrame
            dataframe with the measurements
        verbose: Boolean
            verbose flag (default False)

    Returns:
    -------
        data: pd.DataFrame
            dataframe with settings

    Raises:
    -------
    None (yet).

    Example:
    -------
    To be added.

    """
    ### check on correct data input format and extracting column names as list
    data,cols= check_data_frame(data_frame,inplace = False)

    settings,r1,r2 = compare_lists(cols,sample_settings)

    if verbose:
        print("Settings available in data: ", settings)

    return data[settings]

merge_data(data_frames_list, how='outer', on=[names.name_sample], clean=True, **kwargs)

Merging dataframes along columns on similar sample name.


data_frames_list: list of pd.DataFrame
    list of dataframes with the measurements
how: str, default 'outer'
    Type of merge to be performed.
    corresponds to keyword in pd.merge()
    {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘outer’
on: list, default "sample_nr"
    Column name(s) to join on.
    corresponds to keyword in pd.merge()
clean: Boolean, default True
    Whether to drop, from each added data frame, the columns it shares with
    the already merged data that are not merge keys (e.g. settings other than the sample name)
**kwargs: dict
    optional keyword arguments to be passed to pd.merge()

data: pd.DataFrame
    dataframe with the measurements
Raises:

None (yet).

Example:

To be added.

Source code in mibiscreen/data/set_data.py
def merge_data(data_frames_list,
               how='outer',
               on=[names.name_sample],
               clean = True,
               **kwargs,
               ):
    """Merging dataframes along columns on similar sample name.

    Args:
    -------
        data_frames_list: list of pd.DataFrame
            list of dataframes with the measurements
        how: str, default 'outer'
            Type of merge to be performed.
            corresponds to keyword in pd.merge()
            {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘outer’
        on: list, default "sample_nr"
            Column name(s) to join on.
            corresponds to keyword in pd.merge()
        clean: Boolean, default True
            Whether to drop, from each added data frame, the columns it shares with
            the already merged data that are not merge keys (e.g. settings other than the sample name)
        **kwargs: dict
            optional keyword arguments to be passed to pd.merge()

    Returns:
    -------
        data: pd.DataFrame
            dataframe with the measurements

    Raises:
    -------
    None (yet).

    Example:
    -------
    To be added.

    """
    if len(data_frames_list)<2:
        raise ValueError('Provide List of DataFrames.')


    data_merge = data_frames_list[0]
    for data_add in data_frames_list[1:]:
        if clean:
            intersection,remainder_list1,remainder_list2 = compare_lists(
                data_merge.columns.to_list(),data_add.columns.to_list())
            intersection,remainder_list1,remainder_list2 = compare_lists(intersection,on)
            data_add = data_add.drop(labels = remainder_list1+remainder_list2,axis = 1)
        data_merge = pd.merge(data_merge,data_add, how=how, on=on,**kwargs)
        # complete data set, where values of porosity are added (otherwise nan)

    return data_merge
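
The effect of `clean=True` can be reproduced with plain pandas calls. This is a minimal sketch with made-up frames and the assumed key column `sample_nr`: shared columns that are not merge keys are dropped from the frame being added before the outer merge.

```python
import pandas as pd

# Illustrative frames; both carry the setting column "depth".
env = pd.DataFrame({"sample_nr": ["s01", "s02"],
                    "depth": [-6.0, -7.0],
                    "sulfate": [23.0, 64.0]})
cont = pd.DataFrame({"sample_nr": ["s02", "s03"],
                     "depth": [-7.0, -8.0],
                     "benzene": [2.0, 5.0]})
on = ["sample_nr"]

# Columns present in both frames, excluding the merge key(s).
shared = set(env.columns) & set(cont.columns)
to_drop = sorted(shared - set(on))

# Drop them from the added frame, then merge on the key.
merged = pd.merge(env, cont.drop(columns=to_drop), how="outer", on=on)

print(merged)
```

One consequence worth noting: sample `s03` appears only in the second frame, so its `depth` value is lost and shows up as NaN in the merged result.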

settings

mibiscreen module for settings of data.

contaminants

Specifications of petroleum hydrocarbon related contaminants.

List of (PAH) contaminants measured in groundwater samples useful for biodegradation and bioremediation analysis

@author: A. Zech

environment

Specifications of geochemicals.

List of geochemicals measured in groundwater samples useful for biodegradation and bioremediation analysis

@author: A. Zech

isotopes

Specifications of isotopes.

List of basic isotopes measured in groundwater samples useful for biodegradation and bioremediation analysis

@author: A. Zech

metabolites

Specifications of metabolites.

List of basic metabolites measured in groundwater samples useful for biodegradation and bioremediation analysis

@author: A. Zech

sample_settings

Specifications of sample settings.

List of all quantities and parameters characterizing sample specifics for measurements in groundwater samples useful for biodegradation and bioremediation analysis

@author: A. Zech

standard_names

Name specifications of data!

Listed are all standard names of quantities and parameters measured in groundwater samples useful for biodegradation and bioremediation analysis

@author: A. Zech

unit_settings

Unit specifications of data!

File containing unit specifications of quantities and parameters measured in groundwater samples useful for biodegradation and bioremediation analysis.

@author: Alraune Zech