
mibiscreen.data API reference

mibiscreen module for data handling.

check_data

Functions for data handling and standardization.

@author: Alraune Zech

check_columns(data_frame, standardize=False, reduce=False, verbose=True)

Check the column names of a data frame.

Looks at the column names and links them to standard names. Optionally, it renames identified columns to the standard names of the model and drops columns that could not be identified.


Args:

data_frame: pd.DataFrame
    dataframe with the measurements
standardize: Boolean, default False
    Whether to standardize identified column names
reduce: Boolean, default False
    Whether to reduce data to known quantities
verbose: Boolean, default True
    verbosity flag

Returns:

tuple: three lists containing
        list with identified quantities in data (original, not standardized names)
        list with unknown quantities in data (not in list of standardized names)
        list with standard names of identified quantities
Raises:

None (yet).

Example:

Todo’s:
- complete list of potential contaminants, environmental factors
- add name check for metabolites?
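
The renaming and reduction steps described above can be sketched without pandas. This is a minimal illustration, not the library code: the `transform` mapping and all column names are made up and only stand in for what `standard_names` derives from the package's name dictionary.

```python
# Hypothetical mapping from raw column names to standard names
# (in mibiscreen this mapping is derived by standard_names()).
transform = {"Benzeen": "benzene", "PH": "pH"}

columns = ["Benzeen", "PH", "lab_comment"]

# Split the columns into identified and unknown ones.
known = [c for c in columns if c in transform]
unknown = [c for c in columns if c not in transform]
standard = [transform[c] for c in known]

# standardize=True: rename identified columns, keep the others unchanged.
renamed = [transform.get(c, c) for c in columns]

# reduce=True: additionally drop the columns that were not identified.
reduced = [c for c in renamed if c not in unknown]

print((known, unknown, standard))
print(renamed)
print(reduced)
```

The tuple printed first mirrors the order of the three returned lists: identified names, unknown names, standard names.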

Source code in mibiscreen/data/check_data.py
def check_columns(data_frame,
                  standardize = False,
                  reduce = False,
                  verbose = True):
    """Function checking names of columns of data frame.

    Function that looks at the column names and links it to standard names.
    Optionally, it renames identified column names to the standard names of the model.

    Args:
    -------
        data_frame: pd.DataFrame
            dataframe with the measurements
        standardize: Boolean, default False
            Whether to standardize identified column names
        reduce: Boolean, default False
            Whether to reduce data to known quantities
        verbose: Boolean, default True
            verbosity flag

    Returns:
    -------
        tuple: three lists containing
                list with identified quantities in data (original, not standardized names)
                list with unknown quantities in data (not in list of standardized names)
                list with standard names of identified quantities

    Raises:
    -------
    None (yet).

    Example:
    -------
    Todo's:
        - complete list of potential contaminants, environmental factors
        - add name check for metabolites?
    """
    if verbose:
        print('==============================================================')
        print(" Running function 'check_columns()' on data")
        print('==============================================================')

    data,cols= check_data_frame(data_frame,
                                sample_name_to_index = False,
                                inplace = True)

    results = standard_names(cols,
                             standardize = False,
                             reduce = False,
                             verbose = False,
                             )

    column_names_standard = results[0]
    column_names_known = results[1]
    column_names_unknown = results[2]
    column_names_transform = results[3]

    if standardize:
        data.columns = [column_names_transform.get(x, x) for x in data.columns]

    if reduce:
        data.drop(labels = column_names_unknown,axis = 1,inplace=True)

    if verbose:
        print("{} quantities identified in provided data.".format(len(column_names_known)))
        print("List of names with standard names:")
        print('----------------------------------')
        for i,name in enumerate(column_names_known):
            print(name," --> ",column_names_standard[i])
        print('----------------------------------')
        if standardize:
            print("Identified column names have been standardized")
        else:
            print("\nRenaming can be done by setting keyword 'standardize' to True.\n")
        print('________________________________________________________________')
        print("{} quantities have not been identified in provided data:".format(len(column_names_unknown)))
        print('---------------------------------------------------------')
        for i,name in enumerate(column_names_unknown):
            print(name)
        print('---------------------------------------------------------')
        if reduce:
            print("Not identified quantities have been removed from data frame")
        else:
            print("\nReduction to known quantities can be done by setting keyword 'reduce' to True.\n")
        print('================================================================')

    return (column_names_known,column_names_unknown,column_names_standard)

check_data_frame(data_frame, sample_name_to_index=False, inplace=False)

Check that the data is in the correct format.

Tests if provided data is a pandas data frame and provides column names. Optionally it sets the sample name as index.

Input
data_frame: pd.DataFrame
    quantities for data analysis given per sample
sample_name_to_index:  Boolean, default False
    Whether to set the sample name to the index of the DataFrame
inplace: Boolean, default False
    Whether to modify the DataFrame rather than creating a new one.
Output
data: pd.DataFrame
    the given dataframe (a copy unless inplace is True), with the index set to the sample name if requested
cols: list
    List of column names
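
The effect of the `inplace` flag can be illustrated without pandas. The sketch below is not the library code: a plain dictionary of columns stands in for the DataFrame, and `check_table` and the column names are hypothetical.

```python
import copy

def check_table(table, inplace=False):
    """Validate the container type and return (data, column names)."""
    if not isinstance(table, dict):
        raise ValueError(
            "Data has to be a dict but is given as type {}".format(type(table)))
    # Work on a deep copy unless the caller asks for in-place modification.
    data = table if inplace else copy.deepcopy(table)
    return data, list(data.keys())

table = {"sample_nr": ["2000-001"], "benzene": [263.0]}
copied, cols = check_table(table)
same, _ = check_table(table, inplace=True)
print(cols, copied is table, same is table)
```

With the default `inplace=False`, later changes to the returned object leave the caller's table untouched.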
Source code in mibiscreen/data/check_data.py
def check_data_frame(data_frame,
                     sample_name_to_index = False,
                     inplace = False,
                     ):
    """Checking data on correct format.

    Tests if provided data is a pandas data frame and provides column names.
    Optionally it sets the sample name as index.

    Input
    -----
        data_frame: pd.DataFrame
            quantities for data analysis given per sample
        sample_name_to_index:  Boolean, default False
            Whether to set the sample name to the index of the DataFrame
        inplace: Boolean, default False
            Whether to modify the DataFrame rather than creating a new one.

    Output
    ------
        data: pd.DataFrame
            copy of given dataframe with index set to sample name
        cols: list
            List of column names
    """
    if not isinstance(data_frame, pd.DataFrame):
        raise ValueError("Data has to be a pandas DataFrame "
                         "but is given as type {}".format(type(data_frame)))

    if inplace is False:
        data = data_frame.copy()
    else:
        data = data_frame

    if sample_name_to_index:
        if names.name_sample not in data.columns:
            print("Warning: No sample name provided for making index. Consider standardizing data first")
        else:
            data.set_index(names.name_sample,inplace = True)

    # data is guaranteed to be a DataFrame here (see type check above)
    cols = data.columns.to_list()

    return data, cols

check_units(data, verbose=True)

Function to check the units of the measurements.


Args:

data: pandas.DataFrame
    dataframe with the measurements where first row contains
    the units or a dataframe with only the column names and units
verbose: Boolean
    verbose statement (default True)

Returns:

col_check_list: list
    quantities whose units need checking/correction

Raises:

ValueError: if data is not a data frame or if no units are detected in its first row.

Example:

To be added.
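
As an illustration of the unit check, here is a minimal pure-Python sketch. It is not the library implementation: `accepted_units`, `required_unit_type` and the unit spellings are hypothetical stand-ins for the package's `standard_units` and name lists.

```python
# Hypothetical accepted spellings per unit type (stand-in for standard_units).
accepted_units = {
    "mgperl": ["mg/l", "ppm"],
    "microgperl": ["ug/l", "micro g/l"],
}
# Hypothetical assignment of quantities to the unit type they must use.
required_unit_type = {"sulfate": "mgperl", "benzene": "microgperl"}

def find_unit_mismatches(unit_row):
    """Return the quantities whose unit is not an accepted spelling."""
    to_check = []
    for quantity, unit in unit_row.items():
        unit_type = required_unit_type.get(quantity)
        if unit_type is None:
            continue  # quantity not known, nothing to compare against
        # Comparison is case-insensitive, as in the source below.
        if str(unit).lower() not in accepted_units[unit_type]:
            to_check.append(quantity)
    return to_check

unit_row = {"sulfate": "mg/L", "benzene": "mg/L", "comment": ""}
print(find_unit_mismatches(unit_row))
```

Here benzene is flagged because it is given in milligram per liter, while the sketch (like the source) expects contaminants in microgram per liter.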
Source code in mibiscreen/data/check_data.py
def check_units(data,
                verbose = True):
    """Function to check the units of the measurements.

    Args:
    -------
        data: pandas.DataFrame
            dataframe with the measurements where first row contains
            the units or a dataframe with only the column names and units
        verbose: Boolean
            verbose statement (default True)

    Returns:
    -------
        col_check_list: list
            quantities whose units need checking/correction

    Raises:
    -------
        ValueError: if data is not a data frame or if no units are detected in its first row.

    Example:
    -------
        To be added.
    """
    if verbose:
        print('================================================================')
        print(" Running function 'check_units()' on data")
        print('================================================================')

    if not isinstance(data, pd.DataFrame):
        raise ValueError("Provided data is not a data frame.")
    elif data.shape[0]>1:
        units = data.drop(labels = np.arange(1,data.shape[0]))
    else:
        units = data.copy()

    units_in_data = set(map(lambda x: str(x).lower(), units.iloc[0,:].values))
    ### testing if provided data frame contains any unit
    test_unit = False
    for u in all_units:
        if u in units_in_data:
            test_unit = True
            break
    if not test_unit:
        raise ValueError("Error: The second line in the dataframe is supposed "
                         "to specify the units. No units were detected in this "
                         "line, check https://mibipret.github.io/mibiscreen/ Data "
                         "documentation.")

    # standardize column names (as it might not have happened for data yet)
    check_columns(units,standardize = True, verbose = False)
    col_check_list= []

    for quantity in units.columns:
        if quantity in names.chemical_composition:
            if str(units[quantity][0]).lower() not in standard_units['mgperl']:
                col_check_list.append(quantity)
                if verbose:
                    print("Warning: Check unit of {}!\n Given in {}, but must be milligram per liter (e.g. {})."
                              .format(quantity,units[quantity][0],standard_units['mgperl'][0]))

        if quantity in names.contaminants['all_cont']:
            if str(units[quantity][0]).lower() not in standard_units['microgperl']:
                col_check_list.append(quantity)
                if verbose:
                    print("Warning: Check unit of {}!\n Given in {}, but must be microgram per liter (e.g. {})."
                              .format(quantity,units[quantity][0],standard_units['microgperl'][0]))

        if quantity in list(units_env_cond.keys()):
            unit_type = units_env_cond[quantity]
            if str(units[quantity][0]).lower() not in standard_units[unit_type]:
                col_check_list.append(quantity)
                if verbose:
                    print("Warning: Check unit of {}!\n Given in {}, but must be in {} (e.g. {}).".format(
                            quantity,units[quantity][0],unit_type,standard_units[unit_type][0]))

        if quantity.split('-')[0] in names.isotopes:
            if str(units[quantity][0]).lower() not in standard_units['permil']:
                col_check_list.append(quantity)
                if verbose:
                    print("Warning: Check unit of {}!\n Given in {}, but must be per mille (e.g. {})."
                              .format(quantity,units[quantity][0],standard_units['permil'][0]))

        if quantity in names.metabolites:
            if str(units[quantity][0]).lower() not in standard_units['microgperl']:
                col_check_list.append(quantity)
                if verbose:
                    print("Warning: Check unit of {}!\n Given in {}, but must be microgram per liter (e.g. {})."
                              .format(quantity,units[quantity][0],standard_units['microgperl'][0]))

    if verbose:
        print('________________________________________________________________')
        if len(col_check_list) == 0:
            print(" All identified quantities given in requested units.")
        else:
            print(" All other identified quantities given in requested units.")
        print('================================================================')

    return col_check_list

check_values(data_frame, inplace=False, verbose=True)

Function that checks on value types and replaces non-measured values.


Args:

data_frame: pandas.DataFrame
    dataframe with the measurements (without first row of units)
inplace: Boolean, default False
    Whether to modify the DataFrame rather than creating a new one.
verbose: Boolean
    verbose statement (default True)

Returns:

data_pure: pandas.DataFrame
    Tabular data with standard column names and without units

Raises:

None (yet).

Example:

To be added.
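
The replace-then-convert logic can be sketched in plain Python. This is an illustration under assumptions, not the library code: the `placeholders` set is hypothetical (the package defines its own list), and non-measured entries are assumed to become NaN.

```python
import math

# Hypothetical placeholders for "not measured" (stand-in for to_replace_list).
placeholders = {"-", "--", "", "na", "n.a."}

def to_numeric_column(values):
    """Convert a column to floats, mapping placeholders to NaN.

    Raises ValueError if a remaining entry is not numeric.
    """
    cleaned = []
    for value in values:
        if isinstance(value, str) and value.strip().lower() in placeholders:
            cleaned.append(math.nan)
        else:
            cleaned.append(float(value))
    return cleaned

table = {
    "benzene": ["263", "-", "853"],
    "obs_well": ["B-MLS1-3-12", "B-MLS1-5-15", "B-MLS1-6-17"],
}
transformed = []
for name, column in table.items():
    try:
        table[name] = to_numeric_column(column)
        transformed.append(name)
    except ValueError:
        # Columns with text entries are reported and left unchanged.
        print("Could not transform '{}' to numerical values".format(name))

print(transformed)
```

Columns that hold identifiers, such as the well name, cannot be converted and keep their original values.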
Source code in mibiscreen/data/check_data.py
def check_values(data_frame,
                 inplace = False,
                 verbose = True,
                 ):
    """Function that checks on value types and replaces non-measured values.

    Args:
    -------
        data_frame: pandas.DataFrame
            dataframe with the measurements (without first row of units)
        inplace: Boolean, default False
            Whether to modify the DataFrame rather than creating a new one.
        verbose: Boolean
            verbose statement (default True)

    Returns:
    -------
        data_pure: pandas.DataFrame
            Tabular data with standard column names and without units

    Raises:
    -------
        None (yet).

    Example:
    -------
        To be added.
    """
    if verbose:
        print('================================================================')
        print(" Running function 'check_values()' on data")
        print('================================================================')

    data,cols= check_data_frame(data_frame, inplace = inplace)

    ### testing if provided data frame contains first row with units
    for u in data.iloc[0].to_list():
        if u in all_units:
            print("WARNING: First row identified as units, has been removed for value check")
            print('________________________________________________________________')
            data.drop(labels = 0,inplace = True)
            break

    for sign in to_replace_list:
        data.iloc[:,:] = data.iloc[:,:].replace(to_replace=sign, value=to_replace_value)

    # standardize column names (as it might not have happened for data yet)
    # check_columns(data,
    #               standardize = True,
    #               check_metabolites=True,
    #               verbose = False)

    # transform data to numeric values
    quantities_transformed = []
    for quantity in cols: #data.columns:
        try:
            # data_pure.loc[:,quantity] = pd.to_numeric(data_pure.loc[:,quantity])
            data[quantity] = pd.to_numeric(data[quantity])
            quantities_transformed.append(quantity)
        except ValueError:
            print("WARNING: Could not transform '{}' to numerical values".format(quantity))
            print('________________________________________________________________')
    if verbose:
        print("Quantities with values transformed to numerical (int/float):")
        print('-----------------------------------------------------------')
        for name in quantities_transformed:
            print(name)
        print('================================================================')

    return data

standard_names(name_list, standardize=True, reduce=False, verbose=False)

Function transforming list of names to standard names.

Function that looks at the names (of e.g. environmental variables, contaminants, metabolites, isotopes, etc) and provides the corresponding standard names.


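Args:

name_list: string or list of strings
    names of quantities to be transformed to standard
standardize: Boolean, default True
    Whether to return the standardized names only (True) or the detailed lookup results (False)
reduce: Boolean, default False
    Whether to reduce the returned names to known quantities (only used if standardize is True)
verbose: Boolean, default False
    verbosity flag

Returns:

if standardize is True:
    list with standard names of identified quantities, followed by the
    unknown names unless reduce is True
if standardize is False:
    tuple of
        list with standard names of identified quantities
        list with identified quantities in data (original, not standardized names)
        list with unknown quantities in data (not in list of standardized names)
        dictionary mapping identified names to their standard names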
Raises:

None (yet).

Example:

Todo’s:
- complete list of potential contaminants, environmental factors
- add name check for metabolites?
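
The name lookup, including the special handling of isotope names written as `<isotope>-<molecule>`, can be sketched as follows. This is a simplified illustration, not the library code: both dictionaries and all raw names are hypothetical, and a single table is used for molecules and other quantities.

```python
# Hypothetical lookup tables (lower-case keys), standing in for the
# dictionaries in the package's names module.
quantity_names = {"benzene": "benzene", "benzeen": "benzene", "ph": "pH"}
isotope_names = {"delta_13c": "delta_13C", "d13c": "delta_13C"}

def to_standard(name):
    """Return the standard name for `name`, or None if it is unknown."""
    prefix = name.split("-")[0]
    isotope = isotope_names.get(prefix.lower())
    if isotope is not None:
        # Isotope columns are written as '<isotope>-<molecule>'.
        molecule = quantity_names.get(name[len(prefix) + 1:].lower())
        return None if molecule is None else isotope + "-" + molecule
    # All other names are looked up case-insensitively.
    return quantity_names.get(name.lower())

for raw in ["Benzeen", "d13C-Benzene", "d13C-unknownium", "lab_comment"]:
    print(raw, "-->", to_standard(raw))
```

An isotope name is only identified when both the isotope prefix and the molecule are known, which matches the branch structure of the source below.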

Source code in mibiscreen/data/check_data.py
def standard_names(name_list,
                   standardize = True,
                   reduce = False,
                   verbose = False,
                   ):
    """Function transforming list of names to standard names.

    Function that looks at the names (of e.g. environmental variables, contaminants,
    metabolites, isotopes, etc) and provides the corresponding standard names.

    Args:
    -------
        name_list: string or list of strings
            names of quantities to be transformed to standard
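        standardize: Boolean, default True
            Whether to return the standardized names only (True)
            or the detailed lookup results (False)
        reduce: Boolean, default False
            Whether to reduce the returned names to known quantities
            (only used if standardize is True)
        verbose: Boolean, default False
            verbosity flag

    Returns:
    -------
        if standardize is True:
            list with standard names of identified quantities, followed by
            the unknown names unless reduce is True
        if standardize is False:
            tuple of
                list with standard names of identified quantities
                list with identified quantities in data (original, not standardized names)
                list with unknown quantities in data (not in list of standardized names)
                dictionary mapping identified names to their standard names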

    Raises:
    -------
    None (yet).

    Example:
    -------
    Todo's:
        - complete list of potential contaminants, environmental factors
        - add name check for metabolites?
    """
    names_standard = []
    names_known = []
    names_unknown = []
    names_transform = {}

    dict_names = names.col_dict.copy()

    if isinstance(name_list, str):
        name_list = [name_list]
    elif isinstance(name_list, list):
        for name in name_list:
            if not isinstance(name, str):
                raise ValueError("Entry in provided list of names is not a string:", name)

    for x in name_list:
        y = dict_names.get(x, False)
        x_isotope = x.split('-')[0]
        y_isotopes = names.names_isotopes.get(x_isotope.lower(), False)

        if y_isotopes is not False:
            x_molecule = x.removeprefix(x_isotope+'-')
            y_molecule = names.names_contaminants.get(x_molecule.lower(), False)
            if y_molecule is False:
                names_unknown.append(x)
            else:
                y = y_isotopes+'-'+y_molecule
                names_known.append(x)
                names_standard.append(y)
                names_transform[x] = y
        else:
            y = dict_names.get(x.lower(), False)
            if y is False:
                names_unknown.append(x)
            else:
                names_known.append(x)
                names_standard.append(y)
                names_transform[x] = y

    if verbose:
        print('================================================================')
        print(" Running function 'standard_names()'")
        print('================================================================')
        print("{} of {} quantities identified in name list.".format(len(names_known),len(name_list)))
        print("List of names with standard names:")
        print('----------------------------------')
        for i,name in enumerate(names_known):
            print(name," --> ",names_standard[i])
        print('----------------------------------')
        if standardize:
            print("Identified column names have been standardized")
        else:
            print("\nRenaming can be done by setting keyword 'standardize' to True.\n")
        print('________________________________________________________________')
        print("{} quantities have not been identified in provided data:".format(len(names_unknown)))
        print('---------------------------------------------------------')
        for i,name in enumerate(names_unknown):
            print(name)
        print('---------------------------------------------------------')
        if reduce:
            print("Not identified quantities have been removed from data frame")
        else:
            print("\nReduction to known quantities can be done by setting keyword 'reduce' to True.\n")
        print('================================================================')

    if standardize:
        if reduce:
            return names_standard
        else:
            return names_standard + names_unknown
    else:
        return (names_standard, names_known, names_unknown, names_transform)

standardize(data_frame, reduce=True, store_csv=False, verbose=True)

Function providing condensed data frame with standardized names.

Checks the column names and renames them to the standard names of the model, condenses the data to the identified columns, checks the units, and converts the values to numeric types.


Args:

data_frame: pandas.DataFrame
    dataframe with the measurements
reduce: Boolean, default True
    whether to reduce data to known quantities (default True),
    otherwise full dataframe with renamed columns (for those identifiable) is returned
store_csv: str or False, default False
    path of the csv-file to save the dataframe in standard format to;
    nothing is saved if False
verbose: Boolean, default True
    verbose statement

Returns:

data_numeric, units: pandas.DataFrames
    Tabular data with standardized column names, values in numerics etc
    and table with units for standardized column names

Raises:

None (yet).
Example:

Todo’s:
- complete list of potential contaminants, environmental factors
- add name check for metabolites?
- add key-word to specify which data to extract (i.e. data columns to return)
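
The pair of return values (numeric data and a separate unit table) follows from the layout of the input, in which the first row holds the units. A minimal pure-Python sketch of that split, with hypothetical column names and a hypothetical helper `split_units`, could look like this:

```python
def split_units(rows, numeric_columns):
    """Split the unit row from the samples and convert values to float."""
    units, samples = rows[0], rows[1:]
    data_numeric = [
        {key: float(value) if key in numeric_columns else value
         for key, value in row.items()}
        for row in samples
    ]
    return data_numeric, units

rows = [
    {"sample_nr": " ", "benzene": "ug/l"},        # unit row
    {"sample_nr": "2000-001", "benzene": "263"},
    {"sample_nr": "2000-002", "benzene": "179"},
]
data_numeric, units = split_units(rows, numeric_columns={"benzene"})
print(units)
print(data_numeric)
```

The sketch leaves out the column-name and unit checks, which the function runs before the numeric conversion.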

Source code in mibiscreen/data/check_data.py
def standardize(data_frame,
                reduce = True,
                store_csv = False,
                verbose=True,
                ):
    """Function providing condensed data frame with standardized names.

    Checks the column names and renames them to the standard names of
    the model, condenses the data to the identified columns, checks the
    units, and converts the values to numeric types.

    Args:
    -------
        data_frame: pandas.DataFrame
            dataframe with the measurements
        reduce: Boolean, default True
            whether to reduce data to known quantities (default True),
            otherwise full dataframe with renamed columns (for those identifiable) is returned
        store_csv: str or False, default False
            path of the csv-file to save the dataframe in standard format to;
            nothing is saved if False
        verbose: Boolean, default True
            verbose statement

    Returns:
    -------
        data_numeric, units: pandas.DataFrames
            Tabular data with standardized column names, values in numerics etc
            and table with units for standardized column names

    Raises:
    -------
        None (yet).

    Example:
    -------
    Todo's:
        - complete list of potential contaminants, environmental factors
        - add name check for metabolites?
        - add key-word to specify which data to extract
            (i.e. data columns to return)

    """
    if verbose:
        print('================================================================')
        print(" Running function 'standardize()' on data")
        print('================================================================')
        print(' Function performing check of data including:')
        print('  * check of column names and standardizing them.')
        print('  * check of units and outlining which to adapt.')
        print('  * check of values, replacing empty values by nan \n    and making them numeric')

    data,cols= check_data_frame(data_frame,
                                sample_name_to_index = False,
                                inplace = False)

    # general column check & standardize column names
    check_columns(data,
                  standardize = True,
                  reduce = reduce,
                  verbose = verbose)

    # general unit check
    units = data.drop(labels = np.arange(1,data.shape[0]))
    col_check_list = check_units(units,
                                 verbose = verbose)

    # transform data to numeric values
    data_numeric = check_values(data.drop(labels = 0),
                                inplace = False,
                                verbose = verbose)

    # store standard data to file
    if store_csv:
        if len(col_check_list) != 0:
            print('________________________________________________________________')
            print("Data could not be saved because not all identified \n quantities are given in requested units.")
        else:
            try:
                data.to_csv(store_csv,index=False)
                if verbose:
                    print('________________________________________________________________')
                    print("Save standardized dataframe to file:\n", store_csv)
            except OSError:
                print("WARNING: data could not be saved. Check provided file path and name: {}".format(store_csv))
    if verbose:
        print('================================================================')

    return data_numeric, units

example_data

Example data.

Measurements on quantities and parameters in groundwater samples used for biodegradation and bioremediation analysis.

@author: Alraune Zech

example_data(data_type='all', with_units=False)

Function providing test data for mibiscreen data analysis.


Args:

data_type: string
    Type of data to return:
        -- "all": all types of data available
        -- "set_env_cont": well setting, environmental and contaminants data
        -- "setting": well setting data only
        -- "environment": data on environmental conditions
        -- "contaminants": data on contaminants
        -- "metabolites": data on metabolites
        -- "isotopes": data on isotopes
        -- "hydro": data on hydrogeological conditions
with_units: Boolean, default False
    flag to provide first row with units
    if False (no units), values in columns will be numerical
    if True (with units), values in columns will be objects

Returns:

pandas.DataFrame: Tabular data with standard column names

Raises:

ValueError: if the specified data_type is not available.

Example:

To be added!
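
The layout of the returned table (an optional first row with units, followed by one row per sample) can be sketched in plain Python. This is an illustration only: `build_table`, the column names and the unit spellings are hypothetical, while the sample values are taken from the first two samples in the source below.

```python
def build_table(columns, unit_row, sample_rows, with_units=False):
    """Return a list of rows; the unit row comes first if requested."""
    if len(unit_row) != len(columns):
        raise ValueError("unit row and columns differ in length")
    rows = [unit_row] if with_units else []
    rows.extend(sample_rows)
    return [dict(zip(columns, row)) for row in rows]

columns = ["sample_nr", "depth", "benzene"]   # hypothetical column names
unit_row = [" ", "m", "ug/l"]                 # hypothetical unit spellings
samples = [["2000-001", -12.0, 263.0], ["2000-002", -15.5, 179.0]]

with_unit_row = build_table(columns, unit_row, samples, with_units=True)
values_only = build_table(columns, unit_row, samples)
print(len(with_unit_row), len(values_only))
```

Because the unit row mixes text with numbers, a column read together with its unit can no longer be purely numeric, which is why `with_units=True` yields object-typed columns.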
Source code in mibiscreen/data/example_data.py
def example_data(data_type = 'all',
                 with_units = False,
                 ):
    """Function providing test data for mibiscreen data analysis.

    Args:
    -------
        data_type: string
            Type of data to return:
                -- "all": all types of data available
                -- "set_env_cont": well setting, environmental and contaminants data
                -- "setting": well setting data only
                -- "environment": data on environmental conditions
                -- "contaminants": data on contaminants
                -- "metabolites": data on metabolites
                -- "isotopes": data on isotopes
                -- "hydro": data on hydrogeological conditions
        with_units: Boolean, default False
            flag to provide first row with units
            if False (no units), values in columns will be numerical
            if True (with units), values in columns will be objects

    Returns:
    -------
        pandas.DataFrame: Tabular data with standard column names

    Raises:
    -------
        None

    Example:
    -------
        To be added!
    """
    mgl = standard_units['mgperl'][0]
    microgl = standard_units['microgperl'][0]

    setting = [names.name_sample,names.name_observation_well,names.name_sample_depth]
    setting_units = [' ',' ',standard_units['meter'][0]]
    setting_s01 = ['2000-001', 'B-MLS1-3-12', -12.]
    setting_s02 = ['2000-002', 'B-MLS1-5-15', -15.5]
    setting_s03 = ['2000-003', 'B-MLS1-6-17', -17.]
    setting_s04 = ['2000-004', 'B-MLS1-7-19', -19.]

    environment = [names.name_pH,
                   names.name_EC,
                   names.name_redox,
                   names.name_oxygen,
                   names.name_nitrate,
                   names.name_nitrite,
                   names.name_sulfate,
                   names.name_ammonium,
                   names.name_sulfide,
                   names.name_methane,
                   names.name_ironII,
                   names.name_manganese,
                   names.name_phosphate]

    environment_units = [' ',standard_units['microsimpercm'][0],standard_units['millivolt'][0],
                         mgl,mgl,mgl,mgl,mgl,mgl,mgl,mgl,mgl,mgl]
    environment_s01 = [7.23, 322., -208.,0.3,122.,0.58, 23., 5., 0., 748., 3., 1.,1.6]
    environment_s02 = [7.67, 405., -231.,0.9,5.,0.0, 0., 6., 0., 2022., 1., 0.,0]
    environment_s03 = [7.75, 223., -252.,0.1,3.,0.03, 1., 13., 0., 200., 1., 0.,0.8]
    environment_s04 = [7.53, 58., -317.,0., 180.,1., 9., 15., 6., 122., 0., 0.,0.1]

    contaminants = [names.name_benzene,
                    names.name_toluene,
                    names.name_ethylbenzene,
                    names.name_pm_xylene,
                    names.name_o_xylene,
                    names.name_indane,
                    names.name_indene,
                    names.name_naphthalene]

    contaminants_units = [microgl,microgl,microgl,microgl,
                          microgl,microgl,microgl,microgl]
    contaminants_s01 = [263., 2., 269., 14., 51., 1254., 41., 2207.]
    contaminants_s02 = [179., 7., 1690., 751., 253., 1352., 15., 5410.]
    contaminants_s03 = [853., 17., 1286., 528., 214., 1031., 31., 3879.]
    contaminants_s04 = [1254., 10., 1202., 79., 61., 814., 59., 1970.]

    metabolites = [names.name_phenol,
                   names.name_cinnamic_acid,
                   names.name_benzoic_acid]

    metabolites_units = [microgl,microgl,microgl]
    metabolites_s01 = [0.2, 0.4, 1.4]
    metabolites_s02 = [np.nan, 0.1, 0.]
    metabolites_s03 = [0., 11.4, 5.4]
    metabolites_s04 = [0.3, 0.5, 0.7]

    # isotopes = ['delta_13C-benzene','delta_2H-benzene']
    isotopes = [names.name_13C+'-'+names.name_benzene,
                names.name_2H+'-'+names.name_benzene,
                ]

    isotopes_units = [standard_units['permil'][0],standard_units['permil'][0]]
    isotopes_s01 = [-26.1,-106.]
    isotopes_s02 = [-25.8,-110.]
    isotopes_s03 = [-24.1,-118.]
    isotopes_s04 = [-24.1,-117.]

    if  data_type == 'setting':
        data = pd.DataFrame([setting_units,setting_s01,setting_s02,setting_s03,
                             setting_s04],columns = setting)

    elif  data_type == 'environment':
        units = setting_units+environment_units
        columns = setting+environment
        sample_01 = setting_s01+environment_s01
        sample_02 = setting_s02+environment_s02
        sample_03 = setting_s03+environment_s03
        sample_04 = setting_s04+environment_s04

        data = pd.DataFrame([units,sample_01,sample_02,sample_03,sample_04],
                            columns = columns)

    elif  data_type == 'contaminants':
        units = setting_units+contaminants_units
        columns = setting+contaminants
        sample_01 = setting_s01+contaminants_s01
        sample_02 = setting_s02+contaminants_s02
        sample_03 = setting_s03+contaminants_s03
        sample_04 = setting_s04+contaminants_s04

        data = pd.DataFrame([units,sample_01,sample_02,sample_03,sample_04],
                            columns = columns)

    elif  data_type == 'metabolites':

        units = setting_units+metabolites_units
        columns = setting+metabolites
        sample_01 = setting_s01+metabolites_s01
        sample_02 = setting_s02+metabolites_s02
        sample_03 = setting_s03+metabolites_s03
        sample_04 = setting_s04+metabolites_s04

        data = pd.DataFrame([units,sample_01,sample_02,sample_03,sample_04],
                            columns = columns)

    elif  data_type == 'isotopes':

        units = setting_units+isotopes_units
        columns = setting+isotopes
        sample_01 = setting_s01+isotopes_s01
        sample_02 = setting_s02+isotopes_s02
        sample_03 = setting_s03+isotopes_s03
        sample_04 = setting_s04+isotopes_s04

        data = pd.DataFrame([units,sample_01,sample_02,sample_03,sample_04],
                            columns = columns)

    elif data_type == "set_env_cont":

        units = setting_units+environment_units+contaminants_units
        columns = setting+environment+contaminants
        sample_01 = setting_s01+environment_s01+contaminants_s01
        sample_02 = setting_s02+environment_s02+contaminants_s02
        sample_03 = setting_s03+environment_s03+contaminants_s03
        sample_04 = setting_s04+environment_s04+contaminants_s04

        data = pd.DataFrame([units,sample_01,sample_02,sample_03,sample_04],
                            columns = columns)

    elif data_type == 'all':
        units = setting_units+environment_units+contaminants_units+metabolites_units + isotopes_units
        columns = setting+environment+contaminants+metabolites + isotopes
        sample_01 = setting_s01+environment_s01+contaminants_s01+metabolites_s01+isotopes_s01
        sample_02 = setting_s02+environment_s02+contaminants_s02+metabolites_s02+isotopes_s02
        sample_03 = setting_s03+environment_s03+contaminants_s03+metabolites_s03+isotopes_s03
        sample_04 = setting_s04+environment_s04+contaminants_s04+metabolites_s04+isotopes_s04

        data = pd.DataFrame([units,sample_01,sample_02,sample_03,sample_04],
                            columns = columns)

    else:
        raise ValueError("Specified data type '{}' not available".format(data_type))

    if not with_units:
        data.drop(0,inplace = True)
        for quantity in data.columns[2:]:
            data[quantity] = pd.to_numeric(data[quantity])

    return data
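
The `with_units` handling at the end of the function can be tried in isolation. The following is a minimal sketch with plain pandas, not the library's own example; the column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical column names and values, for illustration only
columns = ["sample_nr", "obs_well", "depth", "pH"]
units = ["-", "-", "m", "-"]
sample_01 = ["2000-001", "B-MLS1-3-12", -12.0, 7.23]
sample_02 = ["2000-002", "B-MLS1-5-15", -15.5, 7.67]

# First row holds the units, the following rows hold the samples
data = pd.DataFrame([units, sample_01, sample_02], columns=columns)

# With the units row present, measurement columns mix strings and numbers
dtype_with_units = data["depth"].dtype

# Drop the units row (index label 0) and convert the measurement
# columns (everything after the first two columns) to numeric types
data_no_units = data.drop(0)
for quantity in data_no_units.columns[2:]:
    data_no_units[quantity] = pd.to_numeric(data_no_units[quantity])
```

After the conversion, the measurement columns can be used directly in numerical analysis, while the remaining rows keep their original index labels.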

load_data

Functions for data I/O handling.

@author: Alraune Zech

load_csv(file_path=None, verbose=False, store_provenance=False)

Function to load data from csv file.


file_path: str
    Name of the path to the file
verbose: Boolean
    verbose flag
store_provenance: Boolean
    To add!

data: pd.DataFrame
    Tabular data
units: pd.DataFrame
    Tabular data on units

ValueError: If `file_path` is not specified
OSError: If `file_path` is not a valid file location
Example:

This function can be called with the file path of the example data as argument using:

>>> from mibiscreen.data.load_data import load_csv
>>> data, units = load_csv("example_data.csv")
Source code in mibiscreen/data/load_data.py
def load_csv(
        file_path = None,
        verbose = False,
        store_provenance = False,
        ):
    """Function to load data from csv file.

    Args:
    -------
        file_path: str
            Name of the path to the file
        verbose: Boolean
            verbose flag
        store_provenance: Boolean
            To add!

    Returns:
    -------
        data: pd.DataFrame
            Tabular data
        units: pd.DataFrame
            Tabular data on units

    Raises:
    -------
        ValueError: If `file_path` is not specified
        OSError: If `file_path` is not a valid file location

    Example:
    -------
       This function can be called with the file path of the example data as
       argument using:

        >>> from mibiscreen.data.load_data import load_csv
        >>> data, units = load_csv("example_data.csv")

    """
    if file_path is None:
        raise ValueError('Specify file path and file name!')
    if not os.path.isfile(file_path):
        raise OSError('Cannot access file at : ',file_path)

    data = pd.read_csv(file_path, encoding="unicode_escape")
    if ";" in data.iloc[1].iloc[0]:
        data = pd.read_csv(file_path, sep=";", encoding="unicode_escape")
    units = data.drop(labels = np.arange(1,data.shape[0]))

    if verbose:
        print('================================================================')
        print(" Running function 'load_csv()' on data file ", file_path)
        print('================================================================')
        print("Units of quantities:")
        print('-------------------')
        print(units)
        print('________________________________________________________________')
        print("Loaded data as pandas DataFrame:")
        print('--------------------------------')
        print(data)
        print('================================================================')

    return data, units
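
The separator fallback and the split into `data` and `units` can be reproduced on an in-memory file. This is a hedged sketch using plain pandas rather than `load_csv` itself; the file content is invented, and the `str(...)` guard is an addition that is not in the source above.

```python
import io

import numpy as np
import pandas as pd

# Invented semicolon-separated content: header, units row, two samples
raw = "sample_nr;depth;pH\n-;m;-\ns1;-6.0;7.2\ns2;-7.0;7.6\n"

# First attempt with the default comma separator yields a single column
data = pd.read_csv(io.StringIO(raw))

# If the first cell of the second row still contains ';', re-read with sep=';'
if ";" in str(data.iloc[1].iloc[0]):
    data = pd.read_csv(io.StringIO(raw), sep=";")

# Keep only the first row (the units) by dropping all later rows
units = data.drop(labels=np.arange(1, data.shape[0]))
```

Note that `data` still contains the units as its first row; only `units` is reduced to that single row.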

load_excel(file_path=None, sheet_name=0, verbose=False, store_provenance=False, **kwargs)

Function to load data from excel file.


file_path: str
    Name of the path to the file
sheet_name: int
    Number of the sheet in the excel file to load
verbose: Boolean
    verbose flag
store_provenance: Boolean
    To add!
**kwargs: optional keyword arguments to pass to pandas' routine
    read_excel()

data: pd.DataFrame
    Tabular data
units: pd.DataFrame
    Tabular data on units

ValueError: If `file_path` is not specified
OSError: If `file_path` is not a valid file location
Example:

This function can be called with the file path of the example data as argument using:

>>> from mibiscreen.data.load_data import load_excel
>>> data, units = load_excel("example_data.xlsx")
Source code in mibiscreen/data/load_data.py
def load_excel(
        file_path = None,
        sheet_name = 0,
        verbose = False,
        store_provenance = False,
        **kwargs,
        ):
    """Function to load data from excel file.

    Args:
    -------
        file_path: str
            Name of the path to the file
        sheet_name: int
            Number of the sheet in the excel file to load
        verbose: Boolean
            verbose flag
        store_provenance: Boolean
            To add!
        **kwargs: optional keyword arguments to pass to pandas' routine
            read_excel()

    Returns:
    -------
        data: pd.DataFrame
            Tabular data
        units: pd.DataFrame
            Tabular data on units

    Raises:
    -------
        ValueError: If `file_path` is not specified
        OSError: If `file_path` is not a valid file location

    Example:
    -------
       This function can be called with the file path of the example data as
       argument using:

        >>> from mibiscreen.data.load_data import load_excel
        >>> data, units = load_excel("example_data.xlsx")

    """
    if file_path is None:
        raise ValueError('Specify file path and file name!')
    if not os.path.isfile(file_path):
        raise OSError('Cannot access file at : ',file_path)

    data = pd.read_excel(file_path,
                         sheet_name = sheet_name,
                         **kwargs)
    if ";" in data.iloc[1].iloc[0]:
        data = pd.read_excel(file_path,
                             sep=";",
                             sheet_name = sheet_name,
                             **kwargs)

    units = data.drop(labels = np.arange(1,data.shape[0]))

    if verbose:
        print('==============================================================')
        print(" Running function 'load_excel()' on data file ", file_path)
        print('==============================================================')
        print("Unit of quantities:")
        print('-------------------')
        print(units)
        print('________________________________________________________________')
        print("Loaded data as pandas DataFrame:")
        print('--------------------------------')
        print(data)
        print('================================================================')

    return data, units

names_data

Name specifications of data!

File containing name specifications of quantities and parameters measured in groundwater samples useful for biodegradation and bioremediation analysis.

@author: A. Zech

set_data

Functions for data extraction and merging in preparation of analysis and plotting.

@author: Alraune Zech

compare_lists(list1, list2, verbose=False)

Checking overlap of two given lists.

Input
list1: list of strings
    given extensive list (usually column names of a pd.DataFrame)
list2: list of strings
    list of names to extract/check overlap with strings in list 'list1'
verbose: Boolean, default False
    verbosity flag
Output
(intersection, remainder_list1, remainder_list2): tuple of lists
    * intersection: list of strings present in both lists 'list1' and 'list2'
    * remainder_list1: list of strings only present in 'list1'
    * remainder_list2: list of strings only present in 'list2'
Example:

>>> list1 = ['test1','test2']
>>> list2 = ['test1','test3']
>>> compare_lists(list1,list2)
(['test1'], ['test2'], ['test3'])

Source code in mibiscreen/data/set_data.py
def compare_lists(list1,
                  list2,
                  verbose = False,
                  ):
    """Checking overlap of two given lists.

    Input
    -----
        list1: list of strings
            given extensive list (usually column names of a pd.DataFrame)
        list2: list of strings
            list of names to extract/check overlap with strings in list 'list1'
        verbose: Boolean, default False
            verbosity flag

    Output
    ------
        (intersection, remainder_list1, remainder_list2): tuple of lists
            * intersection: list of strings present in both lists 'list1' and 'list2'
            * remainder_list1: list of strings only present in 'list1'
            * remainder_list2: list of strings only present in 'list2'

    Example:
    -------
    >>> list1 = ['test1','test2']
    >>> list2 = ['test1','test3']
    >>> compare_lists(list1,list2)
    (['test1'], ['test2'], ['test3'])

    """
    intersection = list(set(list1) & set(list2))
    remainder_list1 = list(set(list1) - set(list2))
    remainder_list2 = list(set(list2) - set(list1))

    if verbose:
        print('================================================================')
        print(" Running function 'compare_lists()'")
        print('================================================================')
        print("strings present in both lists:", intersection)
        print("strings only present in either of the lists:", remainder_list1 +  remainder_list2)

    return (intersection,remainder_list1,remainder_list2)
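
Because the function builds its results from Python sets, the order of names in the returned lists is not guaranteed to match the input order. If order matters, an order-preserving variant could look like the sketch below; `compare_lists_ordered` is a hypothetical helper, not part of mibiscreen.

```python
def compare_lists_ordered(list1, list2):
    """Like compare_lists, but keeps names in their input order (hypothetical helper)."""
    set1, set2 = set(list1), set(list2)
    # names present in both lists, in the order of list1
    intersection = [name for name in list1 if name in set2]
    # names only present in list1
    remainder_list1 = [name for name in list1 if name not in set2]
    # names only present in list2
    remainder_list2 = [name for name in list2 if name not in set1]
    return intersection, remainder_list1, remainder_list2


result = compare_lists_ordered(
    ["sample_nr", "pH", "benzene"],
    ["benzene", "toluene", "pH"],
)
```

Here `result` holds the shared names in the order they appear in the first list, followed by the two remainders.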

determine_quantities(cols, name_list='all', verbose=False)

Determine quantities to analyse.

Input
cols: list
    Names of quantities from pd.DataFrame
name_list: str or list, default is 'all'
    either short name for group of quantities to use, such as:
            - 'all' (all quantities given in data frame except settings)
            - 'BTEX' (for benzene, toluene, ethylbenzene, xylene)
            - 'BTEXIIN' (for benzene, toluene, ethylbenzene, xylene,
                          indene, indane and naphthalene)
            - 'all_cont' (for all contaminants in name list)
    or list of strings with names of quantities to use
verbose: Boolean
    verbose flag (default False)
Output
quantities: list
    list of strings with names of quantities to use and present in data
Source code in mibiscreen/data/set_data.py
def determine_quantities(cols,
         name_list = 'all',
         verbose = False,
         ):
    """Determine quantities to analyse.

    Input
    -----
        cols: list
            Names of quantities from pd.DataFrame
        name_list: str or list, default is 'all'
            either short name for group of quantities to use, such as:
                    - 'all' (all quantities given in data frame except settings)
                    - 'BTEX' (for benzene, toluene, ethylbenzene, xylene)
                    - 'BTEXIIN' (for benzene, toluene, ethylbenzene, xylene,
                                  indene, indane and naphthalene)
                    - 'all_cont' (for all contaminants in name list)
            or list of strings with names of quantities to use
        verbose: Boolean
            verbose flag (default False)

    Output
    ------
        quantities: list
            list of strings with names of quantities to use and present in data

    """
    if name_list == 'all':
        ### choosing all column names except those of settings
        quantities = list(set(cols) - set(names.setting_data))
        if verbose:
            print("All data columns except for those with settings will be considered.")
        remainder_list2 = []

    elif isinstance(name_list, list): # choosing specific list of column names except those of settings
        quantities,remainder_list1,remainder_list2 = compare_lists(cols,name_list)

    elif isinstance(name_list, str) and (name_list in names.contaminants.keys()):
        if verbose:
            print("Choosing specific group of contaminants:", name_list)

        contaminants = names.contaminants[name_list].copy()

        # handling of xylene isomers
        if (names.name_o_xylene in cols) and (names.name_pm_xylene in cols):
            contaminants.remove(names.name_xylene)

        quantities,remainder_list1,remainder_list2 = compare_lists(cols,contaminants)

    elif isinstance(name_list, str) and (name_list in names.electron_acceptors.keys()):
        if verbose:
            print("Choosing specific group of electron acceptors:", name_list)

        electron_acceptors = names.electron_acceptors[name_list].copy()

        quantities,remainder_list1,remainder_list2 = compare_lists(cols,electron_acceptors)

    elif isinstance(name_list, str):
        quantities,remainder_list1,remainder_list2 = compare_lists(cols,[name_list])

        if verbose:
            print("Choosing single quantity:", name_list)

    else:
        raise ValueError("Keyword 'name_list' not in correct format")

    if not quantities:
        raise ValueError("No quantities from name list provided in data. "
                         "Presumably data not in standardized format. "
                         "Run 'standardize()' first.")

    if remainder_list2:
        print("WARNING: quantities from name list not in data:", *remainder_list2,sep='\n')
        print("Maybe data not in standardized format. Run 'standardize()' first.")
        print("_________________________________________________________________")

    if verbose:
        print("Selected set of quantities: ", *quantities,sep='\n')

    return quantities
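
The selection logic above can be illustrated without the package. The sketch below mirrors the main branches under stated assumptions: `SETTING_DATA` and `CONTAMINANT_GROUPS` are hypothetical stand-ins for `names.setting_data` and `names.contaminants`, the column names are invented, and `select_quantities` is not a mibiscreen function.

```python
# Hypothetical stand-ins for names.setting_data and names.contaminants
SETTING_DATA = ["sample_nr", "obs_well", "depth"]
CONTAMINANT_GROUPS = {
    "BTEX": ["benzene", "toluene", "ethylbenzene",
             "pm_xylene", "o_xylene", "xylene"],
}


def select_quantities(cols, name_list="all"):
    """Return the requested names that are present in `cols` (illustrative only)."""
    if name_list == "all":
        # every column except the setting columns
        return [c for c in cols if c not in SETTING_DATA]

    if isinstance(name_list, list):
        wanted = list(name_list)
    elif isinstance(name_list, str) and name_list in CONTAMINANT_GROUPS:
        wanted = list(CONTAMINANT_GROUPS[name_list])
        # if both isomer columns exist, do not also request total xylene
        if "o_xylene" in cols and "pm_xylene" in cols:
            wanted.remove("xylene")
    elif isinstance(name_list, str):
        wanted = [name_list]  # single quantity
    else:
        raise ValueError("Keyword 'name_list' not in correct format")

    selected = [c for c in cols if c in wanted]
    if not selected:
        raise ValueError("No quantities from name list provided in data.")
    return selected


cols = ["sample_nr", "depth", "pH", "benzene", "toluene", "pm_xylene", "o_xylene"]
all_quantities = select_quantities(cols)
btex_quantities = select_quantities(cols, "BTEX")
```

A name that matches no column, such as a quantity missing from the data, ends in the `ValueError` branch, which corresponds to the hint to run `standardize()` first.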

extract_data(data_frame, name_list, keep_setting_data=True, verbose=False)

Extracting data of specified variables from dataframe.


data_frame: pandas.DataFrame
    dataframe with the measurements
name_list: list of strings
    list of column names to extract from dataframe
keep_setting_data: bool, default True
    Whether to keep setting data in the DataFrame.
verbose: Boolean
    verbose flag (default False)

data: pd.DataFrame
    dataframe with the measurements
Raises:

None (yet).

Example:

To be added.

Source code in mibiscreen/data/set_data.py
def extract_data(data_frame,
                 name_list,
                 keep_setting_data = True,
                 verbose = False,
                 ):
    """Extracting data of specified variables from dataframe.

    Args:
    -------
        data_frame: pandas.DataFrame
            dataframe with the measurements
        name_list: list of strings
            list of column names to extract from dataframe
        keep_setting_data: bool, default True
            Whether to keep setting data in the DataFrame.
        verbose: Boolean
            verbose flag (default False)

    Returns:
    -------
        data: pd.DataFrame
            dataframe with the measurements

    Raises:
    -------
    None (yet).

    Example:
    -------
    To be added.

    """
    ### check on correct data input format and extracting column names as list
    data,cols= check_data_frame(data_frame,inplace = False)

    quantities = determine_quantities(cols,
                                      name_list = name_list,
                                      verbose = verbose)

    if keep_setting_data:
        settings,r1,r2 = compare_lists(cols,names.setting_data)
        i1,quantities_without_settings,r2 = compare_lists(quantities,settings)
        columns_names = settings + quantities_without_settings

    else:
        columns_names = quantities

    return data[columns_names]
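
No usage example is given for `extract_data` yet. The column selection it performs (setting columns first, then the requested quantities) can be sketched with plain pandas as follows; the column names and the `setting_names` list are illustrative assumptions rather than the library's standard names.

```python
import pandas as pd

# Assumed setting columns (stand-in for names.setting_data)
setting_names = ["sample_nr", "obs_well", "depth"]

# Invented measurements
data = pd.DataFrame({
    "sample_nr": ["s1", "s2"],
    "depth": [-6.0, -7.0],
    "pH": [7.2, 7.6],
    "benzene": [263.0, 179.0],
})
name_list = ["benzene"]

# Setting columns that are actually present in the data
settings = [c for c in data.columns if c in setting_names]
# Requested quantities that are present and not already counted as settings
quantities = [c for c in data.columns if c in name_list and c not in settings]

# Equivalent of keep_setting_data=True: settings first, then quantities
subset = data[settings + quantities]
```

With `keep_setting_data=False`, the analogous selection would be `data[quantities]` only.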

merge_data(data_frames_list, how='outer', on=[names.name_sample], clean=True, **kwargs)

Merging dataframes along columns on similar sample name.


data_frames_list: list of pd.DataFrame
    list of dataframes with the measurements
how: str, default 'outer'
    Type of merge to be performed.
    corresponds to keyword in pd.merge()
    {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘outer’
on: list, default "sample_nr"
    Column name(s) to join on.
    corresponds to keyword in pd.merge()
clean: Boolean, default True
    Whether to drop, from each added dataframe, columns that are already
    present and are not merge keys (potentially settings other than sample_name)
**kwargs: dict
    optional keyword arguments to be passed to pd.merge()

data: pd.DataFrame
    dataframe with the measurements
Raises:

None (yet).

Example:

To be added.

Source code in mibiscreen/data/set_data.py
def merge_data(data_frames_list,
               how='outer',
               on=[names.name_sample],
               clean = True,
               **kwargs,
               ):
    """Merging dataframes along columns on similar sample name.

    Args:
    -------
        data_frames_list: list of pd.DataFrame
            list of dataframes with the measurements
        how: str, default 'outer'
            Type of merge to be performed.
            corresponds to keyword in pd.merge()
            {‘left’, ‘right’, ‘outer’, ‘inner’, ‘cross’}, default ‘outer’
        on: list, default "sample_nr"
            Column name(s) to join on.
            corresponds to keyword in pd.merge()
        clean: Boolean, default True
            Whether to drop, from each added dataframe, columns that are already
            present and are not merge keys (potentially settings other than sample_name)
        **kwargs: dict
            optional keyword arguments to be passed to pd.merge()

    Returns:
    -------
        data: pd.DataFrame
            dataframe with the measurements

    Raises:
    -------
    None (yet).

    Example:
    -------
    To be added.

    """
    if len(data_frames_list)<2:
        raise ValueError('Provide List of DataFrames.')


    data_merge = data_frames_list[0]
    for data_add in data_frames_list[1:]:
        if clean:
            intersection,remainder_list1,remainder_list2 = compare_lists(
                data_merge.columns.to_list(),data_add.columns.to_list())
            intersection,remainder_list1,remainder_list2 = compare_lists(intersection,on)
            data_add = data_add.drop(labels = remainder_list1+remainder_list2,axis = 1)
        data_merge = pd.merge(data_merge,data_add, how=how, on=on,**kwargs)
        # complete data set, where values of porosity are added (otherwise nan)

    return data_merge
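
The effect of `clean=True` combined with an outer merge can be seen on two small frames. This is a sketch of the idea with plain pandas, not a call to `merge_data`; the column names and values are invented, and `"sample_nr"` is assumed as the merge key.

```python
import pandas as pd

# Two invented tables sharing the key 'sample_nr' and the setting 'depth'
env = pd.DataFrame({
    "sample_nr": ["s1", "s2"],
    "depth": [-6.0, -7.0],
    "pH": [7.2, 7.6],
})
cont = pd.DataFrame({
    "sample_nr": ["s2", "s3"],
    "depth": [-7.0, -8.0],
    "benzene": [263.0, 179.0],
})

on = ["sample_nr"]

# Columns present in both frames that are not merge keys; dropping them
# from the added frame avoids duplicated 'depth_x'/'depth_y' columns
shared = [c for c in cont.columns if c in env.columns and c not in on]

merged = pd.merge(env, cont.drop(columns=shared), how="outer", on=on)
```

One consequence of this cleaning step is that sample `s3`, which only occurs in the second frame, ends up without a `depth` value in the merged result.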

unit_settings

Unit specifications of data!

File containing unit specifications of quantities and parameters measured in groundwater samples useful for biodegradation and bioremediation analysis.

@author: Alraune Zech