Metage2Metabo-PostAViz’s API
Data structures
- class m2m_postaviz.data_struct.DataStorage(save_path: Path)[source]
Bases: object
- ABUNDANCE_FILE = 'abundance_file.tsv'
- ALL_FILE_NAMES = ('metadata_dataframe_postaviz.parquet.gzip', 'main_dataframe_postaviz.tsv', 'normalised_abundance_dataframe_postaviz.tsv', 'taxonomic_dataframe_postaviz.tsv', 'producers_cscope_dataframe.parquet.gzip', 'producers_iscope_dataframe.parquet.gzip', 'total_production_dataframe_postaviz.tsv', 'pcoa_dataframe_postaviz.tsv', 'abundance_file.tsv', 'sample_info.json', 'padmet_compounds_category_tree.json')
- HAS_ABUNDANCE_DATA: bool = False
- HAS_TAXONOMIC_DATA: bool = False
- ID_VAR = 'smplID'
- JSON_FILENAME = 'sample_info.json'
- USE_METACYC_PADMET: bool = False
- associate_bin_taxonomy(bin_list: list) list[source]
Associate each bin in the list with its taxonomic ranks, separated by <;>.
- Parameters:
bin_list (list) – _description_
- Returns:
list – _description_
- get_added_value_dataframe(cpd_input=None, sample_filter_mode='', sample_filter_value=None)[source]
Return Cscope producers dataframe, Iscope producers dataframe and the difference of these two dataframes.
- Parameters:
cpd_input (list, optional) – List of compounds of interest. Defaults to None.
sample_filter_mode (str, optional) – Filter by sample mode. Defaults to “”.
sample_filter_value (list, optional) – List of values used to filter samples. Defaults to None.
- Returns:
tuple – Tuple of three (producers) dataframes: Cscope producers, Iscope producers, and the difference between the two.
- get_bin_dataframe(columns=None, condition=None, scope_mode='cscope') DataFrame[source]
Find the bin_dataframe file in the save_path of DataStorage object and read it with the condition given in args.
- Parameters:
columns (str, optional) – Columns label. Defaults to None.
condition (Tuple, optional) – Tuple of conditions. Defaults to None.
- Returns:
pd.DataFrame – Resulting bin_dataframe.
- get_bin_list_from_taxonomic_rank(rank, choice)[source]
Return a list of bins corresponding to the taxonomic rank given in input.
Example: taxonomic rank = order, choice = Clostridia.
- Parameters:
rank (str) – Taxonomic rank
choice (str) – One of the unique values of the taxonomic rank.
- Returns:
list – list of bins in the taxonomic scope
- get_compounds_from_category(data, results)[source]
Find and return in a list all the leaves of the tree. Each leaf is a compound. A compound has no children, but work needs to be done to be sure that category nodes that do not have any children (which is not supposed to happen) will be in the result list.
- Parameters:
data (dict) – Tree
results (list, optional) – List used to carry results between recursive calls. Ignore it and leave it at its default. Defaults to [].
- Returns:
list – List of childless nodes found in the tree (compounds).
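The recursive leaf search described above can be sketched as follows. This is a minimal illustration, not the package's implementation: it assumes the tree is a nested dict whose keys are node IDs and whose leaves map to empty dicts, and the name `collect_leaves` is hypothetical.

```python
from typing import Optional


def collect_leaves(tree: dict, results: Optional[list] = None) -> list:
    """Return the keys of every childless node found in a nested dict."""
    if results is None:  # create a fresh list instead of sharing a mutable default
        results = []
    for key, children in tree.items():
        if children:  # the node has children: keep descending
            collect_leaves(children, results)
        else:  # empty dict: the node is treated as a compound
            results.append(key)
    return results


# Hypothetical tree: two categories and two compounds.
tree = {"FRAMES": {"Compounds": {"CPD-1": {}, "Acids": {"CPD-2": {}}}}}
leaves = collect_leaves(tree)
```

Note that a category node without children would also be collected by this sketch, which is the caveat raised in the description.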
- get_metacyc_category_list(tree=None)[source]
Return the category list of the MetaCyc database. By default, it returns the list of categories of the whole tree. If a subtree is given, it returns only the subcategories of that tree.
- Parameters:
tree (Dict, optional) – Subtree from which to get the keys. If None, the whole tree is used. Defaults to None.
- Returns:
List – List of categories.
- get_outsider_cpd()[source]
Return the compounds that are found in the data but do not fit in the OTHERS category.
- Returns:
Tuple – cpd list / category names
- get_sub_tree_recursive(data, id, results)[source]
Search through the tree for a match between key and id. Return only the part of the tree with the node id as the root.
- Parameters:
data (dict) – original Tree.
id (str) – ID of the node.
results (list, optional) – List used to carry results between recursive calls. Ignore it and leave it at its default. Defaults to [].
- Returns:
list – list containing the dictionary of the node.
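A subtree lookup of this kind might look like the sketch below. It assumes the same nested-dict layout (node ID as key, children as value); `find_subtree` is a hypothetical name used only for illustration.

```python
from typing import Optional


def find_subtree(tree: dict, node_id: str, results: Optional[list] = None) -> list:
    """Return a list holding the part of the tree rooted at node_id."""
    if results is None:
        results = []
    for key, children in tree.items():
        if key == node_id:
            # Keep the matching node together with everything below it.
            results.append({key: children})
        else:
            find_subtree(children, node_id, results)
    return results


tree = {"FRAMES": {"Compounds": {"CPD-1": {}, "Acids": {"CPD-2": {}}}}}
acids = find_subtree(tree, "Acids")
missing = find_subtree(tree, "Unknown")
```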
- load_files(load_path: Path)[source]
Loop through the files in the save directory and return a dictionary of True/False for each file.
If required files are not present, raise a RuntimeError.
- Parameters:
load_path (Path) – Path of the save directory.
- Raises:
RuntimeError – If required files are absent.
- Returns:
dict – Dictionary of True/False for each file.
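The presence check can be illustrated with the standard library alone. In this sketch the function name `check_files` and the split between expected and required files are assumptions; which files the package actually treats as required is not stated here.

```python
import tempfile
from pathlib import Path


def check_files(load_path: Path, expected: tuple, required: tuple) -> dict:
    """Map each expected file name to True/False and fail on missing required files."""
    present = {name: (load_path / name).is_file() for name in expected}
    missing = [name for name in required if not present.get(name, False)]
    if missing:
        raise RuntimeError(f"Required files are missing: {missing}")
    return present


# Demonstration in a temporary directory holding a single file.
with tempfile.TemporaryDirectory() as tmp:
    save_dir = Path(tmp)
    (save_dir / "sample_info.json").write_text("{}")
    status = check_files(
        save_dir,
        expected=("sample_info.json", "abundance_file.tsv"),
        required=("sample_info.json",),  # assumed to be required for this example
    )
```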
- open_tsv(key: str)[source]
Return the dataframe corresponding to the key given as input.
- Parameters:
key (str) – name of dataframe’s file
- Returns:
pd.Dataframe – Pandas dataframe
- read_parquet_with_pandas(path: Path, col: list | None = None, condition: list | None = None) DataFrame[source]
Transfer the column choice and condition as keyword-arguments to the pandas read parquet function.
- Parameters:
path (Path) – Path of the parquet file.
col (Optional[list], optional) – Labels of the columns to open. Defaults to None.
condition (Optional[list], optional) – Nested tuples used to select only the rows that match the conditions. Defaults to None.
- Returns:
pd.DataFrame – Resulting dataframe.
- save_dataframe(df_to_save, file_name: str, extension: str = '.tsv')[source]
Save the dataframe given in input. If a file with the same name has already been saved, the name is changed accordingly.
- Parameters:
df_to_save (pd.DataFrame) – Dataframe to save.
file_name (str) – Name of the file.
extension (str, optional) – File extension. Defaults to “.tsv”.
- Returns:
_type_ – _description_
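How the name is changed when a file already exists is not specified above. One common approach, shown here purely as an assumption, is to append a numeric suffix until the name is free; `unique_path` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def unique_path(directory: Path, file_name: str, extension: str = ".tsv") -> Path:
    """Return a path in directory that does not clash with an existing file."""
    candidate = directory / f"{file_name}{extension}"
    counter = 1
    while candidate.exists():
        # Assumed naming scheme: name_1.tsv, name_2.tsv, ...
        candidate = directory / f"{file_name}_{counter}{extension}"
        counter += 1
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    first = unique_path(out_dir, "plot_data")
    first.write_text("smplID\tvalue\n")
    second = unique_path(out_dir, "plot_data")
    first_name, second_name = first.name, second.name
```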
- m2m_postaviz.data_utils.bin_dataframe_build(scope_directory: Path, scope_mode: str = 'cscope', abundance_path: Path | None = None, taxonomy_path: Path | None = None, savepath: Path | None = None)[source]
Build a large dataframe with all the bins of the different samples as index. The dataframe contains the list of production, the abundance of the bin in the sample and the count of production with or without abundance.
- Parameters:
scope_directory (Path) – Directory containing the scope files.
scope_mode (str, optional) – Scope mode. Defaults to “cscope”.
abundance_path (Optional[Path], optional) – Abundance file path. Defaults to None.
taxonomy_path (Optional[Path], optional) – Taxonomy file path. Defaults to None.
savepath (Optional[Path], optional) – Save path. Defaults to None.
- Returns:
pd.DataFrame – Pandas dataframe
- m2m_postaviz.data_utils.build_dataframes(dir_path: Path, metadata_path: Path, abundance_path: Path | None = None, taxonomic_path: Path | None = None, save_path: Path | None = None, metacyc: Path | None = None)[source]
Main function. dir_path, metadata_path and save_path are required. Generates most of the core dataframes to avoid computation on the application side.
- Parameters:
dir_path (Path) – Directory path containing M2M output.
metadata_path (Path) – Metadata file path.
abundance_path (Optional[Path], optional) – Abundance file path. Defaults to None.
taxonomic_path (Optional[Path], optional) – Taxonomic file path. Defaults to None.
save_path (Optional[Path], optional) – Output path. Defaults to None.
metacyc (Optional[Path], optional) – Metacyc DB file path. Defaults to None.
- m2m_postaviz.data_utils.build_main_dataframe(save_path: Path, cscope_directory: Path)[source]
Create and save the main dataframe, with samples in rows and compounds in columns. It takes the compound production in each sample’s cscope and returns a pandas Series with 1 (produced) or 0 (absent) for each compound. All the returned Series are then merged into a dataframe.
- Parameters:
save_path (Path) – Save path given in CLI.
cscope_directory (Path) – Directory containing the samples’ cscope.
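The shape of this sample-by-compound table can be shown with plain dictionaries. The sketch assumes each sample's cscope has already been reduced to a set of produced compound IDs; the function name and the input layout are illustrative only.

```python
def presence_matrix(sample_compounds: dict) -> dict:
    """Build {sample: {compound: 1 or 0}} from {sample: set of produced compounds}."""
    # Columns are the union of all compounds seen in any sample.
    all_compounds = sorted(set().union(*sample_compounds.values()))
    return {
        sample: {cpd: int(cpd in produced) for cpd in all_compounds}
        for sample, produced in sample_compounds.items()
    }


# Hypothetical production sets for two samples.
matrix = presence_matrix({
    "sample_1": {"CPD-1", "CPD-2"},
    "sample_2": {"CPD-2", "CPD-3"},
})
```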
- m2m_postaviz.data_utils.build_parent_child_dataframe(padmet: PadmetRef, dataframe: DataFrame, current_id, child_column='child_id', parent_column='parent_id')[source]
Build a child/parent relation dataframe between compound categories from a MetaCyc database in padmet file format.
- Parameters:
padmet (PadmetRef) – PadmetRef object from padmet package.
dataframe (pd.DataFrame) – Dataframe passed along between recursive calls.
current_id (_type_) – ID of the current category / compound.
child_column (str, optional) – Column label of the child column of the dataframe. Defaults to “child_id”.
parent_column (str, optional) – Column label of the parent column of the dataframe. Defaults to “parent_id”.
- m2m_postaviz.data_utils.build_pcoa_dataframe(save_path: Path) DataFrame[source]
Compute Principal Coordinate Analysis from the main_dataframe given in input. Merge with metadata from the smplID column or index.
- Parameters:
main_dataframe (pd.DataFrame) – Dataframe from which the pcoa will be made.
metadata (pd.DataFrame) – Metadata dataframe (must have an smplID identifier column for the merge to work).
- Returns:
pd.DataFrame – PCoA dataframe with sample ID as index, PC1 and PC2 results and all metadata.
- m2m_postaviz.data_utils.build_tree_from_root(node, id, df)[source]
Build a tree from a dataframe and a dictionary with the first key as root. The first key is the first parent node from which the tree will be built starting with its first child. Any node that is not connected indirectly with the root node won’t be in the tree.
Example:
root = {}
root["FRAMES"] = {}
build_tree_from_root(root["FRAMES"], "FRAMES", dataframe)
- Parameters:
node (dict) – root node
id (str) – Root node key id, corresponding to the string of the first node in the dataframe.
df (pd.DataFrame) – Dataframe with 2 columns: columns names must be child_id and parent_id. child_id column has only unique values.
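The tree construction can be sketched without pandas by replacing the two-column dataframe with a list of (child_id, parent_id) pairs. This substitution and the name `build_tree` are assumptions made to keep the example self-contained; the sketch also assumes the relations contain no cycle.

```python
def build_tree(node: dict, node_id: str, relations: list) -> None:
    """Attach to node every child of node_id, then recurse into each child."""
    for child_id, parent_id in relations:
        if parent_id == node_id:
            node[child_id] = {}
            build_tree(node[child_id], child_id, relations)


# Hypothetical (child_id, parent_id) pairs; "Orphan" is not linked to the root.
relations = [
    ("Compounds", "FRAMES"),
    ("CPD-1", "Compounds"),
    ("Acids", "Compounds"),
    ("CPD-2", "Acids"),
    ("Orphan", "Nowhere"),
]
root = {"FRAMES": {}}
build_tree(root["FRAMES"], "FRAMES", relations)
```

As stated in the description, the node that has no path to the root is left out of the resulting tree.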
- m2m_postaviz.data_utils.concat_chunk(chunk_dir: Path, save_path: Path, scope_type: str)[source]
Concatenation of all sub_dataframes produced.
- Parameters:
chunk_dir (Path) – Directory path where the chunks are stored.
save_path (Path) – Save result path
scope_type (str) – Cscope or Iscope
- m2m_postaviz.data_utils.correlation_test(value_array, factor_array, factor_name, method: str = 'pearson')[source]
- m2m_postaviz.data_utils.get_significance_symbol(pval: float) str[source]
Return Significance symbol depending on pvalue given.
- Parameters:
pval (float) – Pvalue of the test
- Returns:
str – Significance’s symbol
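The thresholds and symbols used by the package are not listed here. The sketch below assumes the widespread star convention (0.05, 0.01, 0.001) and an "ns" label for non-significant results; both are assumptions, as is the function name.

```python
def significance_symbol(pval: float) -> str:
    """Map a p-value to a symbol, using the conventional star thresholds."""
    if pval < 0.001:
        return "***"
    if pval < 0.01:
        return "**"
    if pval < 0.05:
        return "*"
    return "ns"  # assumed label for non-significant results


symbols = [significance_symbol(p) for p in (0.0004, 0.003, 0.03, 0.2)]
```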
- m2m_postaviz.data_utils.has_only_unique_value(dataframe, input1, input2: str = 'None')[source]
Return True if the dataframe’s column(s) contain only unique values, False otherwise.
- Parameters:
dataframe (pd.DataFrame) – _description_
column_value (_type_) – _description_
input1 (_type_) – _description_
input2 (str, optional) – _description_. Defaults to “None”.
- m2m_postaviz.data_utils.is_valid_dir(dirpath: Path)[source]
Return True if the directory exists, False otherwise.
- Parameters:
dirpath (Path) – Path of the directory.
- Returns:
bool – True if dir exists, False otherwise
- m2m_postaviz.data_utils.load_sample_cscope_data(dir_path: Path, cscope_directory: Path, cscope_file_format, save_path: Path)[source]
Open all directories in the path given with the -d input. Get all cscopes, then load and save them as dataframes in parquet.gzip format. This keeps RAM usage low during the process.
- Parameters:
dir_path (Path) – Path of directory
- Returns:
dict – sample_data dictionary
- m2m_postaviz.data_utils.load_sample_iscope_data(dir_path: Path, iscope_directory: Path, iscope_file_format)[source]
Load and save iscope data as dataframe in parquet.gzip format.
- Parameters:
dir_path (Path) – Directory path given in cli (-d)
iscope_directory (Path) – Path of newly created save directory.
iscope_file_format (bool) – Format to save.
- m2m_postaviz.data_utils.metadata_processing(metadata_path: Path, save_path: Path)[source]
Simple function to save the metadata as a parquet file. This allows the dtypes of the file to be changed in the application while keeping a safe copy of the original file.
- Parameters:
metadata_path (Path) – Path to metadata file.
save_path (Path) – Saving path.
- Returns:
None if the file already exists in save_path.
- m2m_postaviz.data_utils.open_tsv(file_name: str, convert_cpd_id: bool = False, rename_columns: bool = False, first_col: str = 'smplID')[source]
Open tsv file as a pandas dataframe.
- Parameters:
file_name (str) – Path of the file
rename_columns (bool, optional) – Rename the first column and decode the metabolites names in sbml format into readable format. Defaults to False.
first_col (str, optional) – Label of the first col if rename_columns is True. Defaults to “smplID”.
- Returns:
Dataframe – Pandas dataframe
- m2m_postaviz.data_utils.padmet_to_tree(save_path: Path, metacyc_file_path: Path)[source]
Build a tree to be used in the Shiny application. It allows the user to directly select a compound or a category of compounds and to fill a list with all the compounds corresponding to that category.
Use the function build_parent_child_dataframe to create a two-column (child_id/parent_id) dataframe. With the relation dataframe, build the tree using build_tree_from_root.
- Parameters:
save_path (Path) – Path of the save directory.
- m2m_postaviz.data_utils.preprocessing_for_statistical_tests(dataframe: DataFrame, y_value, input1, input2=None, multipletests: bool = False, multipletests_method: str = 'bonferroni')[source]
Create a dataframe for each y_value in the list, to separate them and use the wilcoxon_man_whitney function. Concatenate all results into one dataframe.
- Parameters:
dataframe (pd.DataFrame) – Dataframe to test.
y_value (_type_) – List of column labels to separate into several dataframes. Must be at least of length 1.
input1 (_type_) – First user’s input.
input2 (_type_, optional) – Second user’s input. Defaults to None.
- Returns:
Dataframe – Dataframe of statistical test.
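The default multipletests_method is 'bonferroni'. As a reminder of what that correction does, here is a standard-library sketch; it is not the package's code, and the function name is hypothetical. Each p-value is multiplied by the number of tests and capped at 1.

```python
def bonferroni(p_values: list) -> list:
    """Return Bonferroni-adjusted p-values: p * number of tests, capped at 1."""
    n_tests = len(p_values)
    return [min(p * n_tests, 1.0) for p in p_values]


# Three hypothetical p-values from three pairwise tests.
adjusted = bonferroni([0.01, 0.04, 0.5])
```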
- m2m_postaviz.data_utils.producers_dataframe(scope_directory: Path, save_path: Path, scope_type: str)[source]
- m2m_postaviz.data_utils.relative_abundance(abundance_path: Path, save_path: Path, cscope_dir: Path, scope: str)[source]
Generate a second main_dataframe with the production based on weight from the abundance matrix.
- Parameters:
abundance_path (Path) – Pathlib Path of the abundance file.
cscope_dir (Path) – Pathlib Path of the cscope directory.
save_path (Path) – Pathlib Path of the output directory.
- Raises:
RuntimeError – If more than one column has a non-numeric type.
- Returns:
Dataframe – production dataframe with sample in rows and compounds in column. Weighted by abundance.
- m2m_postaviz.data_utils.retrieve_all_cscope(sample, dir_path: Path, cscope_directoy: Path, cscope_file_format)[source]
Retrieve iscope, cscope, added_value and contribution_of_microbes files in the path given using os.listdir().
- Parameters:
path (str) – Directory path
- Returns:
dict – Return a nested dict object where each key is a dictionary of a sample. The keys of those second-layer dicts [iscope, cscope, advalue, contribution] give access to these files.
- m2m_postaviz.data_utils.retrieve_all_iscope(sample, dir_path, iscope_directoy, iscope_file_format)[source]
Retrieve iscope, cscope, added_value and contribution_of_microbes files in the path given using os.listdir().
- Parameters:
path (str) – Directory path
- Returns:
dict – Return a nested dict object where each key is a dictionary of a sample. The keys of those second-layer dicts [iscope, cscope, advalue, contribution] give access to these files.
- m2m_postaviz.data_utils.sum_and_concat_by_chunk(directory_path: Path)[source]
Produce the dataframe from chunks of 250 samples. Better memory usage for a small performance price.
- Parameters:
directory_path (Path) – _description_
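Processing in fixed-size chunks bounds how much data is held in memory at once. The generator below is a generic sketch of that idea, with a hypothetical name and a plain list of sample IDs standing in for the files on disk.

```python
def iter_chunks(items: list, chunk_size: int = 250):
    """Yield successive slices of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


# 600 hypothetical sample IDs are split into chunks of 250, 250 and 100.
samples = [f"sample_{i}" for i in range(600)]
chunk_sizes = [len(chunk) for chunk in iter_chunks(samples)]
```

Each chunk can then be summed on its own and the partial results concatenated, so that only one chunk needs to be loaded at a time.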
- m2m_postaviz.data_utils.taxonomy_processing(taxonomy_filepath: Path, save_path: Path)[source]
Open and save taxonomy file.
- Parameters:
taxonomy_filepath (Path) – TSV or TXT format
- Raises:
RuntimeError – Wrong file format
- Returns:
pd.DataFrame – Pandas dataframe
- m2m_postaviz.data_utils.total_production_by_sample(save_path: Path, abundance_path: Path | None = None)[source]
Create and save the total production dataframe. This dataframe contains all samples in rows and all compounds in columns. For each sample, the compounds produced by each bin are summed up to get the estimated total production of each compound by sample and the number of bins that produced those compounds.
If the abundance is provided, each production (1) of a bin is multiplied by the abundance of that bin in its sample, which gives an estimated production of compounds weighted by the abundance of the producing bin.
- Parameters:
save_path (_type_) – Save path given in CLI
abundance_path (str, optional) – Abundance file path given in CLI. Defaults to None.
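The sum described above can be worked through for a single sample with plain dictionaries. The data layout (bin ID mapped to a set of produced compounds, bin ID mapped to an abundance) and the function name are assumptions made for illustration.

```python
from typing import Optional


def total_production(bin_production: dict, abundance: Optional[dict] = None) -> dict:
    """Sum the production of each compound over the bins of one sample.

    Without abundance each producing bin counts as 1; with abundance each
    producing bin counts as its abundance in the sample.
    """
    totals = {}
    for bin_id, compounds in bin_production.items():
        weight = 1.0 if abundance is None else abundance[bin_id]
        for cpd in compounds:
            totals[cpd] = totals.get(cpd, 0.0) + weight
    return totals


# Hypothetical sample with two bins.
bins = {"bin_A": {"CPD-1", "CPD-2"}, "bin_B": {"CPD-1"}}
unweighted = total_production(bins)
weighted = total_production(bins, abundance={"bin_A": 0.25, "bin_B": 0.5})
```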
- m2m_postaviz.data_utils.wilcoxon_man_whitney(dataframe: DataFrame, y, first_factor: str, second_factor: str | None = None, multiple_correction: bool = False, correction_method: str = 'hs')[source]
Take one dataframe with a single value column y and return a dataframe of statistical tests. First, sub-arrays are built by the first input, then by the second input, and converted to NumPy arrays. Then a Wilcoxon or Mann-Whitney test is run on each pair, without duplicates. If the two arrays of a pair have the same length, the Wilcoxon test is used; otherwise, the Mann-Whitney test is used.
- Parameters:
dataframe (pd.DataFrame) – Pandas dataframe.
y (str) – Column label containing the values to test.
first_factor (str) – Column label of the first user’s input.
second_factor (str, optional) – Column label of the second user’s input. Defaults to None.
- Returns:
Dataframe – Dataframe of test’s results.
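The pairing and test-selection rule can be sketched independently of any statistics library. The function below only decides which test would apply to each pair of groups; the name `plan_pairwise_tests` is hypothetical, and the tests themselves could be run with, for example, scipy.stats.wilcoxon and scipy.stats.mannwhitneyu.

```python
from itertools import combinations


def plan_pairwise_tests(groups: dict) -> list:
    """List each unordered pair of groups once, with the test its sizes call for."""
    plan = []
    for (name_a, values_a), (name_b, values_b) in combinations(groups.items(), 2):
        # Same length: paired Wilcoxon test. Different lengths: Mann-Whitney.
        test = "Wilcoxon" if len(values_a) == len(values_b) else "Mann-Whitney"
        plan.append((name_a, name_b, test))
    return plan


# Hypothetical value arrays grouped by one factor.
groups = {"d0": [1, 2, 3], "d90": [2, 3, 4], "d180": [5, 6]}
plan = plan_pairwise_tests(groups)
```

Using combinations rather than a double loop guarantees that each pair appears once and that no group is compared with itself.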
Overview exploration
Exploration of metabolites
Exploration of taxa
Lineage and taxonomy
Shiny app
- m2m_postaviz.shiny_module.bin_exploration_processing(data: DataStorage, factor, factor_choice, rank, rank_choice, with_abundance, color, group_by_metadata=False, save_raw_data=False)[source]
Take inputs from the Shiny application and return three Plotly objects:
- a histogram of the unique production of metabolites by the selected bins, weighted by abundance or not;
- a box plot of the production of metabolites by the selected bins;
- a bar plot of the abundance of each bin by sample.
Each plot can be customised with the metadata selected by the user.
Pre-processing is needed first to get only the bins of interest from the chunks of bins_dataframe stored on the hard drive.
- Parameters:
data (DataStorage) – Data object giving access to the dataframes on disk.
factor (str) – Column of the metadata selected for filtering.
factor_choice (str) – One or several unique values from the selected factor column.
rank (str) – The taxonomic rank selected.
rank_choice (str) – The unique value of the taxonomic rank selected.
with_abundance (bool) – If the production value of the bins should be weighted by their abundance in their sample.
color (str) – Column of the metadata selected to group result by color.
- Returns:
tuple – (Tuple(bin_production_plot_cscope, bin_production_plot_iscope), Abundance_plot, time)
- m2m_postaviz.shiny_module.cpd_reached_plot(data: DataStorage, metadata_input: str, multiple_correction, correction_method)[source]
Produce and return a plotly.express boxplot of the compounds reached by the sample in community, individually or not reached. The plot can be grouped by the metadata.
- Parameters:
data (DataStorage) – DataStorage object.
metadata_input (str) – Metadata column label.
- Returns:
plotly.express.boxplot – Plotly boxplot
- m2m_postaviz.shiny_module.get_significance_symbol(pval: float) str[source]
Return Significance symbol depending on pvalue given.
- Parameters:
pval (float) – Pvalue of the test
- Returns:
str – Significance’s symbol
- m2m_postaviz.shiny_module.global_production_statistical_dataframe(data: DataStorage, user_input1, user_input2, multiple_test_correction, correction_method, with_abundance)[source]
- m2m_postaviz.shiny_module.make_pcoa(data: DataStorage, column, choices, abundance, color)[source]
Produce a Principal Coordinate Analysis with data. The PCoA can be customized by filtering on a specific column, using the abundance data and coloring the resulting plot.
- Parameters:
data (DataStorage) – DataStorage Object.
column (str) – Column label used for filtering.
choices (list) – Unique values of the column input to use.
abundance (bool) – Option to use the column with abundance values instead of the {0 not produced ,1 produced} values.
color (str) – Column label used for the color option of the plot.
- Returns:
px.scatter – Plotly scatter figure.
- m2m_postaviz.shiny_module.metabolites_production_statistical_dataframe(data: DataStorage, metabolites_choices, user_input1, user_input2, multiple_test_correction, correction_method, save_raw_data)[source]
- m2m_postaviz.shiny_module.percentage_smpl_producing_cpd(data: DataStorage, cpd_input: list, metadata_filter_input: str, sample_filter_button='All', sample_filter_value=[], save_raw_data=False)[source]
Produce two Plotly bar plot figures from the list of compounds and the column filter given in input.
- Parameters:
data (DataStorage) – DataStorage object
cpd_input (list) – List of compounds input
metadata_filter_input (str) – Column label of metadata filter
sample_filter_button – Enable row filtering by sample’s ID or value of metadata.
- Returns:
Tuple – Tuple with cscope plot and iscope plot
- m2m_postaviz.shiny_module.reached_compounds_plot_stats_tests(df, metadata_input, multiple_correction, correction_method)[source]
Takes the reached_cpd_plot dataframe to process the different combinations for the Wilcoxon/Mann-Whitney test.
- Parameters:
df (pl.Dataframe) – Dataframe of the plot.
metadata_input (_type_) – Metadata column chosen in input.
- Raises:
TypeError – The dataframe in input must be a Polars dataframe.
- Returns:
Dataframe – Dataframe of the results
- m2m_postaviz.shiny_module.render_reactive_metabolites_production_plot(data: DataStorage, compounds_input, user_input1, color_input='None', sample_filter_button='All', sample_filter_value=[], with_abundance=None, save_raw_data=False)[source]
- m2m_postaviz.shiny_module.render_reactive_total_production_plot(data: DataStorage, user_input1, user_input2, with_abundance)[source]
Produce and return a plotly figure object. Barplot or Boxplot if there is only unique value in columns.
- Parameters:
data (DataStorage) – DataStorage object.
user_input1 (_type_) – Column label for metadata filtering.
user_input2 (_type_) – Column label for metadata filtering.
with_abundance (bool) – Option to use the column with abundance values instead of the {0 not produced ,1 produced} values.
- Returns:
px.box – Plotly express object. pd.Dataframe: dataframe used for the plot.
- m2m_postaviz.shiny_module.run_pcoa(main_dataframe: DataFrame, metadata: DataFrame, distance_method: str = 'jaccard')[source]
Calculate a Principal Coordinate Analysis with the dataframe given in args. Use the metadata dataframe as second argument to return the full ordination result plus all metadata columns inserted along the Ordination.samples dataframe. Ready to be plotted.
- Parameters:
main_dataframe (pd.DataFrame) – Main dataframe of compound production
metadata (pd.DataFrame) – Metadata’s dataframe
- Returns:
pd.DataFrame – Ordination results object from skbio’s package.
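The default distance_method is 'jaccard'. For binary production vectors (1 produced, 0 absent), the Jaccard distance between two samples is one minus the ratio of compounds produced by both to compounds produced by at least one. The sketch below computes that distance with the standard library only; it illustrates the metric and is not the code path used by the package, and the function names are hypothetical.

```python
def jaccard_distance(a: list, b: list) -> float:
    """Jaccard distance between two binary vectors of equal length."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    if either == 0:  # neither sample produces anything: treat as identical
        return 0.0
    return 1.0 - both / either


def distance_matrix(rows: dict) -> dict:
    """Return {(sample_a, sample_b): distance} for every ordered pair."""
    return {
        (s1, s2): jaccard_distance(v1, v2)
        for s1, v1 in rows.items()
        for s2, v2 in rows.items()
    }


# Hypothetical production vectors over four compounds.
matrix = distance_matrix({"s1": [1, 1, 0, 0], "s2": [1, 0, 1, 0]})
```

A PCoA then takes such a symmetric distance matrix as input and returns sample coordinates.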
- m2m_postaviz.shiny_module.sns_clustermap(data: DataStorage, cpd_input, metadata_input=None, row_cluster=False, col_cluster=False, filter_mode=None, filter_values=None, save_raw_data=False)[source]
Produce a customizable Seaborn clustermap. The distance matrix uses the Jaccard method when clustering is enabled.
- Parameters:
data (DataStorage) – DataStorage object.
cpd_input (list) – list of compounds input to filter.
metadata_input (str, optional) – Column label to filter sample by their metadata. Add a ROW color. Defaults to None.
row_cluster (bool, optional) – Dendrogram for rows from distance matrix. Defaults to False.
col_cluster (bool, optional) – Dendrogram for cols from distance matrix. Defaults to False.
filter_mode (str, optional) – Mode of sample’s filter if enabled. Defaults to None.
filter_values (list, optional) – list of samples to filter. Defaults to None.
- Returns:
list – List of three clustermap matrix objects.
- m2m_postaviz.shiny_module.split_value_column_with_metadata(df, metadata_column)[source]
Split the metadata column of the dataframe into a dictionary whose keys are the unique values of the column and whose values are lists of values from the “value” column corresponding to each metadata value.
Example:
┌────────────┬──────┬───────┐
│ smplID     ┆ Days ┆ value │
│ ---        ┆ ---  ┆ ---   │
│ str        ┆ i64  ┆ i64   │
╞════════════╪══════╪═══════╡
│ ERAS1d0    ┆ 0    ┆ 1027  │
│ ERAS2d0    ┆ 0    ┆ 1021  │
│ ERAS3d0    ┆ 0    ┆ 942   │
│ ERAS4d0    ┆ 0    ┆ 1086  │
│ ERAS5d0    ┆ 0    ┆ 1069  │
│ ERAS8d180  ┆ 180  ┆ 1034  │
│ ERAS9d180  ┆ 180  ┆ 1040  │
│ ERAS10d180 ┆ 180  ┆ 1061  │
│ ERAS11d180 ┆ 180  ┆ 1105  │
│ ERAS12d180 ┆ 180  ┆ 1027  │
└────────────┴──────┴───────┘
Expected result: {0: [1027, 1021, 942, 1086, 1069], 180: [1034, 1040, 1061, 1105, 1027]}
- Parameters:
df (pl.Dataframe) – Polars dataframe.
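The grouping can be reproduced on the rows of the example table with plain Python. This sketch replaces the Polars dataframe with a list of dicts, which is an assumption made to keep it dependency-free; `split_values` is a hypothetical name.

```python
def split_values(rows: list, metadata_column: str) -> dict:
    """Group the "value" entries of rows by the value of metadata_column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[metadata_column], []).append(row["value"])
    return groups


# Rows taken from the example table (smplID, Days, value).
table = [
    ("ERAS1d0", 0, 1027), ("ERAS2d0", 0, 1021), ("ERAS3d0", 0, 942),
    ("ERAS4d0", 0, 1086), ("ERAS5d0", 0, 1069), ("ERAS8d180", 180, 1034),
    ("ERAS9d180", 180, 1040), ("ERAS10d180", 180, 1061),
    ("ERAS11d180", 180, 1105), ("ERAS12d180", 180, 1027),
]
rows = [{"smplID": s, "Days": d, "value": v} for s, d, v in table]
by_days = split_values(rows, "Days")
```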
- m2m_postaviz.shiny_module.wilcoxon_mann_whitney(data, metadata_input, context, multiple_correction, multiple_correction_method)[source]
Receive a dictionary whose key/value pairs are used to apply the Wilcoxon test if the value arrays have the same length, and the Mann-Whitney test otherwise.
- Parameters:
data (Dict) – Dictionary of the metadata {unique metadata value : [ value array ]}
metadata_input (_type_) – _description_
- Returns:
pd.Dataframe – Pandas dataframe of the resulting tests.