THOI is a Python package designed to compute O information in Higher Order Interactions using batch processing. This package leverages PyTorch for efficient tensor operations.
Installing THOI with Your Preferred Version of PyTorch
Because PyTorch installation can depend on the user environment and requirements (GPU or CPU support or a specific version of PyTorch), you need to install PyTorch separately before installing THOI. Follow these steps:
After installation, you can start using THOI in your projects. Here is a simple example:
```python
from thoi.measures.gaussian_copula import multi_order_measures, nplets_measures
from thoi.heuristics import simulated_annealing, greedy
import numpy as np

X = np.random.normal(0, 1, (1000, 10))

# Compute O-information for the n-plet that includes all the variables of X
measures = nplets_measures(X)

# Compute O-information for a single n-plet
# (it must be passed as a list of n-plets, even if there is only one)
measures = nplets_measures(X, [[0, 1, 3]])

# Compute O-information for multiple n-plets
measures = nplets_measures(X, [[0, 1, 3], [3, 7, 4], [2, 6, 3]])

# Exhaustively compute O-information measures over all combinations of features in X
measures = multi_order_measures(X)

# Find the best 10 combinations of features (n-plets) using greedy search, starting with an
# exhaustive search at the lowest order and building up from there. The result shows the best
# O-information for each order built.
best_nplets, best_scores = greedy(X, 3, 5, repeat=10)

# Find the best 10 combinations of features (n-plets) using simulated annealing.
# There are two initialization options:
# 1. Start from a custom initial solution with shape (repeat, order) explicitly provided by the user.
# 2. Select random samples of the given order.
# The result shows the best O-information for each order built.
best_nplets, best_scores = simulated_annealing(X, 5, repeat=10)
```
For detailed usage and examples, please refer to the documentation.
We welcome contributions from the community. If you encounter any issues or have suggestions for improvements, please open an issue or submit a pull request on GitHub.
@misc{thoi,
  author = {Laouen Belloli and Rubén Herzog},
  title = {THOI: An efficient and accessible library for computing higher-order interactions enhanced by batch-processing},
  year = {2024},
  url = {https://pypi.org/project/thoi/}
}
APA
Belloli, L., & Herzog, R. (2023). THOI: An efficient library for higher order interactions analysis based on Gaussian copulas enhanced by batch-processing. Retrieved from https://pypi.org/project/thoi/
MLA
Belloli, Laouen, and Rubén Herzog. THOI: An efficient library for higher order interactions analysis based on Gaussian copulas enhanced by batch-processing. 2023. Web. https://pypi.org/project/thoi/.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the “Software”), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Compute multi-order measures (TC, DTC, O, S) for the given data matrix X.
The measurements computed are:
Total Correlation (TC)
Dual Total Correlation (DTC)
O-information (O)
S-information (S)
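For intuition, all four measures can be written as combinations of entropies: TC is the sum of single-variable entropies minus the joint entropy, DTC is the sum of leave-one-out entropies minus (N - 1) times the joint entropy, O = TC - DTC, and S = TC + DTC. The sketch below evaluates these for jointly Gaussian variables directly from a covariance matrix. It is an illustration under that Gaussian assumption, not THOI's implementation, and it applies no bias correction; the helper names `gaussian_entropy` and `gaussian_measures` are hypothetical.

```python
import numpy as np

def gaussian_entropy(cov: np.ndarray) -> float:
    """Differential entropy (in nats) of a Gaussian with covariance `cov`."""
    n = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (n * np.log(2 * np.pi * np.e) + logdet)

def gaussian_measures(cov: np.ndarray) -> dict:
    """TC, DTC, O and S for a Gaussian system with covariance `cov`."""
    n = cov.shape[0]
    h_joint = gaussian_entropy(cov)
    # Sum of the entropies of each variable taken alone
    h_single = sum(gaussian_entropy(cov[np.ix_([i], [i])]) for i in range(n))
    # Sum of the entropies of the system with one variable left out
    h_rest = 0.0
    for i in range(n):
        keep = [j for j in range(n) if j != i]
        h_rest += gaussian_entropy(cov[np.ix_(keep, keep)])
    tc = h_single - h_joint
    dtc = h_rest - (n - 1) * h_joint
    return {"tc": tc, "dtc": dtc, "o": tc - dtc, "s": tc + dtc}

# Three equally correlated variables: O comes out positive (redundancy-dominated)
cov = np.array([[1.0, 0.5, 0.5],
                [0.5, 1.0, 0.5],
                [0.5, 0.5, 1.0]])
print(gaussian_measures(cov))
```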
Parameters:
X (TensorLikeArray) – Input data, which can be one of the following:
- A single torch.Tensor or np.ndarray with shape (T, N).
- A sequence (e.g., list) of torch.Tensor or np.ndarray, each with shape (T, N), representing multiple datasets.
- A sequence of sequences, where each inner sequence is an array-like object of shape (T, N).
If covmat_precomputed is True, X should be:
- A single torch.Tensor or np.ndarray covariance matrix with shape (N, N).
- A sequence of covariance matrices, each with shape (N, N).
min_order (int, optional) – Minimum order to compute. Default is 3. Note: 3 <= min_order <= max_order <= N.
max_order (int, optional) – Maximum order to compute. If None, uses N (number of variables). Default is None. Note: min_order <= max_order <= N.
covmat_precomputed (bool, optional) – If True, X is treated as covariance matrices instead of raw data. Default is False.
T (int or list of int, optional) – Number of samples used to compute bias correction. This parameter is used only if covmat_precomputed is True.
If X is a sequence of covariance matrices, T should be a list of sample sizes corresponding to each matrix.
If T is None and covmat_precomputed is True, bias correction is not applied. Default is None.
batch_size (int, optional) – Batch size for DataLoader. Default is 1,000,000.
device (torch.device, optional) – Device to use for computation. Default is torch.device('cpu').
num_workers (int, optional) – Number of workers for DataLoader. Default is 0.
batch_aggregation (callable, optional) – Function to aggregate the collected batch data into the final result.
It should accept a list of outputs from batch_data_collector and return the final aggregated result.
The return type of this function determines the return type of multi_order_measures.
By default, it uses concat_and_sort_csv, which concatenates CSV data and sorts it, returning a pandas DataFrame.
For more information, see the Concat and sort CSV entry in the Collectors section.
batch_data_collector (callable, optional) –
Function to process and collect data from each batch.
It should accept the following parameters:
nplets: torch.Tensor of n-plet indices, shape (batch_size, order)
nplets_tc: torch.Tensor of total correlation values, shape (batch_size, D)
nplets_dtc: torch.Tensor of dual total correlation values, shape (batch_size, D)
nplets_o: torch.Tensor of O-information values, shape (batch_size, D)
nplets_s: torch.Tensor of S-information values, shape (batch_size, D)
batch_number: int, the current batch number
The output of batch_data_collector must be compatible with the input expected by batch_aggregation.
By default, it uses batch_to_csv, which collects data into CSV. For more information, see the Batch to CSV entry in the Collectors section.
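As a rough illustration of how a custom collector and aggregator could fit together, the sketch below keeps only the n-plet with the lowest O-information (in the first dataset) from each batch and then picks the overall minimum. The function names `min_o_collector` and `min_o_aggregation` and the helper `_to_numpy` are hypothetical; only the collector's parameter list follows the description above.

```python
import numpy as np

def _to_numpy(x):
    """Convert a torch.Tensor (on any device) or an array-like to a NumPy array."""
    if hasattr(x, "detach"):  # duck-typing check for torch.Tensor
        x = x.detach().cpu().numpy()
    return np.asarray(x)

def min_o_collector(nplets, nplets_tc, nplets_dtc, nplets_o, nplets_s, batch_number):
    """Keep only the n-plet with the lowest O-information in the first dataset."""
    o = _to_numpy(nplets_o)[:, 0]
    best = int(np.argmin(o))
    return {
        "batch": batch_number,
        "nplet": _to_numpy(nplets)[best].tolist(),
        "o": float(o[best]),
    }

def min_o_aggregation(batch_results):
    """Return the single best record across all collected batches."""
    return min(batch_results, key=lambda record: record["o"])
```

Under these assumptions, the pair would be passed as `batch_data_collector=min_o_collector` and `batch_aggregation=min_o_aggregation`, and the call would return a plain dictionary rather than a DataFrame.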
Returns:
Any – The aggregated result of the computed measures. The exact type depends on the batch_aggregation function used.
By default, it returns a pandas DataFrame containing the computed metrics (DTC, TC, O, S), the n-plets indexes,
the order and the dataset information.
Where:
D (int) – Number of datasets. If X is a single dataset, D = 1.
N (int) – Number of variables (features) in each dataset.
T (int) – Number of samples in each dataset (if applicable).
order (int) – The size of the n-plets being analyzed, ranging from min_order to max_order.
batch_size (int) – Number of n-plets processed in each batch.
Notes
The default batch_data_collector and batch_aggregation functions are designed to work together.
If you provide custom functions, ensure that the output of batch_data_collector is compatible with the input of batch_aggregation.
Ensure that the length of T matches the number of datasets when covmat_precomputed is True and X is a sequence of covariance matrices.
The function computes measures for all combinations of variables of orders ranging from min_order to max_order.
The function is optimized for batch processing using PyTorch tensors, facilitating efficient computations on large datasets.
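Because every combination of variables is visited, the workload grows combinatorially with N and the order. The following sketch estimates how many n-plets an exhaustive search would cover and how many batches that implies; the helper names are hypothetical, and batching per order is an assumption made for this illustration.

```python
from math import ceil, comb

def count_nplets(n_variables: int, min_order: int = 3, max_order=None) -> dict:
    """Number of n-plets per order for an exhaustive multi-order search."""
    if max_order is None:
        max_order = n_variables
    if not 3 <= min_order <= max_order <= n_variables:
        raise ValueError("expected 3 <= min_order <= max_order <= N")
    return {order: comb(n_variables, order)
            for order in range(min_order, max_order + 1)}

def count_batches(n_nplets: int, batch_size: int = 1_000_000) -> int:
    """Batches needed for `n_nplets` n-plets; the last batch may be partial."""
    return ceil(n_nplets / batch_size)

counts = count_nplets(20, min_order=3, max_order=5)
print(counts)                        # {3: 1140, 4: 4845, 5: 15504}
print(count_batches(comb(30, 10)))   # 31
```

Numbers like these help decide whether an exhaustive search is feasible or whether a heuristic search is the better choice.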
Examples
Using default batch data collector and aggregation:
Compute higher-order measures (TC, DTC, O, S) for specified n-plets in the given data matrices X.
The computed measures are:
Total Correlation (TC)
Dual Total Correlation (DTC)
O-information (O)
S-information (S)
Parameters:
X (TensorLikeArray) – Input data, which can be one of the following:
- A single torch.Tensor or np.ndarray with shape (T, N).
- A sequence (e.g., list) of torch.Tensor or np.ndarray, each with shape (T, N), representing multiple datasets.
- A sequence of sequences, where each inner sequence is an array-like object of shape (T, N).
If covmat_precomputed is True, X should be:
- A single torch.Tensor or np.ndarray covariance matrix with shape (N, N).
- A sequence of covariance matrices, each with shape (N, N).
nplets (TensorLikeArray, optional) – The n-plets to calculate the measures, with shape (n_nplets, order). If None, all possible n-plets of the given order are considered.
covmat_precomputed (bool, optional) – If True, X is treated as covariance matrices instead of raw data. Default is False.
T (int or list of int, optional) – Number of samples used to compute bias correction. This parameter is used only if covmat_precomputed is True.
If X is a sequence of covariance matrices, T should be a list of sample sizes corresponding to each matrix.
If T is None and covmat_precomputed is True, bias correction is not applied. Default is None.
device (torch.device, optional) – Device to use for computation. Default is torch.device('cpu').
verbose (int, optional) – Logging verbosity level. Default is logging.INFO.
batch_size (int, optional) – Batch size for processing n-plets. Default is 1,000,000.
Returns:
torch.Tensor – Tensor containing the computed measures for each n-plet with shape (n_nplets, D, 4)
Where:
D (int) – Number of datasets. If X is a single dataset, D = 1.
N (int) – Number of variables (features) in each dataset.
T (int) – Number of samples in each dataset (if applicable).
order (int) – The size of the n-plets being analyzed.
n_nplets (int) – Number of n-plets processed.
Examples
Compute measures for all possible 3-plets in a single dataset:
Brief: Compute the higher order measurements (tc, dtc, o and s) for the given data matrices X over the nplets.
Parameters:
- X TensorLikeArray: The input data to compute the nplets. It can be a list of 2D numpy arrays or tensors of shape: 1. (T, N) where T is the number of samples if X are multivariate series. 2. a list of 2D covariance matrices with shape (N, N).
- nplets (Optional[Union[np.ndarray,torch.Tensor]]): The nplets to calculate the measures with shape (batch_size, order)
- covmat_precomputed (bool): A boolean flag to indicate if the input data is a list of covariance matrices or multivariate series.
- T (Optional[Union[int, List[int]]]): A list of integers indicating the number of samples for each multivariate series.
- device (torch.device): The device to use for the computation. Default is ‘cpu’.
- batch_size (int): Batch size for processing n-plets. Default is 100,000.
Returns:
- torch.Tensor: The measures for the nplets with shape (n_nplets, D, 4) where D is the number of matrices, n_nplets is the number of nplets to calculate over and 4 is the number of metrics (tc, dtc, o, s)
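The sketch below shows one way to read a result of shape (n_nplets, D, 4). It uses a small NumPy array as a stand-in for the returned tensor and assumes the last axis is ordered (tc, dtc, o, s) as listed above; the values are made up for illustration.

```python
import numpy as np

# Stand-in for the returned tensor: 3 n-plets, D=2 datasets, 4 measures.
# A real torch.Tensor can be indexed the same way.
measures = np.array([
    [[0.9, 0.4, 0.5, 1.3], [0.8, 0.5, 0.3, 1.3]],
    [[0.2, 0.6, -0.4, 0.8], [0.3, 0.5, -0.2, 0.8]],
    [[0.5, 0.5, 0.0, 1.0], [0.4, 0.7, -0.3, 1.1]],
])
TC, DTC, O, S = range(4)  # assumed order along the last axis

o_info = measures[:, :, O]                # shape (n_nplets, D)
synergistic = (o_info < 0).all(axis=1)    # negative O in every dataset
print(np.flatnonzero(synergistic))        # [1]
```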
Convert batch results to a pandas DataFrame and optionally save to CSV.
This function processes the measures computed for n-plets in a batch and converts them into a pandas DataFrame.
It can also save the DataFrame to a CSV file if an output path is provided.
Parameters:
nplets_idxs (torch.Tensor) – Indices of the n-plets. Shape: (batch_size, order).
nplets_tc (torch.Tensor) – Total correlation values. Shape: (batch_size, D).
nplets_dtc (torch.Tensor) – Dual total correlation values. Shape: (batch_size, D).
nplets_o (torch.Tensor) – O-information values. Shape: (batch_size, D).
nplets_s (torch.Tensor) – S-information values. Shape: (batch_size, D).
bn (int) – Batch number, used for identification in output files.
only_synergetic (bool, optional) – If True, only includes n-plets with negative O-information (synergetic). Default is False.
columns (list of str, optional) – Names of the variables (features). If None, variable names will be generated as 'var_0', 'var_1', …, 'var_N-1'.
N (int, optional) – Total number of variables. Required if columns is not provided.
sep (str, optional) – Separator to use in the CSV file. Default is tab ('\t').
indexing_method (str, optional) – Method used to represent n-plets. Can be 'indexes' or 'hot_encoded'. Default is 'indexes'.
output_path (str, optional) – Path to save the CSV file. If None, the DataFrame is returned instead of being saved.
Returns:
pd.DataFrame or None – DataFrame containing the measures and variable information for the n-plets.
Returns None if output_path is provided and the DataFrame is saved to a file.
Where:
D (int) – Number of datasets. If measures are computed over multiple datasets, D > 1.
N (int) – Number of variables (features).
batch_size (int) – Number of n-plets in the batch.
order (int) – Order of the n-plets (number of variables in each n-plet).
Notes
The function can filter out n-plets with non-negative O-information if only_synergetic is True.
The resulting DataFrame includes the measures and a binary indicator for each variable indicating its presence in the n-plet.
The DataFrame also includes columns for ‘order’ and ‘dataset’.
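To make the two n-plet representations concrete, the sketch below converts index-based n-plets into rows of binary indicators (one column per variable) and back. It only illustrates the idea behind 'indexes' versus 'hot_encoded'; the helper names are hypothetical and the exact column layout of the DataFrame produced by the library is not reproduced here.

```python
def hot_encode(nplets, n_variables):
    """Turn index n-plets into 0/1 rows with one column per variable."""
    rows = []
    for nplet in nplets:
        row = [0] * n_variables
        for idx in nplet:
            row[idx] = 1
        rows.append(row)
    return rows

def decode(rows):
    """Inverse of hot_encode: recover sorted variable indices from 0/1 rows."""
    return [[i for i, flag in enumerate(row) if flag] for row in rows]

nplets = [[0, 1, 3], [2, 3, 4]]
encoded = hot_encode(nplets, n_variables=5)
print(encoded)          # [[1, 1, 0, 1, 0], [0, 0, 1, 1, 1]]
print(decode(encoded))  # [[0, 1, 3], [2, 3, 4]]
```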
bn (int, optional) – Batch number. Not used in the function but kept for compatibility.
top_k (int, optional) – If provided, selects the top-k n-plets based on the specified metric.
metric (string with value 'dtc', 'tc', 'o' or 's' or Callable, optional) – Metric to use for ranking if top_k is provided. Default is ‘o’ (O-information).
largest (bool, optional) – If True, selects the n-plets with the largest metric values; if False, selects the n-plets with the smallest values. Default is False.
Select the top-k n-plets based on a specified metric.
Parameters:
nplets_idxs (torch.Tensor) – Indices of the n-plets. Shape: (batch_size, order).
nplets_measures (torch.Tensor) – Measures for each n-plet. Shape: (batch_size, D, 4).
k (int) – Number of top n-plets to select.
metric (string with value 'dtc', 'tc', 'o' or 's' or Callable) – Metric to use for ranking the n-plets. Can be a string specifying a measure (‘tc’, ‘dtc’, ‘o’, ‘s’),
or a custom callable that takes nplets_measures and returns a tensor of values.
largest (bool) – If True, selects the n-plets with the largest metric values; if False, selects the n-plets with the smallest values. Default is False.
Returns:
Tuple[torch.Tensor, torch.Tensor, torch.Tensor] –
Selected n-plets measures. Shape: (k, D, 4).
Selected n-plets indices. Shape: (k, order).
Metric values of the selected n-plets. Shape: (k,).
Where:
D (int) – Number of datasets.
order (int) – Order of the n-plets (number of variables in each n-plet).
batch_size (int) – Number of n-plets in the batch.
Notes
The function computes the specified metric for each n-plet and selects the top-k based on this metric.
The metric can be one of the predefined measures or a custom function.
Examples
```python
import torch

# Sample data
nplets_idxs = torch.tensor([[0, 1], [1, 2], [0, 2]])
nplets_measures = torch.rand(3, 1, 4)  # Assuming D=1
k = 2
metric = 'o'  # Use O-information for ranking

# Get top-k n-plets
top_measures, top_idxs, top_values = top_k_nplets(
    nplets_idxs, nplets_measures, k, metric, largest=False
)
```
Greedy algorithm to find the best n-plets at each order, maximizing or minimizing the metric for a given multivariate series or covariance matrices.
Parameters:
X (TensorLikeArray) – The input data to compute the n-plets. It can be a list of 2D numpy arrays or tensors of shape:
1. (T, N) where T is the number of samples if X are multivariate series.
2. A list of 2D covariance matrices with shape (N, N).
initial_order (int, optional) – The initial order to start the greedy algorithm. Default is 3.
order (int, optional) – The final order to stop the greedy algorithm. If None, it will be set to N.
covmat_precomputed (bool, optional) – A boolean flag to indicate if the input data is a list of covariance matrices or multivariate series. Default is False.
T (int or list of int, optional) – A list of integers indicating the number of samples for each multivariate series. Default is None.
repeat (int, optional) – The number of repetitions to do to obtain different solutions starting from less optimal initial solutions. Default is 10.
batch_size (int, optional) – The batch size to use for the computation. Default is 1,000,000.
repeat_batch_size (int, optional) – The batch size for repeating the computation. Default is 1,000,000.
device (torch.device, optional) – The device to use for the computation. Default is ‘cpu’.
metric (Union[str, Callable], optional) – The metric to evaluate. One of ‘tc’, ‘dtc’, ‘o’, ‘s’ or a callable function. Default is ‘o’.
largest (bool, optional) – A flag to indicate if the metric is to be maximized or minimized. Default is False.
Returns:
best_nplets (torch.Tensor) – The n-plets with the best score found with shape (repeat, order).
best_scores (torch.Tensor) – The best scores for the best n-plets with shape (repeat,).
Notes
The function uses a greedy algorithm to iteratively find the best n-plets that maximize or minimize the specified metric.
The initial solutions are computed using the multi_order_measures function.
The function iterates over the remaining orders to get the best solution for each order.
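The greedy idea described in these notes can be sketched in a few lines: search exhaustively at the initial order, then repeatedly add the single variable that gives the best score at the next order. The sketch below is a simplified, single-solution illustration and not the library's batched implementation; `greedy_nplet`, `toy_score` and `weights` are hypothetical, and the score function stands in for a real metric such as O-information.

```python
from itertools import combinations

def greedy_nplet(n_variables, score, initial_order=3, order=None, largest=False):
    """Grow one n-plet from initial_order to order, one variable at a time."""
    order = n_variables if order is None else order
    pick = max if largest else min
    # Exhaustive search over all n-plets of the initial order
    current = list(pick(combinations(range(n_variables), initial_order), key=score))
    while len(current) < order:
        candidates = [v for v in range(n_variables) if v not in current]
        # Add the variable that gives the best score at the next order
        best = pick(candidates, key=lambda v: score(current + [v]))
        current.append(best)
    return current, score(current)

# Toy score (an assumption for this demo): lower is better when weights are large
weights = [1, 9, 3, 8, 5, 7]

def toy_score(nplet):
    return -sum(weights[v] for v in nplet)

best_nplet, best_score = greedy_nplet(6, toy_score, initial_order=3, order=4)
print(sorted(best_nplet), best_score)  # [1, 3, 4, 5] -29
```

A greedy search only explores one path per starting point, which is why running it from several initial solutions (the repeat parameter) can surface different results.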
Simulated annealing algorithm to find the best n-plets of a fixed order, maximizing or minimizing the metric for a given multivariate series or covariance matrices.
Parameters:
X (Union[np.ndarray, torch.Tensor, List[np.ndarray], List[torch.Tensor]]) – The input data to compute the n-plets. It can be a list of 2D numpy arrays or tensors of shape:
1. (T, N) where T is the number of samples if X are multivariate series.
2. A list of 2D covariance matrices with shape (N, N).
order (int, optional) – The order of the n-plets. If None, it will be set to N.
covmat_precomputed (bool, optional) – A boolean flag to indicate if the input data is a list of covariance matrices or multivariate series. Default is False.
T (int or list of int, optional) – A list of integers indicating the number of samples for each multivariate series. Default is None.
initial_solution (torch.Tensor, optional) – The initial solution with shape (repeat, order). If None, a random initial solution is generated.
repeat (int, optional) – The number of repetitions to do to obtain different solutions starting from less optimal initial solutions. Default is 10.
batch_size (int, optional) – The batch size to use for the computation. Default is 1,000,000.
device (torch.device, optional) – The device to use for the computation. Default is ‘cpu’.
max_iterations (int, optional) – The maximum number of iterations for the simulated annealing algorithm. Default is 1000.
early_stop (int, optional) – The number of iterations with no improvement to stop early. Default is 100.
initial_temp (float, optional) – The initial temperature for the simulated annealing algorithm. Default is 100.0.
cooling_rate (float, optional) – The cooling rate for the simulated annealing algorithm. Default is 0.99.
metric (Union[str, Callable], optional) – The metric to evaluate. One of ‘tc’, ‘dtc’, ‘o’, ‘s’ or a callable function. Default is ‘o’.
largest (bool, optional) – A flag to indicate if the metric is to be maximized or minimized. Default is False.
verbose (int, optional) – Logging verbosity level. Default is logging.INFO.
Returns:
best_solution (torch.Tensor) – The n-plets with the best score found with shape (repeat, order).
best_energy (torch.Tensor) – The best scores for the best n-plets with shape (repeat,).
Notes
The function uses a simulated annealing algorithm to iteratively find the best n-plets that maximize or minimize the specified metric.
The initial solutions are computed using the random_sampler function if not provided.
The function iterates over the remaining orders to get the best solution for each order.
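To show how parameters such as initial_temp, cooling_rate, max_iterations and early_stop typically interact, here is a minimal single-solution sketch of simulated annealing over n-plets of a fixed order. It is a generic textbook version with a swap-one-variable move and a Metropolis acceptance rule, not the library's batched implementation; `anneal_nplet`, `toy_score`, `weights` and the `seed` argument are hypothetical.

```python
import math
import random

def anneal_nplet(n_variables, order, score, max_iterations=1000, early_stop=100,
                 initial_temp=100.0, cooling_rate=0.99, largest=False, seed=0):
    """Search for one n-plet of fixed size `order` with a good score."""
    if not 0 < order < n_variables:
        raise ValueError("expected 0 < order < n_variables")
    rng = random.Random(seed)
    sign = -1.0 if largest else 1.0  # always minimise sign * score
    current = rng.sample(range(n_variables), order)
    current_energy = sign * score(current)
    best, best_energy = list(current), current_energy
    temp, stale = initial_temp, 0
    for _ in range(max_iterations):
        # Neighbour: swap one member for a variable outside the n-plet
        candidate = list(current)
        outside = [v for v in range(n_variables) if v not in current]
        candidate[rng.randrange(order)] = rng.choice(outside)
        energy = sign * score(candidate)
        # Metropolis rule: accept improvements, sometimes accept worse moves
        if energy < current_energy or rng.random() < math.exp((current_energy - energy) / temp):
            current, current_energy = candidate, energy
        if current_energy < best_energy:
            best, best_energy, stale = list(current), current_energy, 0
        else:
            stale += 1
            if stale >= early_stop:  # no improvement for early_stop iterations
                break
        temp *= cooling_rate  # geometric cooling schedule
    return best, sign * best_energy

# Toy score (an assumption for this demo), to be minimised
weights = [1, 9, 3, 8, 5, 7]

def toy_score(nplet):
    return -sum(weights[v] for v in nplet)

solution, energy = anneal_nplet(6, 3, toy_score)
print(sorted(solution), energy)
```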
Simulated annealing algorithm to find the best multi-order n-plets, maximizing or minimizing the metric for a given multivariate series or covariance matrices.
Parameters:
X (Union[np.ndarray, torch.Tensor, List[np.ndarray], List[torch.Tensor]]) – The input data to compute the n-plets. It can be a list of 2D numpy arrays or tensors of shape:
1. (T, N) where T is the number of samples if X are multivariate series.
2. A list of 2D covariance matrices with shape (N, N).
covmat_precomputed (bool, optional) – A boolean flag to indicate if the input data is a list of covariance matrices or multivariate series. Default is False.
T (int or list of int, optional) – A list of integers indicating the number of samples for each multivariate series. Default is None.
initial_solution (torch.Tensor, optional) – The initial solution with shape (repeat, N). If None, a random initial solution is generated.
repeat (int, optional) – The number of repetitions to do to obtain different solutions starting from less optimal initial solutions. Default is 10.
batch_size (int, optional) – The batch size to use for the computation. Default is 1,000,000.
device (torch.device, optional) – The device to use for the computation. Default is ‘cpu’.
max_iterations (int, optional) – The maximum number of iterations for the simulated annealing algorithm. Default is 1000.
early_stop (int, optional) – The number of iterations with no improvement to stop early. Default is 100.
initial_temp (float, optional) – The initial temperature for the simulated annealing algorithm. Default is 100.0.
cooling_rate (float, optional) – The cooling rate for the simulated annealing algorithm. Default is 0.99.
step_size (int, optional) – The number of elements to change in each step. Default is 1.
metric (Union[str, Callable], optional) – The metric to evaluate. One of ‘tc’, ‘dtc’, ‘o’, ‘s’ or a callable function. Default is ‘o’.
largest (bool, optional) – A flag to indicate if the metric is to be maximized or minimized. Default is False.
verbose (int, optional) – Logging verbosity level. Default is logging.INFO.
Returns:
best_solution (torch.Tensor) – The n-plets with the best score found with shape (repeat, N).
best_energy (torch.Tensor) – The best scores for the best n-plets with shape (repeat,).
Notes
The function uses a simulated annealing algorithm to iteratively find the best n-plets that maximize or minimize the specified metric.
The initial solutions are computed using the _random_solutions function if not provided.
The function iterates over the remaining orders to get the best solution for each order.
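In the multi-order setting the solution has one entry per variable, shape (repeat, N). One plausible reading, assumed here and not confirmed by the text above, is that each row is a 0/1 mask marking which variables belong to the n-plet, and that step_size is the number of mask entries flipped per move. The sketch below implements such a move for a single row; `flip_neighbor` and its `min_order` guard are hypothetical.

```python
import random

def flip_neighbor(mask, step_size=1, min_order=3, rng=random):
    """Return a copy of `mask` with exactly `step_size` entries flipped.

    Retries until the result keeps at least `min_order` variables selected.
    """
    n = len(mask)
    if n < min_order + step_size:
        raise ValueError("mask too short for this min_order and step_size")
    while True:
        candidate = list(mask)
        for i in rng.sample(range(n), step_size):
            candidate[i] = 1 - candidate[i]  # add or remove variable i
        if sum(candidate) >= min_order:
            return candidate

rng = random.Random(1)
mask = [1, 1, 1, 0, 0, 0]            # n-plet {0, 1, 2} out of N=6 variables
neighbor = flip_neighbor(mask, step_size=2, rng=rng)
print(neighbor)
```

Because flipping an entry can add or remove a variable, a move of this kind changes the order of the n-plet as well as its members, which is what lets a single run explore several orders.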