entity_embed package¶
Subpackages¶
- entity_embed.benchmarks package
  - Submodules
    - entity_embed.benchmarks.abt_buy module
    - entity_embed.benchmarks.amazon_google module
    - entity_embed.benchmarks.base module
    - entity_embed.benchmarks.beer module
    - entity_embed.benchmarks.company module
    - entity_embed.benchmarks.dblp_acm_structured module
    - entity_embed.benchmarks.dblp_scholar_structured module
    - entity_embed.benchmarks.fodors_zagats module
    - entity_embed.benchmarks.itunes_amazon_structured module
    - entity_embed.benchmarks.walmart_amazon_structured module
  - Module contents
- entity_embed.data_utils package
Submodules¶
entity_embed.cli module¶
entity_embed.data_modules module¶
class entity_embed.data_modules.DeduplicationDataModule(*args, **kwargs)¶
Bases: pytorch_lightning.core.datamodule.LightningDataModule

setup(stage=None)¶
test_dataloader()¶
Implement one or multiple PyTorch DataLoaders for testing.
The dataloader you return will not be called every epoch unless you set :paramref:`~pytorch_lightning.trainer.Trainer.reload_dataloaders_every_epoch` to True.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning: do not assign state in prepare_data.
The following methods are called in this order:
- fit()
- …
- prepare_data()
- Note:
Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- Return:
Single or multiple PyTorch DataLoaders.
Example:
    def test_dataloader(self):
        transform = transforms.Compose([transforms.ToTensor(),
                                        transforms.Normalize((0.5,), (1.0,))])
        dataset = MNIST(root='/path/to/mnist/', train=False,
                        transform=transform, download=True)
        loader = torch.utils.data.DataLoader(
            dataset=dataset,
            batch_size=self.batch_size,
            shuffle=False
        )
        return loader

    # can also return multiple dataloaders
    def test_dataloader(self):
        return [loader_a, loader_b, ..., loader_n]
- Note:
If you don’t need a test dataset and a test_step(), you don’t need to implement this method.
- Note:
In the case where you return multiple test dataloaders, the test_step() will have an argument dataloader_idx which matches the order here.
train_dataloader()¶
Implement one or more PyTorch DataLoaders for training.
- Return:
Either a single PyTorch DataLoader or a collection of these (list, dict, nested lists and dicts). In the case of multiple dataloaders, please see this page.
The dataloader you return will not be called every epoch unless you set :paramref:`~pytorch_lightning.trainer.Trainer.reload_dataloaders_every_epoch` to True.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning: do not assign state in prepare_data.
The following methods are called in this order:
- fit()
- …
- prepare_data()
- Note:
Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
Example:
    # single dataloader
    def train_dataloader(self):
        transform = transforms.Compose([transforms.ToTensor(),
                                        transforms.Normalize((0.5,), (1.0,))])
        dataset = MNIST(root='/path/to/mnist/', train=True,
                        transform=transform, download=True)
        loader = torch.utils.data.DataLoader(
            dataset=dataset,
            batch_size=self.batch_size,
            shuffle=True
        )
        return loader

    # multiple dataloaders, return as list
    def train_dataloader(self):
        mnist = MNIST(...)
        cifar = CIFAR(...)
        mnist_loader = torch.utils.data.DataLoader(
            dataset=mnist,
            batch_size=self.batch_size,
            shuffle=True
        )
        cifar_loader = torch.utils.data.DataLoader(
            dataset=cifar,
            batch_size=self.batch_size,
            shuffle=True
        )
        # each batch will be a list of tensors: [batch_mnist, batch_cifar]
        return [mnist_loader, cifar_loader]

    # multiple dataloaders, return as dict
    def train_dataloader(self):
        mnist = MNIST(...)
        cifar = CIFAR(...)
        mnist_loader = torch.utils.data.DataLoader(
            dataset=mnist,
            batch_size=self.batch_size,
            shuffle=True
        )
        cifar_loader = torch.utils.data.DataLoader(
            dataset=cifar,
            batch_size=self.batch_size,
            shuffle=True
        )
        # each batch will be a dict of tensors: {'mnist': batch_mnist, 'cifar': batch_cifar}
        return {'mnist': mnist_loader, 'cifar': cifar_loader}
val_dataloader()¶
Implement one or multiple PyTorch DataLoaders for validation.
The dataloader you return will not be called every epoch unless you set :paramref:`~pytorch_lightning.trainer.Trainer.reload_dataloaders_every_epoch` to True.
It’s recommended that all data downloads and preparation happen in prepare_data().
The following methods are called in this order:
- fit()
- …
- prepare_data()
- Note:
Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- Return:
Single or multiple PyTorch DataLoaders.
Examples:
    def val_dataloader(self):
        transform = transforms.Compose([transforms.ToTensor(),
                                        transforms.Normalize((0.5,), (1.0,))])
        dataset = MNIST(root='/path/to/mnist/', train=False,
                        transform=transform, download=True)
        loader = torch.utils.data.DataLoader(
            dataset=dataset,
            batch_size=self.batch_size,
            shuffle=False
        )
        return loader

    # can also return multiple dataloaders
    def val_dataloader(self):
        return [loader_a, loader_b, ..., loader_n]
- Note:
If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.
- Note:
In the case where you return multiple validation dataloaders, the validation_step() will have an argument dataloader_idx which matches the order here.
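A data module like the one above is driven by a pytorch_lightning.Trainer in the usual way. The sketch below is only illustrative: since the documented signature exposes just *args and **kwargs, every constructor keyword shown here is a hypothetical placeholder.

    # Hedged sketch: all DeduplicationDataModule keyword arguments below are hypothetical.
    import pytorch_lightning as pl
    from entity_embed.data_modules import DeduplicationDataModule

    model = ...  # an EntityEmbed instance (see entity_embed.entity_embed below)
    datamodule = DeduplicationDataModule(
        record_numericalizer=...,  # hypothetical: numericalizer built with entity_embed's data utilities
        batch_size=32,             # hypothetical keyword
        train_record_dict=...,     # hypothetical keyword
        valid_record_dict=...,     # hypothetical keyword
        test_record_dict=...,      # hypothetical keyword
    )

    trainer = pl.Trainer(max_epochs=100)
    trainer.fit(model, datamodule=datamodule)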
class entity_embed.data_modules.LinkageDataModule(*args, **kwargs)¶
Bases: pytorch_lightning.core.datamodule.LightningDataModule

setup(stage=None)¶
test_dataloader()¶
Implement one or multiple PyTorch DataLoaders for testing.
The dataloader you return will not be called every epoch unless you set :paramref:`~pytorch_lightning.trainer.Trainer.reload_dataloaders_every_epoch` to True.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning: do not assign state in prepare_data.
The following methods are called in this order:
- fit()
- …
- prepare_data()
- Note:
Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- Return:
Single or multiple PyTorch DataLoaders.
Example:
    def test_dataloader(self):
        transform = transforms.Compose([transforms.ToTensor(),
                                        transforms.Normalize((0.5,), (1.0,))])
        dataset = MNIST(root='/path/to/mnist/', train=False,
                        transform=transform, download=True)
        loader = torch.utils.data.DataLoader(
            dataset=dataset,
            batch_size=self.batch_size,
            shuffle=False
        )
        return loader

    # can also return multiple dataloaders
    def test_dataloader(self):
        return [loader_a, loader_b, ..., loader_n]
- Note:
If you don’t need a test dataset and a test_step(), you don’t need to implement this method.
- Note:
In the case where you return multiple test dataloaders, the test_step() will have an argument dataloader_idx which matches the order here.
train_dataloader()¶
Implement one or more PyTorch DataLoaders for training.
- Return:
Either a single PyTorch DataLoader or a collection of these (list, dict, nested lists and dicts). In the case of multiple dataloaders, please see this page.
The dataloader you return will not be called every epoch unless you set :paramref:`~pytorch_lightning.trainer.Trainer.reload_dataloaders_every_epoch` to True.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning: do not assign state in prepare_data.
The following methods are called in this order:
- fit()
- …
- prepare_data()
- Note:
Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
Example:
    # single dataloader
    def train_dataloader(self):
        transform = transforms.Compose([transforms.ToTensor(),
                                        transforms.Normalize((0.5,), (1.0,))])
        dataset = MNIST(root='/path/to/mnist/', train=True,
                        transform=transform, download=True)
        loader = torch.utils.data.DataLoader(
            dataset=dataset,
            batch_size=self.batch_size,
            shuffle=True
        )
        return loader

    # multiple dataloaders, return as list
    def train_dataloader(self):
        mnist = MNIST(...)
        cifar = CIFAR(...)
        mnist_loader = torch.utils.data.DataLoader(
            dataset=mnist,
            batch_size=self.batch_size,
            shuffle=True
        )
        cifar_loader = torch.utils.data.DataLoader(
            dataset=cifar,
            batch_size=self.batch_size,
            shuffle=True
        )
        # each batch will be a list of tensors: [batch_mnist, batch_cifar]
        return [mnist_loader, cifar_loader]

    # multiple dataloaders, return as dict
    def train_dataloader(self):
        mnist = MNIST(...)
        cifar = CIFAR(...)
        mnist_loader = torch.utils.data.DataLoader(
            dataset=mnist,
            batch_size=self.batch_size,
            shuffle=True
        )
        cifar_loader = torch.utils.data.DataLoader(
            dataset=cifar,
            batch_size=self.batch_size,
            shuffle=True
        )
        # each batch will be a dict of tensors: {'mnist': batch_mnist, 'cifar': batch_cifar}
        return {'mnist': mnist_loader, 'cifar': cifar_loader}
val_dataloader()¶
Implement one or multiple PyTorch DataLoaders for validation.
The dataloader you return will not be called every epoch unless you set :paramref:`~pytorch_lightning.trainer.Trainer.reload_dataloaders_every_epoch` to True.
It’s recommended that all data downloads and preparation happen in prepare_data().
The following methods are called in this order:
- fit()
- …
- prepare_data()
- Note:
Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- Return:
Single or multiple PyTorch DataLoaders.
Examples:
    def val_dataloader(self):
        transform = transforms.Compose([transforms.ToTensor(),
                                        transforms.Normalize((0.5,), (1.0,))])
        dataset = MNIST(root='/path/to/mnist/', train=False,
                        transform=transform, download=True)
        loader = torch.utils.data.DataLoader(
            dataset=dataset,
            batch_size=self.batch_size,
            shuffle=False
        )
        return loader

    # can also return multiple dataloaders
    def val_dataloader(self):
        return [loader_a, loader_b, ..., loader_n]
- Note:
If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.
- Note:
In the case where you return multiple validation dataloaders, the validation_step() will have an argument dataloader_idx which matches the order here.
entity_embed.early_stopping module¶
class entity_embed.early_stopping.EarlyStoppingMinEpochs(min_epochs, monitor, patience, mode, min_delta=0.0, verbose=False, strict=True)¶
Bases: pytorch_lightning.callbacks.early_stopping.EarlyStopping

on_validation_end(trainer, pl_module)¶
Called when the validation loop ends.

class entity_embed.early_stopping.ModelCheckpointMinEpochs(min_epochs, monitor, mode, dirpath=None, filename=None, verbose=False, save_last=None, save_top_k=None, save_weights_only=False, period=1, prefix='')¶
Bases: pytorch_lightning.callbacks.model_checkpoint.ModelCheckpoint

on_validation_end(trainer, pl_module)¶
Checkpoints can be saved at the end of the validation loop.
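Judging by the signatures above, both callbacks mirror their PyTorch Lightning base classes with an extra min_epochs argument, so they can be passed to a Trainer like any other callback. A minimal sketch, assuming a hypothetical monitored metric name:

    import pytorch_lightning as pl
    from entity_embed.early_stopping import EarlyStoppingMinEpochs, ModelCheckpointMinEpochs

    MONITOR = "valid_recall_at_0.3"  # hypothetical metric name; use whatever your module logs

    early_stopping = EarlyStoppingMinEpochs(
        min_epochs=5,   # do not stop before this many epochs
        monitor=MONITOR,
        patience=20,
        mode="max",
    )
    checkpoint = ModelCheckpointMinEpochs(
        min_epochs=5,
        monitor=MONITOR,
        mode="max",
        dirpath="checkpoints/",  # hypothetical path
        save_top_k=1,
    )

    trainer = pl.Trainer(max_epochs=100, callbacks=[early_stopping, checkpoint])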
entity_embed.entity_embed module¶
class entity_embed.entity_embed.EntityEmbed(record_numericalizer, embedding_size=300, loss_cls=<class 'pytorch_metric_learning.losses.supcon_loss.SupConLoss'>, loss_kwargs=None, miner_cls=None, miner_kwargs=None, optimizer_cls=<class 'torch.optim.adam.Adam'>, learning_rate=0.001, optimizer_kwargs=None, ann_k=10, sim_threshold_list=[0.3, 0.5, 0.7], index_build_kwargs=None, index_search_kwargs=None, **kwargs)¶
Bases: entity_embed.entity_embed._BaseEmbed

predict_pairs(record_dict, batch_size, ann_k, sim_threshold, loader_kwargs=None, index_build_kwargs=None, index_search_kwargs=None, show_progress=True, return_field_embeddings=False)¶

training: bool¶
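A hedged sketch of how predict_pairs might be called after training, using only the documented signature. The record_numericalizer and the record contents are placeholders, and the assumption that the return value is a set of matching id pairs is not confirmed by this page:

    from entity_embed.entity_embed import EntityEmbed

    record_numericalizer = ...  # placeholder: built with entity_embed's data utilities (not shown)
    record_dict = {             # placeholder records keyed by id
        1: {"id": 1, "title": "Dell Inspiron 15", "price": "499"},
        2: {"id": 2, "title": "Dell Inspiron 15 laptop", "price": "489.99"},
    }

    model = EntityEmbed(record_numericalizer=record_numericalizer)
    # ... train `model` with a pytorch_lightning.Trainer and a DeduplicationDataModule ...

    found_pairs = model.predict_pairs(
        record_dict=record_dict,
        batch_size=64,
        ann_k=10,
        sim_threshold=0.5,
    )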
class entity_embed.entity_embed.LinkageEmbed(record_numericalizer, source_field, left_source, **kwargs)¶
Bases: entity_embed.entity_embed._BaseEmbed

predict(record_dict, batch_size, loader_kwargs=None, show_progress=True, return_field_embeddings=False)¶
Use this function with trainer.predict(…). Override if you need to add any processing logic.

predict_pairs(record_dict, batch_size, ann_k, sim_threshold, loader_kwargs=None, index_build_kwargs=None, index_search_kwargs=None, show_progress=True, return_field_embeddings=False)¶

training: bool¶
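For record linkage across two datasets, the documented constructor takes a source_field and a left_source in addition to the record_numericalizer. A hedged sketch; the field name and source values are hypothetical:

    from entity_embed.entity_embed import LinkageEmbed

    record_numericalizer = ...  # placeholder
    record_dict = ...           # placeholder: records from both datasets, keyed by id

    model = LinkageEmbed(
        record_numericalizer=record_numericalizer,
        source_field="__source",  # hypothetical field telling which dataset a record came from
        left_source="amazon",     # hypothetical value of source_field for the "left" dataset
    )
    # ... train, then block candidate pairs across the two sources ...
    found_pairs = model.predict_pairs(
        record_dict=record_dict,
        batch_size=64,
        ann_k=10,
        sim_threshold=0.5,
    )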
entity_embed.evaluation module¶
entity_embed.evaluation.evaluate_output_json(unlabeled_csv_filepath, output_json_filepath, pos_pair_json_filepath, csv_encoding='utf-8')¶

entity_embed.evaluation.f1_score(precision, recall)¶

entity_embed.evaluation.pair_entity_ratio(found_pair_set_len, entity_count)¶

entity_embed.evaluation.precision_and_recall(found_pair_set, pos_pair_set, neg_pair_set=None)¶
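A short sketch of the evaluation helpers, assuming they follow the standard set-based definitions their names and signatures suggest (precision/recall over pair sets, F1 as the harmonic mean, and the ratio of candidate pairs to entities):

    from entity_embed.evaluation import f1_score, pair_entity_ratio, precision_and_recall

    # Toy pair sets: each pair holds the ids of two records predicted (or known) to match.
    found_pair_set = {(1, 2), (3, 4), (5, 6)}
    pos_pair_set = {(1, 2), (3, 4), (7, 8)}

    precision, recall = precision_and_recall(found_pair_set, pos_pair_set)
    print(f1_score(precision, recall))                # harmonic mean of precision and recall
    print(pair_entity_ratio(len(found_pair_set), 8))  # candidate pairs per entity, for 8 entities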
entity_embed.helpers module¶
entity_embed.helpers.build_index_build_kwargs(kwargs_dict=None)¶

entity_embed.helpers.build_index_search_kwargs(kwargs_dict=None)¶

entity_embed.helpers.build_loader_kwargs(kwargs_dict=None)¶
entity_embed.indexes module¶
entity_embed.models module¶
class entity_embed.models.BlockerNet(field_config_dict, embedding_size=300)¶
Bases: torch.nn.modules.module.Module
fix_pool_weights()¶
Force the pool weights to lie between 0 and 1 and to sum to 1 (a standalone sketch of such a projection follows this class entry).
forward(tensor_dict, sequence_length_dict, return_field_embeddings=False)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note: although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
get_pool_weights()¶

training: bool¶
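fix_pool_weights, per its docstring, projects the learnable pool weights back into [0, 1] with unit sum. The standalone sketch below shows one way such a projection can be written; it is an illustration of the constraint, not necessarily the exact code used by BlockerNet:

    import torch

    def project_pool_weights(weights: torch.Tensor) -> torch.Tensor:
        """Clamp weights to [0, 1] and renormalize them so they sum to 1."""
        with torch.no_grad():
            clamped = weights.clamp(min=0.0, max=1.0)
            return clamped / clamped.sum()

    w = torch.tensor([1.3, -0.2, 0.5, 0.4])
    print(project_pool_weights(w))  # ≈ tensor([0.5263, 0.0000, 0.2632, 0.2105])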
class entity_embed.models.EntityAvgPoolNet(field_config_dict, embedding_size)¶
Bases: torch.nn.modules.module.Module
forward(field_embedding_dict, sequence_length_dict)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note: although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool¶
class entity_embed.models.FieldsEmbedNet(field_config_dict, embedding_size)¶
Bases: torch.nn.modules.module.Module
forward(tensor_dict, sequence_length_dict)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note: although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool¶
class entity_embed.models.MaskedAttention(embedding_size)¶
Bases: torch.nn.modules.module.Module
PyTorch nn.Module implementing an attention mechanism for weighted averaging of the hidden states produced by an RNN. Based on mechanisms discussed in “Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm (EMNLP 17)” (code at https://github.com/huggingface/torchMoji) and “AutoBlock: A Hands-off Blocking Framework for Entity Matching (WSDM 20)”. A generic sketch of the mechanism follows this class entry.
forward(h, x, sequence_lengths, **kwargs)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note: although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool¶
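The class description above refers to attention-based weighted averaging of RNN hidden states. The sketch below illustrates the general mechanism (score each timestep, mask out padding, softmax, weighted sum); it is a generic example, not necessarily this module's exact implementation:

    import torch
    import torch.nn as nn

    class SimpleMaskedAttention(nn.Module):
        """Generic masked attention: score each timestep, mask padding, softmax, weighted sum."""
        def __init__(self, embedding_size: int):
            super().__init__()
            self.scorer = nn.Linear(embedding_size, 1, bias=False)

        def forward(self, h: torch.Tensor, sequence_lengths: torch.Tensor) -> torch.Tensor:
            # h: (batch, seq_len, embedding_size); sequence_lengths: (batch,)
            scores = self.scorer(h).squeeze(-1)                       # (batch, seq_len)
            positions = torch.arange(h.size(1), device=h.device)
            mask = positions.unsqueeze(0) < sequence_lengths.unsqueeze(1)
            scores = scores.masked_fill(~mask, float("-inf"))
            weights = torch.softmax(scores, dim=-1)                   # zero weight on padding
            return (weights.unsqueeze(-1) * h).sum(dim=1)             # (batch, embedding_size)

    h = torch.randn(2, 5, 8)
    lengths = torch.tensor([5, 3])
    print(SimpleMaskedAttention(8)(h, lengths).shape)  # torch.Size([2, 8])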
class entity_embed.models.MultitokenAttentionEmbed(embed_net)¶
Bases: torch.nn.modules.module.Module
forward(x, sequence_lengths, **kwargs)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note: although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool¶
class entity_embed.models.MultitokenAvgEmbed(embed_net)¶
Bases: torch.nn.modules.module.Module
forward(x, sequence_lengths, **kwargs)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note: although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool¶
class entity_embed.models.SemanticEmbedNet(field_config, embedding_size)¶
Bases: torch.nn.modules.module.Module
forward(x, **kwargs)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note: although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool¶
class entity_embed.models.StringEmbedCNN(field_config, embedding_size)¶
Bases: torch.nn.modules.module.Module
PyTorch nn.Module for embedding strings for fast edit distance computation, based on “Convolutional Embedding for Edit Distance (SIGIR 20)” (code: https://github.com/xinyandai/string-embed). A generic sketch of this idea follows this class entry.
The tensor shape expected here is produced by StringNumericalizer.
forward(x, **kwargs)¶
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note: although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool¶
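The cited paper's idea is to run a 1-D CNN over character-encoded strings so that distances between the resulting embeddings approximate edit distance. The sketch below is a generic illustration of that idea under an assumed one-hot input of shape (batch, alphabet_size, max_str_len); it is not this class's actual layer layout:

    import torch
    import torch.nn as nn

    class CharCNNEmbed(nn.Module):
        """Generic character-CNN string embedder: conv over character positions, then a dense vector."""
        def __init__(self, alphabet_size: int, max_str_len: int, embedding_size: int):
            super().__init__()
            self.conv = nn.Conv1d(alphabet_size, 64, kernel_size=3, padding=1)
            self.fc = nn.Linear(64 * max_str_len, embedding_size)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, alphabet_size, max_str_len), one-hot encoded characters
            h = torch.relu(self.conv(x))             # (batch, 64, max_str_len)
            return self.fc(h.flatten(start_dim=1))   # (batch, embedding_size)

    x = torch.zeros(2, 36, 30)  # batch of 2 strings, 36-character alphabet, padded to length 30
    print(CharCNNEmbed(36, 30, 300)(x).shape)  # torch.Size([2, 300])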
Module contents¶
Top-level package for entity-embed.