entity_embed package

Submodules

entity_embed.cli module

entity_embed.data_modules module

class entity_embed.data_modules.DeduplicationDataModule(*args, **kwargs)

Bases: pytorch_lightning.core.datamodule.LightningDataModule

setup(stage=None)
test_dataloader()

Implement one or more PyTorch DataLoaders for testing.

The dataloader you return will not be called every epoch unless you set the Trainer argument reload_dataloaders_every_epoch to True.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning:

Do not assign state in prepare_data().

Note:

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Return:

Single or multiple PyTorch DataLoaders.

Example:

def test_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=False, transform=transform,
                    download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=False
    )

    return loader

# can also return multiple dataloaders
def test_dataloader(self):
    return [loader_a, loader_b, ..., loader_n]

Note:

If you don’t need a test dataset and a test_step(), you don’t need to implement this method.

Note:

In the case where you return multiple test dataloaders, the test_step() will have an argument dataloader_idx which matches the order here.

train_dataloader()

Implement one or more PyTorch DataLoaders for training.

Return:

Either a single PyTorch DataLoader or a collection of these (list, dict, nested lists and dicts). In the case of multiple dataloaders, please see the PyTorch Lightning documentation on multiple dataloaders.

The dataloader you return will not be called every epoch unless you set the Trainer argument reload_dataloaders_every_epoch to True.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning:

Do not assign state in prepare_data().

Note:

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Example:

# single dataloader
def train_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=True, transform=transform,
                    download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=True
    )
    return loader

# multiple dataloaders, return as list
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a list of tensors: [batch_mnist, batch_cifar]
    return [mnist_loader, cifar_loader]

# multiple dataloaders, return as dict
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a dict of tensors: {'mnist': batch_mnist, 'cifar': batch_cifar}
    return {'mnist': mnist_loader, 'cifar': cifar_loader}

val_dataloader()

Implement one or more PyTorch DataLoaders for validation.

The dataloader you return will not be called every epoch unless you set the Trainer argument reload_dataloaders_every_epoch to True.

It’s recommended that all data downloads and preparation happen in prepare_data().

Note:

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Return:

Single or multiple PyTorch DataLoaders.

Examples:

def val_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=False,
                    transform=transform, download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=False
    )

    return loader

# can also return multiple dataloaders
def val_dataloader(self):
    return [loader_a, loader_b, ..., loader_n]

Note:

If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.

Note:

In the case where you return multiple validation dataloaders, the validation_step() will have an argument dataloader_idx which matches the order here.

class entity_embed.data_modules.LinkageDataModule(*args, **kwargs)

Bases: pytorch_lightning.core.datamodule.LightningDataModule

setup(stage=None)
test_dataloader()

Implement one or more PyTorch DataLoaders for testing.

The dataloader you return will not be called every epoch unless you set the Trainer argument reload_dataloaders_every_epoch to True.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning:

Do not assign state in prepare_data().

Note:

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Return:

Single or multiple PyTorch DataLoaders.

Example:

def test_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=False, transform=transform,
                    download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=False
    )

    return loader

# can also return multiple dataloaders
def test_dataloader(self):
    return [loader_a, loader_b, ..., loader_n]

Note:

If you don’t need a test dataset and a test_step(), you don’t need to implement this method.

Note:

In the case where you return multiple test dataloaders, the test_step() will have an argument dataloader_idx which matches the order here.

train_dataloader()

Implement one or more PyTorch DataLoaders for training.

Return:

Either a single PyTorch DataLoader or a collection of these (list, dict, nested lists and dicts). In the case of multiple dataloaders, please see the PyTorch Lightning documentation on multiple dataloaders.

The dataloader you return will not be called every epoch unless you set the Trainer argument reload_dataloaders_every_epoch to True.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning:

Do not assign state in prepare_data().

Note:

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Example:

# single dataloader
def train_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=True, transform=transform,
                    download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=True
    )
    return loader

# multiple dataloaders, return as list
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a list of tensors: [batch_mnist, batch_cifar]
    return [mnist_loader, cifar_loader]

# multiple dataloaders, return as dict
def train_dataloader(self):
    mnist = MNIST(...)
    cifar = CIFAR(...)
    mnist_loader = torch.utils.data.DataLoader(
        dataset=mnist, batch_size=self.batch_size, shuffle=True
    )
    cifar_loader = torch.utils.data.DataLoader(
        dataset=cifar, batch_size=self.batch_size, shuffle=True
    )
    # each batch will be a dict of tensors: {'mnist': batch_mnist, 'cifar': batch_cifar}
    return {'mnist': mnist_loader, 'cifar': cifar_loader}

val_dataloader()

Implement one or more PyTorch DataLoaders for validation.

The dataloader you return will not be called every epoch unless you set the Trainer argument reload_dataloaders_every_epoch to True.

It’s recommended that all data downloads and preparation happen in prepare_data().

Note:

Lightning adds the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Return:

Single or multiple PyTorch DataLoaders.

Examples:

def val_dataloader(self):
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize((0.5,), (1.0,))])
    dataset = MNIST(root='/path/to/mnist/', train=False,
                    transform=transform, download=True)
    loader = torch.utils.data.DataLoader(
        dataset=dataset,
        batch_size=self.batch_size,
        shuffle=False
    )

    return loader

# can also return multiple dataloaders
def val_dataloader(self):
    return [loader_a, loader_b, ..., loader_n]

Note:

If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.

Note:

In the case where you return multiple validation dataloaders, the validation_step() will have an argument dataloader_idx which matches the order here.

entity_embed.early_stopping module

class entity_embed.early_stopping.EarlyStoppingMinEpochs(min_epochs, monitor, patience, mode, min_delta=0.0, verbose=False, strict=True)

Bases: pytorch_lightning.callbacks.early_stopping.EarlyStopping

on_validation_end(trainer, pl_module)

Called when the validation loop ends.

class entity_embed.early_stopping.ModelCheckpointMinEpochs(min_epochs, monitor, mode, dirpath=None, filename=None, verbose=False, save_last=None, save_top_k=None, save_weights_only=False, period=1, prefix='')

Bases: pytorch_lightning.callbacks.model_checkpoint.ModelCheckpoint

on_validation_end(trainer, pl_module)

Checkpoints can be saved at the end of the validation loop.
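
These callbacks follow the signatures shown above, adding a min_epochs argument to the standard PyTorch Lightning callbacks. Below is a minimal sketch of wiring them into a Trainer; the monitored metric name and the numeric values are illustrative assumptions, not values prescribed by the package:

import pytorch_lightning as pl
from entity_embed.early_stopping import EarlyStoppingMinEpochs, ModelCheckpointMinEpochs

# Assumed metric name; use whichever metric your LightningModule logs.
monitor = "valid_recall_at_0.3"

early_stopping = EarlyStoppingMinEpochs(
    min_epochs=5,  # never stop before this many epochs
    monitor=monitor,
    patience=20,
    mode="max",
)
checkpoint = ModelCheckpointMinEpochs(
    min_epochs=5,  # never checkpoint before this many epochs
    monitor=monitor,
    mode="max",
    dirpath="checkpoints/",
    verbose=True,
)
trainer = pl.Trainer(max_epochs=100, callbacks=[early_stopping, checkpoint])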

entity_embed.entity_embed module

class entity_embed.entity_embed.EntityEmbed(record_numericalizer, embedding_size=300, loss_cls=<class 'pytorch_metric_learning.losses.supcon_loss.SupConLoss'>, loss_kwargs=None, miner_cls=None, miner_kwargs=None, optimizer_cls=<class 'torch.optim.adam.Adam'>, learning_rate=0.001, optimizer_kwargs=None, ann_k=10, sim_threshold_list=[0.3, 0.5, 0.7], index_build_kwargs=None, index_search_kwargs=None, **kwargs)

Bases: entity_embed.entity_embed._BaseEmbed

predict_pairs(record_dict, batch_size, ann_k, sim_threshold, loader_kwargs=None, index_build_kwargs=None, index_search_kwargs=None, show_progress=True, return_field_embeddings=False)
training: bool
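
A minimal deduplication sketch, assuming a record_numericalizer (built with the package's field-configuration tooling), a DeduplicationDataModule instance, and a record_dict mapping record ids to field dicts are all prepared elsewhere; the batch_size, ann_k, and sim_threshold values are illustrative:

import pytorch_lightning as pl

from entity_embed.entity_embed import EntityEmbed

def dedup_pairs(record_numericalizer, datamodule, record_dict):
    # Train a deduplication model, then return candidate duplicate pairs.
    # `datamodule` is assumed to be a DeduplicationDataModule instance.
    model = EntityEmbed(record_numericalizer)
    trainer = pl.Trainer(max_epochs=100)
    trainer.fit(model, datamodule=datamodule)
    return model.predict_pairs(
        record_dict=record_dict,
        batch_size=32,
        ann_k=10,
        sim_threshold=0.5,
    )
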
class entity_embed.entity_embed.LinkageEmbed(record_numericalizer, source_field, left_source, **kwargs)

Bases: entity_embed.entity_embed._BaseEmbed

predict(record_dict, batch_size, loader_kwargs=None, show_progress=True, return_field_embeddings=False)

Use this function with trainer.predict(…). Override if you need to add any processing logic.

predict_pairs(record_dict, batch_size, ann_k, sim_threshold, loader_kwargs=None, index_build_kwargs=None, index_search_kwargs=None, show_progress=True, return_field_embeddings=False)
training: bool
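
For record linkage across two datasets, the constructor additionally takes the name of the field that identifies a record's source dataset and the value that marks the "left" dataset. A short sketch with illustrative values, assuming the record_numericalizer is built elsewhere:

from entity_embed.entity_embed import LinkageEmbed

def build_linkage_model(record_numericalizer):
    # "source" and "left" are illustrative: the field that distinguishes the
    # two datasets and the value used for the left-hand dataset.
    return LinkageEmbed(
        record_numericalizer,
        source_field="source",
        left_source="left",
    )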

entity_embed.evaluation module

entity_embed.evaluation.evaluate_output_json(unlabeled_csv_filepath, output_json_filepath, pos_pair_json_filepath, csv_encoding='utf-8')
entity_embed.evaluation.f1_score(precision, recall)
entity_embed.evaluation.pair_entity_ratio(found_pair_set_len, entity_count)
entity_embed.evaluation.precision_and_recall(found_pair_set, pos_pair_set, neg_pair_set=None)
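
A toy sketch combining these metrics; the pair sets below are made up, and it is assumed here that precision_and_recall returns a (precision, recall) tuple of floats:

from entity_embed.evaluation import f1_score, pair_entity_ratio, precision_and_recall

found_pair_set = {(1, 2), (3, 4), (5, 6)}
pos_pair_set = {(1, 2), (3, 4), (7, 8)}

# Assumed to return a (precision, recall) tuple.
precision, recall = precision_and_recall(found_pair_set, pos_pair_set)
print(f1_score(precision, recall))
print(pair_entity_ratio(len(found_pair_set), entity_count=8))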

entity_embed.helpers module

entity_embed.helpers.build_index_build_kwargs(kwargs_dict=None)
entity_embed.helpers.build_index_search_kwargs(kwargs_dict=None)
entity_embed.helpers.build_loader_kwargs(kwargs_dict=None)

entity_embed.indexes module

class entity_embed.indexes.ANNEntityIndex(embedding_size)

Bases: object

build(index_build_kwargs=None)
insert_vector_dict(vector_dict)
search_pairs(k, sim_threshold, index_search_kwargs=None)
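
A small sketch of the assumed workflow for this index: insert embedding vectors keyed by record id, build the index, then search for candidate pairs. The random vectors stand in for embeddings produced by a trained model:

import numpy as np

from entity_embed.indexes import ANNEntityIndex

embedding_size = 300
index = ANNEntityIndex(embedding_size=embedding_size)

# Toy vectors; in practice these are embeddings from a trained model.
vector_dict = {
    record_id: np.random.rand(embedding_size).astype(np.float32)
    for record_id in range(100)
}
index.insert_vector_dict(vector_dict)
index.build()

found_pair_set = index.search_pairs(k=10, sim_threshold=0.5)
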
class entity_embed.indexes.ANNLinkageIndex(embedding_size)

Bases: object

build(index_build_kwargs=None)
insert_vector_dict(left_vector_dict, right_vector_dict)
search_pairs(k, sim_threshold, left_vector_dict, right_vector_dict, left_source, index_search_kwargs=None)
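
The linkage variant works on two vector dicts, one per dataset. A sketch under the same insert-then-build assumption; the left_source label is illustrative:

from entity_embed.indexes import ANNLinkageIndex

def link_pairs(left_vector_dict, right_vector_dict, embedding_size=300):
    # Vector dicts (record id -> embedding vector) are assumed to come from a
    # trained LinkageEmbed model.
    index = ANNLinkageIndex(embedding_size=embedding_size)
    index.insert_vector_dict(left_vector_dict, right_vector_dict)
    index.build()
    return index.search_pairs(
        k=10,
        sim_threshold=0.5,
        left_vector_dict=left_vector_dict,
        right_vector_dict=right_vector_dict,
        left_source="left",  # illustrative source label
    )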

entity_embed.models module

class entity_embed.models.BlockerNet(field_config_dict, embedding_size=300)

Bases: torch.nn.modules.module.Module

fix_pool_weights()

Force the pool weights to lie between 0 and 1 and to sum to 1.

forward(tensor_dict, sequence_length_dict, return_field_embeddings=False)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_pool_weights()
training: bool
class entity_embed.models.EntityAvgPoolNet(field_config_dict, embedding_size)

Bases: torch.nn.modules.module.Module

forward(field_embedding_dict, sequence_length_dict)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
class entity_embed.models.FieldsEmbedNet(field_config_dict, embedding_size)

Bases: torch.nn.modules.module.Module

forward(tensor_dict, sequence_length_dict)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
class entity_embed.models.MaskedAttention(embedding_size)

Bases: torch.nn.modules.module.Module

PyTorch nn.Module of an attention mechanism for weighted averaging of hidden states produced by an RNN. Based on mechanisms discussed in “Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm (EMNLP 17)” (code at https://github.com/huggingface/torchMoji) and “AutoBlock: A Hands-off Blocking Framework for Entity Matching (WSDM 20)”.

forward(h, x, sequence_lengths, **kwargs)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
class entity_embed.models.MultitokenAttentionEmbed(embed_net)

Bases: torch.nn.modules.module.Module

forward(x, sequence_lengths, **kwargs)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
class entity_embed.models.MultitokenAvgEmbed(embed_net)

Bases: torch.nn.modules.module.Module

forward(x, sequence_lengths, **kwargs)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
class entity_embed.models.SemanticEmbedNet(field_config, embedding_size)

Bases: torch.nn.modules.module.Module

forward(x, **kwargs)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool
class entity_embed.models.StringEmbedCNN(field_config, embedding_size)

Bases: torch.nn.modules.module.Module

PyTorch nn.Module for embedding strings for fast edit distance computation, based on “Convolutional Embedding for Edit Distance (SIGIR 20)” (code: https://github.com/xinyandai/string-embed).

The tensor shape expected here is produced by StringNumericalizer.

forward(x, **kwargs)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool

Module contents

Top-level package for entity-embed.