Command Line Interface

Commands

After installing entity-embed with pip, the commands entity_embed_train and entity_embed_predict are available.
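
If you have not installed the package yet, a typical install looks like this (assuming the package is published on PyPI under the name entity-embed; adjust for your environment):

$ pip install entity-embed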

Train Options

To check all the CLI options for train, run:

$ entity_embed_train --help

The mandatory ones are:

--field_config_json
    Path of the JSON configuration file that defines how fields will be processed by the neural network. Check Field Types.

--train_csv
    Path of the TRAIN input dataset CSV file.

--valid_csv
    Path of the VALID input dataset CSV file.

--test_csv
    Path of the TEST input dataset CSV file.

--unlabeled_csv
    Path of the UNLABELED input dataset CSV file.

--cluster_field
    Column of the CSV dataset that contains the true cluster assignment. Equivalent to the label in tabular classification. The train_csv, valid_csv, and test_csv files MUST HAVE the cluster_field column. The unlabeled_csv file MUST NOT.

--batch_size
    Training batch size, in CLUSTERS.

--eval_batch_size
    Evaluation batch size, in RECORDS.

--model_save_dir
    Directory path where the best validation model checkpoint will be saved (via PyTorch Lightning).
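
For reference, here is a minimal train invocation that sets only the mandatory options above. The file paths and values are placeholders, not files shipped with the project; whether the remaining options have usable defaults depends on your installed version, so the full examples below set them explicitly.

$ entity_embed_train \
    --field_config_json my-field-config.json \
    --train_csv my-train.csv \
    --valid_csv my-valid.csv \
    --test_csv my-test.csv \
    --unlabeled_csv my-unlabeled.csv \
    --cluster_field cluster \
    --batch_size 32 \
    --eval_batch_size 64 \
    --model_save_dir my-trained-models/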

If you’re doing Record Linkage, there are other mandatory options:

--source_field
    Set this when doing Record Linkage. Column of the CSV dataset that indicates whether each record comes from the left or the right source.

--left_source
    Set this when doing Record Linkage. Any record with this value in the source_field column is treated as part of the left source dataset; records with any other source_field value are treated as part of the right source dataset.

Predict Options

To check all the CLI options for predict, run:

$ entity_embed_predict --help

The mandatory ones are:

--model_save_filepath
    Path of the saved model checkpoint. Get this from the entity_embed_train output.

--unlabeled_csv
    Path of the UNLABELED input dataset CSV file.

--eval_batch_size
    Evaluation batch size, in RECORDS.

--sim_threshold
    A SINGLE cosine similarity threshold to use when finding duplicates. Any ANN neighbor with cosine similarity BELOW this threshold is ignored.

--output_json
    Path of the output JSON file that will contain the candidate duplicate pairs. Remember Entity Embed is focused on recall; you must use a classifier afterwards to filter these candidates and find the best matching pairs.
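
Likewise, a minimal predict invocation that sets only the mandatory options (placeholder paths; the full examples below also set the ANN and worker options explicitly):

$ entity_embed_predict \
    --model_save_filepath my-trained-models/my-checkpoint.ckpt \
    --unlabeled_csv my-unlabeled.csv \
    --eval_batch_size 50 \
    --sim_threshold 0.3 \
    --output_json my-prediction.json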

If you’re doing Record Linkage, there are other mandatory options:

--source_field
    Set this when doing Record Linkage. Column of the CSV dataset that indicates whether each record comes from the left or the right source.

--left_source
    Set this when doing Record Linkage. Any record with this value in the source_field column is treated as part of the left source dataset; records with any other source_field value are treated as part of the right source dataset.

Examples

  1. Clone the entity-embed GitHub repo: git clone https://github.com/vintasoftware/entity-embed.git

  2. cd into it

  3. Check the example data in example-data/

  4. Back at the entity-embed root dir, run one of the following (steps 1 to 3 are also shown as shell commands right below):
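
A quick sketch of steps 1 to 3 (the cd target assumes the default clone directory name):

$ git clone https://github.com/vintasoftware/entity-embed.git
$ cd entity-embed
$ ls example-data/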

Deduplication Train

$ entity_embed_train \
    --field_config_json example-data/er-field-config.json \
    --train_csv example-data/er-train.csv \
    --valid_csv example-data/er-valid.csv \
    --test_csv example-data/er-test.csv \
    --unlabeled_csv example-data/er-unlabeled.csv \
    --csv_encoding utf-8 \
    --cluster_field cluster \
    --embedding_size 300 \
    --lr 0.001 \
    --min_epochs 5 \
    --max_epochs 100 \
    --early_stop_monitor valid_recall_at_0.3 \
    --early_stop_min_delta 0 \
    --early_stop_patience 20 \
    --early_stop_mode max \
    --tb_save_dir tb_logs \
    --tb_name er-example \
    --check_val_every_n_epoch 1 \
    --batch_size 32 \
    --eval_batch_size 64 \
    --num_workers -1 \
    --multiprocessing_context fork \
    --sim_threshold 0.3 \
    --sim_threshold 0.5 \
    --sim_threshold 0.7 \
    --ann_k 100 \
    --m 64 \
    --max_m0 64 \
    --ef_construction 150 \
    --ef_search -1 \
    --random_seed 42 \
    --model_save_dir trained-models/er/ \
    --use_gpu 1
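
Once training finishes, the checkpoint lands in the directory passed as --model_save_dir; list it to find the path you will pass as --model_save_filepath in the predict command below:

$ ls trained-models/er/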

Deduplication Predict

$ entity_embed_predict \
    --model_save_filepath "trained-models/er/...fill-here..." \
    --unlabeled_csv example-data/er-unlabeled.csv \
    --csv_encoding utf-8 \
    --eval_batch_size 50 \
    --num_workers -1 \
    --multiprocessing_context fork \
    --sim_threshold 0.3 \
    --ann_k 100 \
    --m 64 \
    --max_m0 64 \
    --ef_construction 150 \
    --ef_search -1 \
    --random_seed 42 \
    --output_json example-data/er-prediction.json \
    --use_gpu 1
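
To take a quick look at the candidate pairs written to the output file, Python's built-in json.tool can pretty-print it (a simple inspection sketch; the exact JSON structure may differ between versions):

$ python -m json.tool example-data/er-prediction.json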

Record Linkage Train

$ entity_embed_train \
    --field_config_json example-data/rl-field-config.json \
    --train_csv example-data/rl-train.csv \
    --valid_csv example-data/rl-valid.csv \
    --test_csv example-data/rl-test.csv \
    --unlabeled_csv example-data/rl-unlabeled.csv \
    --csv_encoding utf-8 \
    --cluster_field cluster \
    --source_field __source \
    --left_source amazon \
    --embedding_size 300 \
    --lr 0.001 \
    --min_epochs 5 \
    --max_epochs 100 \
    --early_stop_monitor valid_recall_at_0.3 \
    --early_stop_min_delta 0 \
    --early_stop_patience 20 \
    --early_stop_mode max \
    --tb_save_dir tb_logs \
    --tb_name rl-example \
    --check_val_every_n_epoch 1 \
    --batch_size 32 \
    --eval_batch_size 64 \
    --num_workers -1 \
    --multiprocessing_context fork \
    --sim_threshold 0.3 \
    --sim_threshold 0.5 \
    --sim_threshold 0.7 \
    --ann_k 100 \
    --m 64 \
    --max_m0 64 \
    --ef_construction 150 \
    --ef_search -1 \
    --random_seed 42 \
    --model_save_dir trained-models/rl/ \
    --use_gpu 1

Record Linkage Predict

$ entity_embed_predict \
    --model_save_filepath "trained-models/rl/...fill-here..." \
    --unlabeled_csv example-data/rl-unlabeled.csv \
    --csv_encoding utf-8 \
    --source_field __source \
    --left_source amazon \
    --eval_batch_size 50 \
    --num_workers -1 \
    --multiprocessing_context fork \
    --sim_threshold 0.3 \
    --ann_k 100 \
    --m 64 \
    --max_m0 64 \
    --ef_construction 150 \
    --ef_search -1 \
    --random_seed 42 \
    --output_json example-data/rl-prediction.json \
    --use_gpu 1