Command Line Interface

Commands

After installing entity-embed with pip, the commands entity_embed_train and entity_embed_predict are available.
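
If you have not installed the package yet, a typical install looks like this (assuming the package is published on PyPI under the name entity-embed; adjust for your environment):

$ pip install entity-embed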

Train Options

To check all the CLI options for train, run:

$ entity_embed_train --help

The mandatory ones are:

--field_config_json
    Path of the JSON configuration file that defines how fields will be processed by the neural network. Check Field Types.

--train_csv
    Path of the TRAIN input dataset CSV file.

--valid_csv
    Path of the VALID input dataset CSV file.

--test_csv
    Path of the TEST input dataset CSV file.

--unlabeled_csv
    Path of the UNLABELED input dataset CSV file.

--cluster_field
    Column of the CSV dataset that contains the true cluster assignment. Equivalent to the label in tabular classification. The train_csv, valid_csv, and test_csv files MUST HAVE the cluster_field column. The unlabeled_csv file MUST NOT.

--batch_size
    Training batch size, in CLUSTERS.

--eval_batch_size
    Evaluation batch size, in RECORDS.

--model_save_dir
    Directory path where the best validation model checkpoint will be saved (via PyTorch Lightning).
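
For reference, here is a minimal train invocation that sets only the mandatory options above. The file paths and values are placeholders, not files shipped with the project; whether the remaining options have usable defaults depends on your installed version, so the full examples below set them explicitly.

$ entity_embed_train \
    --field_config_json my-field-config.json \
    --train_csv my-train.csv \
    --valid_csv my-valid.csv \
    --test_csv my-test.csv \
    --unlabeled_csv my-unlabeled.csv \
    --cluster_field cluster \
    --batch_size 32 \
    --eval_batch_size 64 \
    --model_save_dir my-trained-models/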

If you’re doing Record Linkage, there are other mandatory options:

--source_field
    Set this when doing Record Linkage. Column of the CSV dataset that indicates whether each record comes from the left or the right source.

--left_source
    Set this when doing Record Linkage. Any record with this value in the source_field column is treated as part of the left source dataset; records with any other source_field value are treated as part of the right source dataset.

Predict Options

To check all the CLI options for predict, run:

$ entity_embed_predict --help

The mandatory ones are:

--model_save_filepath
    Path of the saved model checkpoint. Get this from the entity_embed_train output.

--unlabeled_csv
    Path of the UNLABELED input dataset CSV file.

--eval_batch_size
    Evaluation batch size, in RECORDS.

--sim_threshold
    A SINGLE cosine similarity threshold to use when finding duplicates. Any ANN neighbor with cosine similarity BELOW this threshold is ignored.

--output_json
    Path of the output JSON file that will contain the candidate duplicate pairs. Remember Entity Embed is focused on recall; you must use a classifier afterwards to filter these candidates and find the best matching pairs.
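
Likewise, a minimal predict invocation that sets only the mandatory options (placeholder paths; the full examples below also set the ANN and worker options explicitly):

$ entity_embed_predict \
    --model_save_filepath my-trained-models/my-checkpoint.ckpt \
    --unlabeled_csv my-unlabeled.csv \
    --eval_batch_size 50 \
    --sim_threshold 0.3 \
    --output_json my-prediction.json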

If you’re doing Record Linkage, there are other mandatory options:

--source_field
    Set this when doing Record Linkage. Column of the CSV dataset that indicates whether each record comes from the left or the right source.

--left_source
    Set this when doing Record Linkage. Any record with this value in the source_field column is treated as part of the left source dataset; records with any other source_field value are treated as part of the right source dataset.

Examples

  1. Clone the entity-embed GitHub repo: git clone https://github.com/vintasoftware/entity-embed.git

  2. cd into it

  3. Check the example data in example-data/

  4. Back at the entity-embed root dir, run one of the following (steps 1 to 3 are also shown as shell commands right below):
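
A quick sketch of steps 1 to 3 (the cd target assumes the default clone directory name):

$ git clone https://github.com/vintasoftware/entity-embed.git
$ cd entity-embed
$ ls example-data/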

Deduplication Train

$ entity_embed_train \
    --field_config_json example-data/er-field-config.json \
    --train_csv example-data/er-train.csv \
    --valid_csv example-data/er-valid.csv \
    --test_csv example-data/er-test.csv \
    --unlabeled_csv example-data/er-unlabeled.csv \
    --csv_encoding utf-8 \
    --cluster_field cluster \
    --embedding_size 300 \
    --lr 0.001 \
    --min_epochs 5 \
    --max_epochs 100 \
    --early_stop_monitor valid_recall_at_0.3 \
    --early_stop_min_delta 0 \
    --early_stop_patience 20 \
    --early_stop_mode max \
    --tb_save_dir tb_logs \
    --tb_name er-example \
    --check_val_every_n_epoch 1 \
    --batch_size 32 \
    --eval_batch_size 64 \
    --num_workers -1 \
    --multiprocessing_context fork \
    --sim_threshold 0.3 \
    --sim_threshold 0.5 \
    --sim_threshold 0.7 \
    --ann_k 100 \
    --m 64 \
    --max_m0 64 \
    --ef_construction 150 \
    --ef_search -1 \
    --random_seed 42 \
    --model_save_dir trained-models/er/ \
    --use_gpu 1
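
Once training finishes, the checkpoint lands in the directory passed as --model_save_dir; list it to find the path you will pass as --model_save_filepath in the predict command below:

$ ls trained-models/er/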

Deduplication Predict

$ entity_embed_predict \
    --model_save_filepath "trained-models/er/...fill-here..." \
    --unlabeled_csv example-data/er-unlabeled.csv \
    --csv_encoding utf-8 \
    --eval_batch_size 50 \
    --num_workers -1 \
    --multiprocessing_context fork \
    --sim_threshold 0.3 \
    --ann_k 100 \
    --m 64 \
    --max_m0 64 \
    --ef_construction 150 \
    --ef_search -1 \
    --random_seed 42 \
    --output_json example-data/er-prediction.json \
    --use_gpu 1
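
To take a quick look at the candidate pairs written to the output file, Python's built-in json.tool can pretty-print it (a simple inspection sketch; the exact JSON structure may differ between versions):

$ python -m json.tool example-data/er-prediction.json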

Record Linkage Train

$ entity_embed_train \
    --field_config_json example-data/rl-field-config.json \
    --train_csv example-data/rl-train.csv \
    --valid_csv example-data/rl-valid.csv \
    --test_csv example-data/rl-test.csv \
    --unlabeled_csv example-data/rl-unlabeled.csv \
    --csv_encoding utf-8 \
    --cluster_field cluster \
    --source_field __source \
    --left_source amazon \
    --embedding_size 300 \
    --lr 0.001 \
    --min_epochs 5 \
    --max_epochs 100 \
    --early_stop_monitor valid_recall_at_0.3 \
    --early_stop_min_delta 0 \
    --early_stop_patience 20 \
    --early_stop_mode max \
    --tb_save_dir tb_logs \
    --tb_name rl-example \
    --check_val_every_n_epoch 1 \
    --batch_size 32 \
    --eval_batch_size 64 \
    --num_workers -1 \
    --multiprocessing_context fork \
    --sim_threshold 0.3 \
    --sim_threshold 0.5 \
    --sim_threshold 0.7 \
    --ann_k 100 \
    --m 64 \
    --max_m0 64 \
    --ef_construction 150 \
    --ef_search -1 \
    --random_seed 42 \
    --model_save_dir trained-models/rl/ \
    --use_gpu 1

Record Linkage Predict

$ entity_embed_predict \
    --model_save_filepath "trained-models/rl/...fill-here..." \
    --unlabeled_csv example-data/rl-unlabeled.csv \
    --csv_encoding utf-8 \
    --source_field __source \
    --left_source amazon \
    --eval_batch_size 50 \
    --num_workers -1 \
    --multiprocessing_context fork \
    --sim_threshold 0.3 \
    --ann_k 100 \
    --m 64 \
    --max_m0 64 \
    --ef_construction 150 \
    --ef_search -1 \
    --random_seed 42 \
    --output_json example-data/rl-prediction.json \
    --use_gpu 1