NeuroBERT-NER: Lightweight Named Entity Recognition for Edge AI

By BoltUIX Team in AI & Machine Learning · June 12, 2025

Overview

boltuix/NeuroBERT-NER is a transformer model for Named Entity Recognition (NER), fine-tuned from boltuix/NeuroBERT-Mini. It recognizes 18 entity categories (e.g., PERSON, GPE, ORG, DATE) in English text using 36 B-/I- entity tags, and it is optimized for edge AI, IoT, and applications such as information extraction, chatbots, and knowledge graphs. At ~50 MB and ~11M parameters, it offers low-latency, offline processing for privacy-first environments.

NeuroBERT-NER delivers high-accuracy NER with lightweight efficiency for edge devices.

BoltUIX Team, AI Innovation 2025

Model Details

  • Dataset: boltuix/conll2025-ner (143,709 entries, 6.38 MB) (Placeholder URL, update when available)
  • Entity Types: 36 entity tags (18 categories with B-/I- prefixes), plus the O tag
  • Splits: Train (~115,812), Validation (~15,680), Test (~12,217)
  • Domains: News, user-generated content, research corpora
  • Tasks: Sentence-level and document-level NER
  • Version: v1.1
  • Developer: Boltuix
  • License: Apache-2.0
  • Parameters: ~11M
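
The ~11M figure is easy to verify once the checkpoint is downloaded; a minimal sketch using the standard transformers API:

from transformers import AutoModelForTokenClassification

# Load the checkpoint and count its parameters
model = AutoModelForTokenClassification.from_pretrained("boltuix/NeuroBERT-NER")
total = sum(p.numel() for p in model.parameters())
print(f"Parameters: {total / 1e6:.1f}M")  # expected to print roughly 11M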

Use Cases

Direct Applications

  • Information Extraction: Extract entities like 👤 PERSON, 🌍 GPE, 🗓️ DATE from news and reports.
  • Chatbots & Assistants: Enhance query understanding with entity recognition.
  • Search Enhancement: Enable semantic search with entity-based indexing.
  • Knowledge Graphs: Build graphs linking 🏢 ORG and 👤 PERSON (see the sketch below).
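
As an illustration of the knowledge-graph use case, here is a minimal sketch that links PERSON and ORG entities co-occurring in a sentence. It relies on the transformers token-classification pipeline with aggregation_strategy="simple", which merges B-/I- pieces into whole spans; the sentences are invented for the example:

from collections import defaultdict
from transformers import pipeline

# Aggregated pipeline returns whole entity spans with an "entity_group" key
nlp = pipeline("token-classification", model="boltuix/NeuroBERT-NER",
               aggregation_strategy="simple")

sentences = [
    "Barack Obama visited Microsoft headquarters in Seattle.",
    "Satya Nadella leads Microsoft.",
]

# Edge list: each PERSON is linked to the ORGs in the same sentence
graph = defaultdict(set)
for sentence in sentences:
    entities = nlp(sentence)
    persons = [e["word"] for e in entities if e["entity_group"] == "PERSON"]
    orgs = [e["word"] for e in entities if e["entity_group"] == "ORG"]
    for person in persons:
        graph[person].update(orgs)

for person, orgs in sorted(graph.items()):
    print(person, "->", sorted(orgs))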

Downstream Tasks

  • Domain Adaptation: Fine-tune for 🩺 medical, 📜 legal, or 💸 financial NER.
  • Multilingual Extensions: Retrain for non-English languages.
  • Custom Entities: Adapt for stock tickers, product SKUs, etc. (see the sketch below).
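
For the custom-entity case, the NeuroBERT-Mini backbone can be reloaded with a fresh classification head sized to your own BIO label set and then fine-tuned as in the Training the Model section below; a minimal sketch with a hypothetical stock-ticker scheme (the TICKER labels are invented for illustration):

from transformers import AutoModelForTokenClassification

# Hypothetical custom BIO label set for stock tickers
labels = ["O", "B-TICKER", "I-TICKER"]
id2label = {i: l for i, l in enumerate(labels)}
label2id = {l: i for i, l in enumerate(labels)}

# The token-classification head is freshly initialized and must be fine-tuned
model = AutoModelForTokenClassification.from_pretrained(
    "boltuix/NeuroBERT-Mini",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)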

Limitations

  • English-Only: Trained and evaluated on English text only.
  • Domain Bias: May underperform on informal or code-mixed text.
  • Generalization: May struggle with rare or low-resource entity types.

Getting Started

Inference Code

Perform NER with the following Python code:


from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("boltuix/NeuroBERT-NER")
model = AutoModelForTokenClassification.from_pretrained("boltuix/NeuroBERT-NER")

# Input text
text = "Barack Obama visited Microsoft headquarters in Seattle on January 2025."
inputs = tokenizer(text, return_tensors="pt")

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1)

# Map predictions to labels
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
label_map = model.config.id2label
labels = [label_map[p.item()] for p in predictions[0]]

# Print results
for token, label in zip(tokens, labels):
    if token not in tokenizer.all_special_tokens:
        print(f"{token:15} β†’ {label}")
                        

Example Output:
Barack → B-PERSON
Obama → I-PERSON
visited → O
Microsoft → B-ORG
headquarters → O
in → O
Seattle → B-GPE
on → O
January → B-DATE
2025 → I-DATE
. → O
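
The loop above prints one label per WordPiece token. To get whole entity spans instead, the transformers pipeline API can merge B-/I- pieces automatically; a minimal sketch:

from transformers import pipeline

# aggregation_strategy="simple" merges subword pieces into entity spans
nlp = pipeline("token-classification", model="boltuix/NeuroBERT-NER",
               aggregation_strategy="simple")

for entity in nlp("Barack Obama visited Microsoft headquarters in Seattle on January 2025."):
    print(f'{entity["word"]:25} {entity["entity_group"]:12} {entity["score"]:.2f}')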

Requirements


pip install transformers torch pandas pyarrow seqeval

Python: 3.8+ | Storage: ~50 MB | Optional: CUDA for GPU
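
If a CUDA GPU is available, both the model and the encoded inputs can be moved to it; a minimal sketch:

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Fall back to CPU when no GPU is present
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("boltuix/NeuroBERT-NER")
model = AutoModelForTokenClassification.from_pretrained("boltuix/NeuroBERT-NER").to(device)

inputs = tokenizer("Barack Obama visited Seattle.", return_tensors="pt").to(device)
with torch.no_grad():
    logits = model(**inputs).logits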

Entity Labels

Supports 36 entity tags (18 categories with B-/I- prefixes) plus the O tag, using the BIO scheme:

Tag Name | Purpose | Emoji
O | Outside of any named entity | 🚫
B-CARDINAL | Beginning of a cardinal number | 🔢
B-DATE | Beginning of a date | 🗓️
B-EVENT | Beginning of an event | 🎉
B-FAC | Beginning of a facility | 🏛️
B-GPE | Beginning of a geopolitical entity | 🌍
B-LANGUAGE | Beginning of a language | 🗣️
B-LAW | Beginning of a law/legal document | 📜
B-LOC | Beginning of a non-GPE location | 🗺️
B-MONEY | Beginning of a monetary value | 💸
B-NORP | Beginning of a nationality/religious/political group | 🏳️
B-ORDINAL | Beginning of an ordinal number | 🥇
B-ORG | Beginning of an organization | 🏢
B-PERCENT | Beginning of a percentage | 📊
B-PERSON | Beginning of a person's name | 👤
B-PRODUCT | Beginning of a product | 📱
B-QUANTITY | Beginning of a quantity | ⚖️
B-TIME | Beginning of a time | ⏰
B-WORK_OF_ART | Beginning of a work of art | 🎨
I-CARDINAL | Inside of a cardinal number | 🔢
I-DATE | Inside of a date | 🗓️
I-EVENT | Inside of an event | 🎉
I-FAC | Inside of a facility | 🏛️
I-GPE | Inside of a geopolitical entity | 🌍
I-LANGUAGE | Inside of a language | 🗣️
I-LAW | Inside of a legal document | 📜
I-LOC | Inside of a location | 🗺️
I-MONEY | Inside of a monetary value | 💸
I-NORP | Inside of a NORP entity | 🏳️
I-ORDINAL | Inside of an ordinal number | 🥇
I-ORG | Inside of an organization | 🏢
I-PERCENT | Inside of a percentage | 📊
I-PERSON | Inside of a person's name | 👤
I-PRODUCT | Inside of a product | 📱
I-QUANTITY | Inside of a quantity | ⚖️
I-TIME | Inside of a time | ⏰
I-WORK_OF_ART | Inside of a work of art | 🎨

Example: "Microsoft opened in Tokyo on January 2025" β†’ [B-ORG, O, O, B-GPE, O, B-DATE, I-DATE]

Performance

Evaluated on the test split (~12,217 examples):

Metric | Score
Precision | 0.85
Recall | 0.87
F1 Score | 0.86
Accuracy | 0.92
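
As a sanity check, F1 is the harmonic mean of precision and recall, and the reported figures are internally consistent:

F1 = 2 × P × R / (P + R) = 2 × 0.85 × 0.87 / (0.85 + 0.87) ≈ 0.86

Note that seqeval reports precision, recall, and F1 at the entity level, while accuracy is computed per token.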

Training Setup

  • Hardware: NVIDIA GPU
  • Training Time: ~2 hours
  • Parameters: ~11M
  • Optimizer: AdamW
  • Precision: FP32

Training the Model

Fine-tune on the boltuix/conll2025-ner dataset:


# Install libraries
!pip install transformers datasets tokenizers seqeval pandas pyarrow -q

# Disable Weights & Biases
import os
os.environ["WANDB_MODE"] = "disabled"

# Import libraries
import pandas as pd
import datasets
import numpy as np
from transformers import BertTokenizerFast
from transformers import DataCollatorForTokenClassification
from transformers import AutoModelForTokenClassification
from transformers import TrainingArguments, Trainer
import evaluate
from transformers import pipeline
from collections import defaultdict
import json

# Load dataset
parquet_file = "/content/conll2025-ner.parquet"
df = pd.read_parquet(parquet_file)

# Convert to Hugging Face Dataset
conll2025 = datasets.Dataset.from_pandas(df)

# Extract unique tags
all_tags = set()
for example in conll2025:
    all_tags.update(example["ner_tags"])
unique_tags = sorted(list(all_tags))
tag2id = {tag: i for i, tag in enumerate(unique_tags)}
id2tag = {i: tag for i, tag in enumerate(unique_tags)}

# Convert tags to IDs
def convert_tags_to_ids(example):
    example["ner_tags"] = [tag2id[tag] for tag in example["ner_tags"]]
    return example
conll2025 = conll2025.map(convert_tags_to_ids)

# Split dataset
dataset_dict = {
    "train": conll2025.filter(lambda x: x["split"] == "train"),
    "validation": conll2025.filter(lambda x: x["split"] == "validation"),
    "test": conll2025.filter(lambda x: x["split"] == "test")
}
conll2025 = datasets.DatasetDict(dataset_dict)

# Initialize tokenizer
tokenizer = BertTokenizerFast.from_pretrained("boltuix/NeuroBERT-Mini")

# Tokenize and align labels
def tokenize_and_align_labels(examples, label_all_tokens=True):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
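
# Worked example (splits are illustrative): for the words ["Barack", "Obama"]
# labeled [B-PERSON, I-PERSON], WordPiece might produce
# [CLS] "Ba" "##rack" "Obama" [SEP] with word_ids [None, 0, 0, 1, None].
# The aligned label IDs are then [-100, B-PERSON, B-PERSON, I-PERSON, -100]
# with label_all_tokens=True; positions labeled -100 are ignored by the loss.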

tokenized_datasets = conll2025.map(tokenize_and_align_labels, batched=True)

# Initialize model
model = AutoModelForTokenClassification.from_pretrained("boltuix/NeuroBERT-Mini", num_labels=len(unique_tags))

# Training arguments
args = TrainingArguments(
    "boltuix/bert-ner",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
    report_to="none"
)

# Data collator
data_collator = DataCollatorForTokenClassification(tokenizer)

# Evaluation metric
metric = evaluate.load("seqeval")

# Compute metrics
def compute_metrics(eval_preds):
    pred_logits, labels = eval_preds
    pred_logits = np.argmax(pred_logits, axis=2)
    predictions = [
        [unique_tags[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_logits, labels)
    ]
    true_labels = [
        [unique_tags[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_logits, labels)
    ]
    results = metric.compute(predictions=predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

# Initialize trainer
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

# Train
trainer.train()

# Save model and tokenizer to the same directory
model.save_pretrained("boltuix/bert-ner")
tokenizer.save_pretrained("boltuix/bert-ner")

# Update config
id2label = {str(i): label for i, label in enumerate(unique_tags)}
label2id = {label: str(i) for i, label in enumerate(unique_tags)}
config = json.load(open("boltuix/bert-ner/config.json"))
config["id2label"] = id2label
config["label2id"] = label2id
json.dump(config, open("boltuix/bert-ner/config.json", "w"))

# Load fine-tuned model
model_fine_tuned = AutoModelForTokenClassification.from_pretrained("boltuix/bert-ner")

# Inference pipeline
nlp = pipeline("token-classification", model=model_fine_tuned, tokenizer=tokenizer)

# Example inference
example = "On July 4th, 2023, President Joe Biden visited the United Nations headquarters in New York to deliver a speech about international law and donated $5 million to relief efforts."
ner_results = nlp(example)
print("NER results:", ner_results)

# Process address
example = "This page contains information about the property located at 1275 Kinnear Rd, Columbus, OH, 43212."
ner_results = nlp(example)
entities = defaultdict(list)
current_entity = ""
current_type = ""
for item in ner_results:
    entity = item["entity"]
    word = item["word"]
    if word.startswith("##"):
        current_entity += word[2:]
    elif entity.startswith("B-"):
        if current_entity and current_type:
            entities[current_type].append(current_entity.strip())
        current_type = entity[2:].lower()
        current_entity = word
    elif entity.startswith("I-") and entity[2:].lower() == current_type:
        current_entity += " " + word
    else:
        if current_entity and current_type:
            entities[current_type].append(current_entity.strip())
        current_entity = ""
        current_type = ""
if current_entity and current_type:
    entities[current_type].append(current_entity.strip())
print("Structured NER output:")
print(json.dumps(dict(entities), indent=2))

Tips: Adjust hyperparameters (e.g., learning_rate, batch_size) or enable fp16 for faster training. Verify split sizes with dataset.num_rows.
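
For example, mixed precision is a one-flag change, and the split sizes can be checked directly; a minimal sketch reusing the conll2025 DatasetDict and output directory from the script above:

from transformers import TrainingArguments

# fp16=True enables mixed-precision training on supported NVIDIA GPUs
args = TrainingArguments(
    "boltuix/bert-ner",
    eval_strategy="epoch",
    learning_rate=3e-5,              # example of a tweaked hyperparameter
    per_device_train_batch_size=32,  # larger batch, if memory allows
    num_train_epochs=1,
    weight_decay=0.01,
    fp16=True,
    report_to="none",
)

# Confirm split sizes before training
print(conll2025["train"].num_rows, conll2025["validation"].num_rows, conll2025["test"].num_rows)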

Evaluation Code

Evaluate on custom data:


from transformers import AutoTokenizer, AutoModelForTokenClassification
from seqeval.metrics import classification_report
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("boltuix/NeuroBERT-NER")
model = AutoModelForTokenClassification.from_pretrained("boltuix/NeuroBERT-NER")

# Sample test data
texts = ["Barack Obama visited Microsoft in Seattle on January 2025."]
true_labels = [["B-PERSON", "I-PERSON", "O", "B-ORG", "O", "B-GPE", "O", "B-DATE", "I-DATE", "O"]]

pred_labels = []
for text in texts:
    inputs = tokenizer(text, return_tensors="pt", is_split_into_words=False, return_attention_mask=True)
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = outputs.logits.argmax(dim=-1)[0].cpu().numpy()
    word_ids = inputs.word_ids(batch_index=0)
    word_preds = []
    previous_word_idx = None
    for idx, word_idx in enumerate(word_ids):
        if word_idx is None or word_idx == previous_word_idx:
            continue
        label = model.config.id2label[predictions[idx]]
        word_preds.append(label)
        previous_word_idx = word_idx
    pred_labels.append(word_preds)

# Evaluate
print("Predicted:", pred_labels)
print("True     :", true_labels)
print("\nπŸ“Š Evaluation Report:\n")
print(classification_report(true_labels, pred_labels))
                        

Dataset Details

  • Entries: 143,709
  • Size: 6.38 MB (Parquet)
  • Columns: split, tokens, ner_tags
  • Splits: Train (~115,812), Validation (~15,680), Test (~12,217)
  • Tags: 36 entity tags (18 types with B-/I-), plus O
  • Source: News, user-generated content, research corpora
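
These figures are easy to confirm after downloading the Parquet file; a minimal sketch using the columns listed above (the local path is an assumption):

import pandas as pd

# Load the dataset and inspect its size, splits, and columns
df = pd.read_parquet("conll2025-ner.parquet")
print(len(df))                      # total entries (expected: 143,709)
print(df["split"].value_counts())   # train / validation / test sizes
print(df[["tokens", "ner_tags"]].head())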

Visualizing NER Tags

Visualize tag distribution:


import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt

# Load dataset
df = pd.read_parquet("conll2025_ner.parquet")

# Flatten ner_tags
all_tags = [tag for tags in df["ner_tags"] for tag in tags]
tag_counts = Counter(all_tags)

# Plot
plt.figure(figsize=(12, 7))
plt.bar(tag_counts.keys(), tag_counts.values(), color="#36A2EB")
plt.title("CoNLL 2025 NER: Tag Distribution", fontsize=16)
plt.xlabel("NER Tag", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.xticks(rotation=45, ha="right", fontsize=10)
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.tight_layout()
plt.savefig("ner_tag_distribution.png")
plt.show()

Comparison to Other Models

Model | Dataset | Parameters | F1 Score | Size
NeuroBERT-NER | conll2025-ner | ~11M | 0.86 | ~50 MB
BERT-base-NER | CoNLL-2003 | ~110M | ~0.89 | ~400 MB
DistilBERT-NER | CoNLL-2003 | ~66M | ~0.85 | ~200 MB
spaCy (en_core_web_lg) | OntoNotes | - | ~0.83 | ~800 MB


Last Updated

May 28, 2025 β€” Released v1.1 with fine-tuning on conll2025-ner.

Conclusion

NeuroBERT-NER offers lightweight, high-accuracy NER for edge AI and IoT, recognizing 18 entity categories (36 B-/I- tags) with an F1 score of 0.86. Well suited to information extraction, chatbots, and knowledge graphs, it's an efficient NLP solution for 2025. Explore it on Hugging Face! (Placeholder URL, update when available)
