boltuix/NeuroBERT-NER is a fine-tuned transformer model for Named Entity Recognition (NER), built on boltuix/NeuroBERT-Mini. It labels English text with 36 BIO entity tags covering 18 entity types (e.g., PERSON, GPE, ORG, DATE), and is optimized for edge AI, IoT, and applications such as information extraction, chatbots, and knowledge graphs. At ~50 MB and ~11M parameters, it offers low-latency, offline processing for privacy-first environments.
NeuroBERT-NER delivers high-accuracy NER with lightweight efficiency for edge devices.
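To sanity-check the size and parameter-count claims on your own machine, the snippet below counts parameters and measures the on-disk footprint of a locally saved copy. It is a minimal sketch; the save directory name is arbitrary.

from pathlib import Path
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("boltuix/NeuroBERT-NER")

# Total parameter count (expected to be on the order of ~11M)
num_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {num_params / 1e6:.1f}M")

# Approximate on-disk size after saving the checkpoint locally
save_dir = Path("neurobert-ner-check")  # hypothetical local directory
model.save_pretrained(save_dir)
size_mb = sum(f.stat().st_size for f in save_dir.rglob("*") if f.is_file()) / 1e6
print(f"Checkpoint size: {size_mb:.0f} MB")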
Perform NER with the following Python code:
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("boltuix/NeuroBERT-NER")
model = AutoModelForTokenClassification.from_pretrained("boltuix/NeuroBERT-NER")
# Input text
text = "Barack Obama visited Microsoft headquarters in Seattle on January 2025."
inputs = tokenizer(text, return_tensors="pt")
# Run inference
with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits.argmax(dim=-1)
# Map predictions to labels
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
label_map = model.config.id2label
labels = [label_map[p.item()] for p in predictions[0]]
# Print results
for token, label in zip(tokens, labels):
    if token not in tokenizer.all_special_tokens:
        print(f"{token:15} → {label}")
Example Output:
Barack → B-PERSON
Obama → I-PERSON
visited → O
Microsoft → B-ORG
headquarters → O
in → O
Seattle → B-GPE
on → O
January → B-DATE
2025 → I-DATE
. → O
pip install transformers torch pandas pyarrow seqeval
Python: 3.8+ | Storage: ~50 MB | Optional: CUDA for GPU
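As an alternative to the manual decoding loop above, the built-in token-classification pipeline can merge WordPiece pieces into whole entities. This is a minimal sketch; it assumes the checkpoint's config carries the id2label mapping shown in the tag table below.

from transformers import pipeline

# "simple" aggregation merges B-/I- pieces into entity spans
ner = pipeline(
    "token-classification",
    model="boltuix/NeuroBERT-NER",
    aggregation_strategy="simple",
)

for ent in ner("Barack Obama visited Microsoft headquarters in Seattle on January 2025."):
    print(f'{ent["word"]:25} {ent["entity_group"]:12} {ent["score"]:.2f}')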
Supports 36 B-/I- entity tags plus the O tag, following the BIO scheme:
Tag Name | Purpose |
---|---|
O | Outside of any named entity |
B-CARDINAL | Beginning of a cardinal number |
B-DATE | Beginning of a date |
B-EVENT | Beginning of an event |
B-FAC | Beginning of a facility |
B-GPE | Beginning of a geopolitical entity |
B-LANGUAGE | Beginning of a language |
B-LAW | Beginning of a law/legal document |
B-LOC | Beginning of a non-GPE location |
B-MONEY | Beginning of a monetary value |
B-NORP | Beginning of a nationality/religious/political group |
B-ORDINAL | Beginning of an ordinal number |
B-ORG | Beginning of an organization |
B-PERCENT | Beginning of a percentage |
B-PERSON | Beginning of a person's name |
B-PRODUCT | Beginning of a product |
B-QUANTITY | Beginning of a quantity |
B-TIME | Beginning of a time |
B-WORK_OF_ART | Beginning of a work of art |
I-CARDINAL | Inside of a cardinal number |
I-DATE | Inside of a date |
I-EVENT | Inside of an event |
I-FAC | Inside of a facility |
I-GPE | Inside of a geopolitical entity |
I-LANGUAGE | Inside of a language |
I-LAW | Inside of a law/legal document |
I-LOC | Inside of a non-GPE location |
I-MONEY | Inside of a monetary value |
I-NORP | Inside of a nationality/religious/political group |
I-ORDINAL | Inside of an ordinal number |
I-ORG | Inside of an organization |
I-PERCENT | Inside of a percentage |
I-PERSON | Inside of a person's name |
I-PRODUCT | Inside of a product |
I-QUANTITY | Inside of a quantity |
I-TIME | Inside of a time |
I-WORK_OF_ART | Inside of a work of art |
Example: "Microsoft opened in Tokyo on January 2025" → [B-ORG, O, O, B-GPE, O, B-DATE, I-DATE]
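When consuming raw tag sequences directly, the B-/I- prefixes have to be merged back into entity spans. Below is a minimal, dependency-free sketch of that BIO decoding step; it assumes you already have aligned word-level (word, tag) pairs like the example above.

# Merge a word-level BIO tag sequence into (entity_type, text) spans
def bio_to_spans(words, tags):
    spans, current_words, current_type = [], [], None
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            if current_type:                      # close the previous entity
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_words.append(word)            # continue the current entity
        else:                                     # "O" or a stray I- tag
            if current_type:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_type:
        spans.append((current_type, " ".join(current_words)))
    return spans

words = "Microsoft opened in Tokyo on January 2025".split()
tags = ["B-ORG", "O", "O", "B-GPE", "O", "B-DATE", "I-DATE"]
print(bio_to_spans(words, tags))
# [('ORG', 'Microsoft'), ('GPE', 'Tokyo'), ('DATE', 'January 2025')]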
Evaluated on the test split (~12,217 examples):
Metric | Score |
---|---|
Precision | 0.85 |
Recall | 0.87 |
F1 Score | 0.86 |
Accuracy | 0.92 |
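The fine-tuning walkthrough below computes these metrics with seqeval. As a minimal illustration of how entity-level scores are derived from gold and predicted tag sequences (toy sequences here, not the actual test split):

from seqeval.metrics import precision_score, recall_score, f1_score, accuracy_score

# Toy gold/predicted tag sequences (one sentence each), purely for illustration
y_true = [["B-PERSON", "I-PERSON", "O", "B-ORG", "O", "B-GPE", "O", "B-DATE", "I-DATE", "O"]]
y_pred = [["B-PERSON", "I-PERSON", "O", "B-ORG", "O", "B-GPE", "O", "B-DATE", "O", "O"]]

print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("Accuracy :", accuracy_score(y_true, y_pred))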
Fine-tune on the boltuix/conll2025-ner dataset:
# Install libraries
!pip install transformers datasets tokenizers seqeval pandas pyarrow -q
# Disable Weights & Biases
import os
os.environ["WANDB_MODE"] = "disabled"
# Import libraries
import pandas as pd
import datasets
import numpy as np
from transformers import BertTokenizerFast
from transformers import DataCollatorForTokenClassification
from transformers import AutoModelForTokenClassification
from transformers import TrainingArguments, Trainer
import evaluate
from transformers import pipeline
from collections import defaultdict
import json
# Load dataset (assumes the parquet file has been downloaded locally, e.g. from the boltuix/conll2025-ner dataset)
parquet_file = "/content/conll2025-ner.parquet"
df = pd.read_parquet(parquet_file)
# Convert to Hugging Face Dataset
conll2025 = datasets.Dataset.from_pandas(df)
# Extract unique tags
all_tags = set()
for example in conll2025:
    all_tags.update(example["ner_tags"])
unique_tags = sorted(list(all_tags))
tag2id = {tag: i for i, tag in enumerate(unique_tags)}
id2tag = {i: tag for i, tag in enumerate(unique_tags)}
# Convert tags to IDs
def convert_tags_to_ids(example):
    example["ner_tags"] = [tag2id[tag] for tag in example["ner_tags"]]
    return example
conll2025 = conll2025.map(convert_tags_to_ids)
# Split dataset
dataset_dict = {
    "train": conll2025.filter(lambda x: x["split"] == "train"),
    "validation": conll2025.filter(lambda x: x["split"] == "validation"),
    "test": conll2025.filter(lambda x: x["split"] == "test")
}
conll2025 = datasets.DatasetDict(dataset_dict)
# Initialize tokenizer
tokenizer = BertTokenizerFast.from_pretrained("boltuix/NeuroBERT-Mini")
# Tokenize and align labels
def tokenize_and_align_labels(examples, label_all_tokens=True):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens ([CLS], [SEP]) get -100 so the loss ignores them
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First sub-token of a word gets the word's label
                label_ids.append(label[word_idx])
            else:
                # Remaining sub-tokens: repeat the label or mask them out
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
tokenized_datasets = conll2025.map(tokenize_and_align_labels, batched=True)
# Initialize model
model = AutoModelForTokenClassification.from_pretrained("boltuix/NeuroBERT-Mini", num_labels=len(unique_tags))
# Training arguments
args = TrainingArguments(
    "boltuix/bert-ner",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
    report_to="none"
)
# Data collator
data_collator = DataCollatorForTokenClassification(tokenizer)
# Evaluation metric
metric = evaluate.load("seqeval")
# Compute metrics
def compute_metrics(eval_preds):
    pred_logits, labels = eval_preds
    pred_logits = np.argmax(pred_logits, axis=2)
    predictions = [
        [unique_tags[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_logits, labels)
    ]
    true_labels = [
        [unique_tags[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_logits, labels)
    ]
    results = metric.compute(predictions=predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
# Initialize trainer
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
# Train
trainer.train()
# Save model
model.save_pretrained("boltuix/bert-ner")
tokenizer.save_pretrained("tokenizer")
# Update config
id2label = {str(i): label for i, label in enumerate(unique_tags)}
label2id = {label: str(i) for i, label in enumerate(unique_tags)}
config = json.load(open("boltuix/bert-ner/config.json"))
config["id2label"] = id2label
config["label2id"] = label2id
json.dump(config, open("boltuix/bert-ner/config.json", "w"))
# Load fine-tuned model
model_fine_tuned = AutoModelForTokenClassification.from_pretrained("boltuix/bert-ner")
# Inference pipeline
nlp = pipeline("token-classification", model=model_fine_tuned, tokenizer=tokenizer)
# Example inference
example = "On July 4th, 2023, President Joe Biden visited the United Nations headquarters in New York to deliver a speech about international law and donated $5 million to relief efforts."
ner_results = nlp(example)
print("NER results:", ner_results)
# Process address
example = "This page contains information about the property located at 1275 Kinnear Rd, Columbus, OH, 43212."
ner_results = nlp(example)
entities = defaultdict(list)
current_entity = ""
current_type = ""
for item in ner_results:
    entity = item["entity"]
    word = item["word"]
    if word.startswith("##"):
        # Continuation of the previous word (WordPiece sub-token)
        current_entity += word[2:]
    elif entity.startswith("B-"):
        # New entity starts; flush the previous one first
        if current_entity and current_type:
            entities[current_type].append(current_entity.strip())
        current_type = entity[2:].lower()
        current_entity = word
    elif entity.startswith("I-") and entity[2:].lower() == current_type:
        # Continuation of the current entity
        current_entity += " " + word
    else:
        # "O" tag or mismatched I- tag: flush the current entity
        if current_entity and current_type:
            entities[current_type].append(current_entity.strip())
        current_entity = ""
        current_type = ""
if current_entity and current_type:
    entities[current_type].append(current_entity.strip())
print("Structured NER output:")
print(json.dumps(dict(entities), indent=2))
Tips: Adjust hyperparameters (e.g., learning_rate, batch_size) or enable fp16 for faster training. Verify split sizes with dataset.num_rows.
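For example, a GPU run with mixed precision and a few more epochs could use TrainingArguments like the sketch below. The values are illustrative only, not the settings used to train the released checkpoint.

from transformers import TrainingArguments

# Illustrative settings; tune for your own hardware and dataset
args = TrainingArguments(
    output_dir="boltuix/bert-ner",
    eval_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,            # mixed precision; requires a CUDA GPU
    report_to="none",
)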
Evaluate on custom data:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from seqeval.metrics import classification_report
import torch
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("boltuix/NeuroBERT-NER")
model = AutoModelForTokenClassification.from_pretrained("boltuix/NeuroBERT-NER")
# Sample test data
texts = ["Barack Obama visited Microsoft in Seattle on January 2025."]
true_labels = [["B-PERSON", "I-PERSON", "O", "B-ORG", "O", "B-GPE", "O", "B-DATE", "I-DATE", "O"]]
pred_labels = []
for text in texts:
    inputs = tokenizer(text, return_tensors="pt", is_split_into_words=False, return_attention_mask=True)
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = outputs.logits.argmax(dim=-1)[0].cpu().numpy()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    word_ids = inputs.word_ids(batch_index=0)
    word_preds = []
    previous_word_idx = None
    for idx, word_idx in enumerate(word_ids):
        # Keep only the prediction for the first sub-token of each word
        if word_idx is None or word_idx == previous_word_idx:
            continue
        label = model.config.id2label[predictions[idx]]
        word_preds.append(label)
        previous_word_idx = word_idx
    pred_labels.append(word_preds)
# Evaluate
print("Predicted:", pred_labels)
print("True :", true_labels)
print("\nEvaluation Report:\n")
print(classification_report(true_labels, pred_labels))
Visualize tag distribution:
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
# Load dataset
df = pd.read_parquet("conll2025-ner.parquet")
# Flatten ner_tags
all_tags = [tag for tags in df["ner_tags"] for tag in tags]
tag_counts = Counter(all_tags)
# Plot
plt.figure(figsize=(12, 7))
plt.bar(tag_counts.keys(), tag_counts.values(), color="#36A2EB")
plt.title("CoNLL 2025 NER: Tag Distribution", fontsize=16)
plt.xlabel("NER Tag", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.xticks(rotation=45, ha="right", fontsize=10)
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.tight_layout()
plt.savefig("ner_tag_distribution.png")
plt.show()
Model | Dataset | Parameters | F1 Score | Size |
---|---|---|---|---|
NeuroBERT-NER | conll2025-ner | ~11M | 0.86 | ~50 MB |
BERT-base-NER | CoNLL-2003 | ~110M | ~0.89 | ~400 MB |
DistilBERT-NER | CoNLL-2003 | ~66M | ~0.85 | ~200 MB |
spaCy (en_core_web_lg) | OntoNotes | - | ~0.83 | ~800 MB |
May 28, 2025: Released v1.1 with fine-tuning on conll2025-ner.
NeuroBERT-NER offers lightweight, high-accuracy NER for edge AI and IoT, covering 36 BIO entity tags with an F1 score of 0.86. Perfect for information extraction, chatbots, and knowledge graphs, it's your solution for efficient NLP in 2025. Explore it on Hugging Face! (Placeholder URL, update when available)