David Smith
Associate Professor
Research interests
- Efficient inference for machine learning models with complex latent structure
- Modeling natural language structures, such as morphology, syntax, and semantics
- Modeling mutations in texts as they propagate through social networks, and variation in language across space and time
- Interactive information retrieval and machine learning for expert users
Education
- PhD in Computer Science, Johns Hopkins University
- BA in Classics, Harvard University
Biography
David A. Smith is an associate professor in the Khoury College of Computer Sciences at Northeastern University, based in Boston. He is a founding member of the NULab for Texts, Maps, and Networks, Northeastern University’s center for the digital humanities and computational social sciences.
Prior to joining Northeastern, Smith was a professor at the University of Massachusetts and a contributor to Tufts University's Perseus Digital Library, one of the most widely used linguistic and cultural research systems in the humanities. His research has been funded by the NSF, NEH, DARPA, ONR, AFRL, the Mellon Foundation, and Google, and he has published widely in natural language processing and computational linguistics, information retrieval, digital libraries, digital humanities, and political science.
Recent publications
- MONSTERMASH: Multidirectional, Overlapping, Nested, Spiral Text Extraction for Recognition Models of Arabic-Script Handwriting
Citation: Danlu Chen, Jacob Murel, Taimoor Shahid, Xiang Zhang, Jonathan Parkes Allen, Taylor Berg-Kirkpatrick, David A. Smith. (2024). MONSTERMASH: Multidirectional, Overlapping, Nested, Spiral Text Extraction for Recognition Models of Arabic-Script Handwriting. ICDAR (Workshops 2), 87-101. https://doi.org/10.1007/978-3-031-70642-4_6
- Retrieving and Analyzing Translations of American Newspaper Comics with Visual Evidence
Citation: Jacob Murel, David A. Smith. (2024). Retrieving and Analyzing Translations of American Newspaper Comics with Visual Evidence. ICDAR (Workshops 1), 125-137. https://doi.org/10.1007/978-3-031-70645-5_9
- Self-training and Active Learning with Pseudo-relevance Feedback for Handwriting Detection in Historical Print
Citation: Jacob Murel, David A. Smith. (2024). Self-training and Active Learning with Pseudo-relevance Feedback for Handwriting Detection in Historical Print. ICDAR (3), 305-324. https://doi.org/10.1007/978-3-031-70543-4_18
- Mind the Gap: Analyzing Lacunae with Transformer-Based Transcription
Citation: Jaydeep Borkar, David A. Smith. (2024). Mind the Gap: Analyzing Lacunae with Transformer-Based Transcription. CoRR, abs/2407.00250. https://doi.org/10.48550/arXiv.2407.00250
- Detecting Manuscript Annotations in Historical Print: Negative Evidence and Evaluation Metrics
Citation: Jacob Murel, David A. Smith. (2024). Detecting Manuscript Annotations in Historical Print: Negative Evidence and Evaluation Metrics. ICPRAM, 745-752. https://doi.org/10.5220/0012365600003654
- Automatic Collation for Diversifying Corpora: Commonly Copied Texts as Distant Supervision for Handwritten Text Recognition
Citation: David A. Smith, Jacob Murel, Jonathan Parkes Allen, Matthew Thomas Miller. (2023). Automatic Collation for Diversifying Corpora: Commonly Copied Texts as Distant Supervision for Handwritten Text Recognition. CHR, 206-221. https://ceur-ws.org/Vol-3558/paper1708.pdf
- Testing the Limits of Neural Sentence Alignment Models on Classical Greek and Latin Texts and Translations
Citation: Caroline Craig, Kartik Goyal, Gregory R. Crane, Farnoosh Shamsian, David A. Smith. (2023). Testing the Limits of Neural Sentence Alignment Models on Classical Greek and Latin Texts and Translations. CHR, 530-553. https://ceur-ws.org/Vol-3558/paper6193.pdf
- Composition and Deformance: Measuring Imageability with a Text-to-Image Model
Citation: Si Wu, David A. Smith. (2023). Composition and Deformance: Measuring Imageability with a Text-to-Image Model. CoRR, abs/2306.03168. https://doi.org/10.48550/arXiv.2306.03168
- Adapting Transformer Language Models for Predictive Typing in Brain-Computer Interfaces
Citation: Shijia Liu, David A. Smith. (2023). Adapting Transformer Language Models for Predictive Typing in Brain-Computer Interfaces. CoRR, abs/2305.03819. https://doi.org/10.48550/arXiv.2305.03819
- An Experiment in Live Collaborative Programming on the Croquet Shared Experience Platform
Citation: Yoshiki Ohshima, Aran Lunzer, Jenn Evans, Vanessa Freudenberg, Brian Upton, David A. Smith. (2022). An Experiment in Live Collaborative Programming on the Croquet Shared Experience Platform. Programming, 46-53. https://doi.org/10.1145/3532512.3535224
- Text mining Mill: Computationally detecting influence in the writings of John Stuart Mill from library records
Citation: Helen O'Neill, Anne Welsh, David A. Smith, Glenn Roe, Melissa Terras. (2021). Text mining Mill: Computationally detecting influence in the writings of John Stuart Mill from library records. Digit. Scholarsh. Humanit., 36, 1013-1029. https://doi.org/10.1093/llc/fqab010
- Content-based Models of Quotation
Citation: Ansel MacLaughlin, David A. Smith. (2021). Content-based Models of Quotation. EACL, 2296-2314. https://doi.org/10.18653/v1/2021.eacl-main.195
- Contrastive Training for Models of Information Cascades
Citation: Shaobin Xu, David A. Smith. (2018). Contrastive Training for Models of Information Cascades. AAAI, 483-490. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17086
- Detecting and Evaluating Local Text Reuse in Social Networks
Citation: Shaobin Xu, David Smith, Abigail Mullen, Ryan Cordell. (2014). Detecting and Evaluating Local Text Reuse in Social Networks. ACL Joint Workshop on Social Dynamics and Personal Attributes in Social Media.