The project aims to understand how NLP/ML technologists' design decisions affect fairness, representation, and bias, particularly among linguistically and ethnically diverse user groups. The study focuses on consensus-building strategies, success metrics, and quality assurance methods in NLP data production, examining professionals' perspectives and experiences in addressing exclusion, discrimination, and bias in NLP dataset production. By exploring the tensions and challenges in NLP data practices such as transcription, annotation, and analysis, along with the involvement of diverse users, the study seeks a holistic view of current approaches. The research team emphasizes strategies for increasing fairness and amplifying diverse user representation, thereby strengthening AI responsibility within NLP data practices. The ultimate goal is to map this landscape and address the imbalance of power and agency in these processes, promoting more responsible and equitable AI development.
Building upon the critical recognition that diversity among data annotators is essential for creating unbiased conversational speech and text systems, this study investigates the broader implications of current annotation practices. It is informed by prior research highlighting the challenges posed by misaligned ground-truth evaluation metrics, the lack of diversity among data annotators, and the minimal involvement of diverse communities in the development of NLP systems.
The study uses a mixed-methods approach to investigate current data practices among NLP practitioners.
Survey insights report (analyzed using Power BI): LINK
Jay L. Cunningham, Kevin Shao, Nathanael Elias Mengist, Rock Pang, et al. 2025. Advancing NLP Data Equity: Practitioner Responsibility and Accountability in NLP Data Practices. Under review at ACM FAccT 2025.