TEXT MINING FOR HISTORIANS COURSE
UNIT 6: ANALYSIS WITH PYTHON Pt 2
This unit comprises four lectures covering more advanced text mining analysis with Python.
Lecture A: Processing Texts
This session runs in an interactive notebook on MyBinder. Click for more information on how to access and run a notebook. An overview of all interactive materials is available here.
This lesson introduces core Python objects, such as lists and dictionaries, that you will need when processing text files. We discuss the application of Natural Language Processing tools to historical documents. More precisely, we show how to use NLTK and spaCy to split a text into tokens and to analyse the grammatical structure of a sentence with part-of-speech tagging.
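As a rough illustration of how lists and dictionaries come into play when processing text, here is a minimal sketch that uses only the standard library. The regular-expression tokenizer is a simplified stand-in assumed for this example, not the NLTK or spaCy tokenizers used in the notebook, and the sentence is invented.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a crude stand-in for an NLP tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


# Invented example sentence, for illustration only.
sentence = "The officer visited the parish, and the officer wrote a report."

# A list keeps the tokens in reading order.
tokens = tokenize(sentence)

# A Counter is a dictionary that maps each token to its frequency.
frequencies = Counter(tokens)

print(tokens)
print(frequencies.most_common(2))
```

A dedicated tokenizer handles abbreviations, hyphenation and punctuation far more carefully than this pattern, which is one reason the lecture relies on NLP libraries.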
Lecture B: Corpus Selection
This session runs in an interactive notebook on MyBinder. Click for more information on how to access and run a notebook. An overview of all interactive materials is available here.
In this notebook, we introduce techniques for selecting relevant information from large data sets. We discuss how to filter and select documents based on their metadata as well as their textual content. The strategies covered here allow you to select documents that are relevant to your research question and to build question-specific subcorpora.
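One possible way to combine a metadata filter with a content filter is sketched below. The records, the field names (`id`, `year`, `place`, `text`) and the helper `select` are all hypothetical and invented for illustration; the notebook may organise its data quite differently.

```python
# Invented toy records; a real corpus would be loaded from files.
corpus = [
    {"id": "doc1", "year": 1858, "place": "Poplar", "text": "Cholera cases fell this year."},
    {"id": "doc2", "year": 1866, "place": "Poplar", "text": "An outbreak of cholera was reported."},
    {"id": "doc3", "year": 1866, "place": "Hackney", "text": "Housing conditions remain poor."},
    {"id": "doc4", "year": 1871, "place": "Hackney", "text": "Smallpox and cholera were both recorded."},
]


def select(docs, start_year, end_year, keyword):
    """Keep documents dated within [start_year, end_year] whose text mentions keyword."""
    keyword = keyword.lower()
    return [
        doc
        for doc in docs
        if start_year <= doc["year"] <= end_year and keyword in doc["text"].lower()
    ]


# Build a question-specific subcorpus: cholera reports from 1860 to 1875.
subcorpus = select(corpus, 1860, 1875, "cholera")
selected_ids = [doc["id"] for doc in subcorpus]
print(selected_ids)
```

Simple substring matching also catches the keyword inside longer words, so a token-based or regular-expression match may be preferable for real research questions.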
Lecture C: Corpus Exploration
This session runs in an interactive notebook on MyBinder. Click for more information on how to access and run a notebook. An overview of all interactive materials is available here.
After building a subcorpus, you need tools to explore and analyse the texts meaningfully. We focus on a wide range of tools provided by the Natural Language Toolkit, such as concordance or Keyword in Context (KWIC), collocation analysis and feature selection. We use reports written by Victorian Medical Officers of Health as a case study.
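To make the idea of a Keyword in Context display concrete, here is a minimal sketch written in plain Python. It is not the NLTK implementation; the function `kwic`, its `window` parameter and the example sentence are assumptions made for this illustration. Raw bigram counts are included only as a crude first step towards collocation analysis, which normally uses statistical association measures.

```python
from collections import Counter


def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each occurrence of keyword."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


# Invented example text, split on whitespace for simplicity.
tokens = "the water supply was poor and the water was often foul".split()

hits = kwic(tokens, "water", window=2)
for left, keyword, right in hits:
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>10} | {keyword} | {right}")

# Count adjacent word pairs as a rough indication of recurring combinations.
bigram_counts = Counter(zip(tokens, tokens[1:]))
print(bigram_counts.most_common(1))
```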
Lecture D: Trends over Time
This session runs in an interactive notebook on MyBinder. Click for more information on how to access and run a notebook. An overview of all interactive materials is available here.
The last notebook in the text mining series focuses on studying discursive trends over time. The goal of this notebook is to understand the changing content of British political manifestos.
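A simple way to approach trends over time is to track the relative frequency of a term per year. The sketch below assumes documents are available as `(year, text)` pairs; the years and texts are invented placeholders rather than real manifesto content, and `relative_frequency_by_year` is a hypothetical helper, not a function from the notebook.

```python
from collections import defaultdict

# Invented placeholder documents as (year, text) pairs.
documents = [
    (1945, "housing housing jobs peace"),
    (1979, "tax jobs tax tax"),
    (1997, "education education education jobs"),
]


def relative_frequency_by_year(docs, term):
    """Return {year: occurrences of term / total tokens}, with years in ascending order."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for year, text in docs:
        tokens = text.lower().split()
        hits[year] += tokens.count(term)
        totals[year] += len(tokens)
    # Skip years without any tokens to avoid dividing by zero.
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


trend = relative_frequency_by_year(documents, "tax")
for year, share in trend.items():
    print(year, round(share, 2))
```

Dividing by the total number of tokens makes years with documents of different lengths comparable, which raw counts would not.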
FURTHER READINGS
*Karsdorp, Folgert, Mike Kestemont, and Allen Riddell. Humanities Data Analysis: Case Studies with Python. Princeton University Press, 2021.
*Kokensparger, Brian. Guide to Programming for the Digital Humanities: Lessons for Introductory Python. Springer, 2018.
*Loper, Edward, and Steven Bird. "NLTK: The Natural Language Toolkit." arXiv preprint cs/0205028, 2002.
*Lutz, Mark. Learning Python: Powerful Object-Oriented Programming. O'Reilly Media, 2013.
*Martelli, Alex, Anna Ravenscroft, and David Ascher. Python Cookbook. O'Reilly Media, 2005.
*Perkins, Jacob. Python 3 Text Processing with NLTK 3 Cookbook. Packt Publishing, 2014.
*Salganik, Matthew J. Bit by Bit: Social Research in the Digital Age. Princeton University Press, 2019.
*VanderPlas, Jake. Python Data Science Handbook: Essential Tools for Working with Data. O'Reilly Media, 2016.
EXERCISES
Integrated into the course (please see above)