LexisNexis Newspaper Analysis in Python

Lilian Li
5 min read · Feb 3, 2022

LexisNexis is one of the largest databases for news articles, publications, and law cases. It is widely used by researchers and analysts interested in all kinds of text analysis (e.g., newspaper analysis, business case studies…). For my thesis, I used LexisNexis for a content analysis project on articles covering Covid-19 regulations. In this blog, I want to introduce some Python code that can help you get started with your own project.

The first few steps of a newspaper analysis are usually: 1) find your keywords, 2) search on LexisNexis, and 3) download your files.

The first two steps are of course up to you. There is only one tiny problem with the third step: the database offers downloads in PDF, Word, or RTF format. You might ask, “Where is my beloved .json?” OK, we have our first problem: how do we open RTF files in batches?

Open (multiple) RTF files in Python

Here is what we do. First, a small reminder: when you download the files, store them in a systematic way.
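
For example, a folder layout like this works well with the recursive search below (the group and file names here are just hypothetical placeholders):

rtf/
    group_1/
        article_one.rtf
        article_two.rtf
    group_A/
        ...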

Then, you can use glob to find the files recursively. The following code block looks for .rtf articles with any gibberish name inside the folders group_1, group_A, …, group_x, under the parent folder rtf. It finds each file and stores its pathname in a list.

# Generate a list of file names
import glob

file_name = []
for rtf in glob.glob('rtf/group_*/*.rtf'):
    file_name.append(rtf)
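
As a side note: if your subfolders are nested more deeply, glob can also match any level of nesting with the ** pattern and recursive=True (an optional variation, not required for a flat group_* layout):

# ** matches any number of nested directories when recursive=True
file_name = sorted(glob.glob('rtf/**/*.rtf', recursive=True))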

Second, install the striprtf package with pip install striprtf. Then, here is a small function you can use to read an RTF file by its name and return the plain text.

from striprtf.striprtf import rtf_to_text

def read_rtf_to_text(file_name):
    '''
    read one file by its name
    :param file_name: path to the .rtf file
    :return: plain text
    '''
    with open(file_name, 'r') as file:
        file_text = file.read()
    text = rtf_to_text(file_text, errors="ignore")
    return text

Lastly, you can now read all the RTF files you have by feeding each file’s path to the function and storing the result in a list.

articles_list = []
for name in file_name:
    article = read_rtf_to_text(name)
    articles_list.append(article)
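
One practical note: with hundreds of files, a single malformed RTF can crash the whole loop. Here is a small guard you could add (my own addition, not part of the original workflow) that skips unreadable files instead of stopping:

articles_list = []
for name in file_name:
    try:
        articles_list.append(read_rtf_to_text(name))
    except Exception as err:
        # Report and skip files that striprtf cannot parse
        print(f"Skipping {name}: {err}")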

Extract information from the text

So now we have a huge list of plain texts. What next? That depends on the goal of your research. You may be interested in only the texts, or in more information as well: for example, the source, the length, the published date, the title…

The good thing about LexisNexis data is its consistent format and metadata. The format of the news articles is mostly the same from document to document, and each one has a metadata section where you can find more details about the article.

This opens up the possibility of using some simple re or partition calls to collect the information we need. Let’s say we need the load date (January 18, 2021) from the classification section and the name of the news source (The Australian).

For the load date, you can use partition, as the date is almost at the end of the document and is always preceded by “Load-Date:”.

# Remove the "End of Document" line after the date and strip the empty space
article.partition("Load-Date:")[2].replace("End of Document", "").strip()

Here is a good demonstration of how it works (link). But basically, it finds Load-Date: in the text and uses it as a separator.

For example, "I think it is cool".partition("it") produces a 3-tuple ('I think ', 'it', ' is cool'). Indexing it with [0] returns 'I think '.
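
To make this concrete, here is a minimal, runnable demonstration on a made-up snippet of article text (the string below is invented for illustration):

snippet = "Some article text. Load-Date: January 18, 2021 End of Document"
before, sep, after = snippet.partition("Load-Date:")
print(after.replace("End of Document", "").strip())  # January 18, 2021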

Next, let’s use re to find the news outlet The Australian. The code below shows the simplest way of doing it (though it is prone to errors).

import re

outlet_re = re.search(r"The Australian", article)
outlet = outlet_re.group(0) if outlet_re else ""

It finds the first occurrence of “The Australian” in an article. If you wish to match several outlets, replace the string inside r"" with an alternation, for example The Times|The Irish Independent.

As the news outlet sits right below the title, this works pretty well. However, you have to watch out for special situations: for example, the title may contain “The Australian” while the outlet below is actually “Daily Mirror”. In that case, the code would give you a false answer.
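
One way to reduce such false matches (a sketch of my own, assuming the outlet name really does appear near the top of the document) is to search only the first few lines instead of the whole article:

# Search only the header lines, where the outlet usually appears
header = "\n".join(article.splitlines()[:10])
outlet_re = re.search(r"The Australian|Daily Mirror", header)
outlet = outlet_re.group(0) if outlet_re else ""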

This code also only works when you already know what you are looking for. When you don’t know the data, you can write more general regular expressions to build your own solution. RegExr is a good place to start.

In this blog, we walked through the first few steps of text analysis with data from LexisNexis. We read the .rtf files using glob and striprtf. Then, we applied some simple re and partition calls to extract the information already present in the text.

Finally, here is what it looks like when you combine everything and build a DataFrame. Hope it helps!

import glob
import re
import pandas as pd
from functions import read_rtf_to_text  # the helper defined above, saved in functions.py

# Generate a list of file names
file_name = []
for rtf in glob.glob('rtf/group_*/*.rtf'):
    file_name.append(rtf)

# Sort list of file names from group 1-84 and A-Z
file_name = sorted(file_name)

# Read articles by its name and save to a list
articles_list = []

for name in file_name:
    article = read_rtf_to_text(name)
    articles_list.append(article)

# Construct a dataframe
articles_df = pd.DataFrame()
articles_df["text"] = articles_list
articles_df["title"] = file_name

# Find the load date and news outlet for each article
date = []
outlet_column = []

for article in articles_df["text"]:
    date.append(article.partition("Load-Date:")[2].replace("End of Document", "").strip())
    outlet_re = re.search(
        r"The Irish Times|Irish Independent|Crikey|The Australian",
        article)
    outlet = outlet_re.group(0) if outlet_re else ""
    outlet_column.append(outlet)
articles_df["date"] = date
articles_df["outlet"] = outlet_column
