Part 3: Practical Examples

Regression Analysis Following Li (2008)

See: Li, F. (2008). Annual report readability, current earnings, and earnings persistence. Journal of Accounting and Economics, 45(2–3), 221–247. https://doi.org/10.1016/j.jacceco.2008.02.003

Regression Equation

Table 3, Li (2008)

$$\mathrm{Earnings}_{t} = \alpha + \beta \, \mathrm{Length}_{t} + \epsilon$$

Earnings is operating earnings (data178 of Compustat) scaled by book value of assets.

Length is the natural logarithm of the total number of words in the annual report.

Step-by-step Tutorial

This tutorial will walk you through a simple regression analysis on financial data using Python. We'll use the pandas, numpy, and statsmodels libraries along with my library, jtext.

Step 1: Import Necessary Libraries


import pandas as pd  # Data handling
import numpy as np  # Numerical computations
import statsmodels.api as sm  # Statistical modeling
from jtext import JText  # Text processing

These libraries are essential for loading, transforming, and analyzing data.

Step 2: Load the Dataset


fin_df = pd.read_csv("financials.csv", encoding="utf-16", sep="\t")
fin_df.columns  # Check the available columns

Load a CSV file containing financial data.
The file uses UTF-16 encoding and tabs (\t) as separators.
Use fin_df.columns to inspect the column names.
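To check that these read settings behave as expected before pointing them at the real file, the following minimal sketch writes a tiny tab-separated UTF-16 file and reads it back. The sample rows, the file name sample_financials.csv, and the variable names are hypothetical and stand in for the actual financials.csv.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row sample; the real financials.csv is not reproduced here.
sample = "docID\t資産\nS0001\t1000\nS0002\t2500\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample_financials.csv"
    path.write_text(sample, encoding="utf-16")  # Python adds the byte-order mark
    sample_df = pd.read_csv(path, encoding="utf-16", sep="\t")

print(sample_df.columns.tolist())  # Column names as pandas parsed them
print(sample_df.dtypes)            # Confirm numeric columns were not read as text
```

If a numeric column shows up with dtype object, the file probably contains non-numeric markers in that column, and it would need converting before Step 4.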

Step 3: Filter Relevant Data


fin_df = fin_df[["docID", "annual_csv", "資産", "経常利益又は経常損失（△）"]]
fin_df.head()  # View the first few rows

Select columns that are relevant for the analysis:

docID: Document identifier.
annual_csv: Text data for annual reports.
資産 (Assets): Total assets.
経常利益又は経常損失（△） (Ordinary profit or loss; △ marks a loss): The earnings measure used in place of Li's operating earnings.

Step 4: Build Variables


fin_df["Earnings"] = fin_df["経常利益又は経常損失（△）"] / fin_df["資産"]

Create a new column, Earnings, defined as ordinary profit or loss divided by total assets. This mirrors the scaling by book value of assets in Li (2008).
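One thing to watch with this ratio is that pandas returns inf, not an error, when the denominator is zero. The sketch below shows one possible guard on toy data. The column names ordinary_profit and assets are hypothetical English stand-ins for the Japanese column names, and treating non-positive assets as missing is an assumption, not something the original analysis specifies.

```python
import numpy as np
import pandas as pd

# Toy data; the third firm reports zero assets.
toy = pd.DataFrame({
    "ordinary_profit": [50.0, -20.0, 10.0],
    "assets": [1000.0, 400.0, 0.0],
})

# Assumption: non-positive assets are treated as missing, so the ratio
# becomes NaN instead of inf and is dropped later along with other gaps.
assets = toy["assets"].where(toy["assets"] > 0)
toy["Earnings"] = toy["ordinary_profit"] / assets

print(toy)
```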

Step 5: Process Text Data


len_list = []  # Initialize a list to store log-transformed lengths

for f in fin_df["annual_csv"]:
    text = JText(f)  # Create a JText object
    length = text.get_length()  # Get the length of the text
    log_len = np.log(length)  # Compute the natural log of the length
    len_list.append(log_len)  # Append to the list

fin_df["length"] = len_list  # Add the list as a new column

For each entry in annual_csv:

Measure the length of the text.
Take the natural logarithm of that length, matching the definition of Length in the regression equation.
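The same two steps can be sketched without an explicit loop. Because this example has to run without jtext, Python's built-in len (a character count) is used here as a stand-in for JText(...).get_length(); that substitution and the helper log_length are assumptions for illustration only. The helper also returns NaN for empty texts, since the logarithm of zero is not finite.

```python
import numpy as np
import pandas as pd


def log_length(n_units):
    """Return ln(n_units), or NaN when the count is not positive."""
    return np.log(n_units) if n_units > 0 else np.nan


# Toy texts; the second one is empty on purpose.
texts = pd.Series(["売上高は増加した。", "", "abc"])

lengths = texts.map(len)               # Stand-in for JText(...).get_length()
log_lengths = lengths.map(log_length)  # Natural log, NaN for empty texts

print(pd.DataFrame({"length_raw": lengths, "length": log_lengths}))
```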

Step 6: Clean the Data


fin_df = fin_df.dropna()

Remove every row that has a missing value in any column, so the regression runs only on complete observations.
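Note that dropna() does not remove infinite values, which can arise from a zero denominator or from the logarithm of zero. As a hedged variant, the sketch below converts infinities to NaN first and restricts the check to the two regression columns. The toy data and the names cols, cleaned, and n_dropped are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy data with one NaN and two infinite values.
toy = pd.DataFrame({
    "docID": ["A", "B", "C", "D"],
    "Earnings": [0.05, np.nan, np.inf, -0.02],
    "length": [9.1, 8.7, 9.5, -np.inf],
})

cols = ["Earnings", "length"]  # Only the variables that enter the regression

# Turn +/-inf into NaN, then drop rows incomplete in the regression columns.
cleaned = toy.replace([np.inf, -np.inf], np.nan).dropna(subset=cols)
n_dropped = len(toy) - len(cleaned)

print(f"Dropped {n_dropped} of {len(toy)} rows")
```

Reporting the number of dropped rows makes it easier to notice when a data problem, not genuine missingness, is shrinking the sample.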

Step 7: Prepare Data for Regression


X = sm.add_constant(fin_df["length"])  # Add a constant for the intercept
y = fin_df["Earnings"]  # Target variable

Add a constant column to include an intercept term in the regression model.
Define the dependent variable y as Earnings.

Step 8: Fit the Regression Model


model = sm.OLS(y, X).fit()
print(model.summary())

Fit an Ordinary Least Squares (OLS) regression model.
Print a summary of the regression results, including coefficients, R-squared values, and p-values.