See: Li, F. (2008). Annual report readability, current earnings, and earnings persistence. Journal of Accounting and Economics, 45(2–3), 221–247. https://doi.org/10.1016/j.jacceco.2008.02.003
Table 3, Li (2008)
Earnings is operating earnings (data178 of Compustat) scaled by book value of assets.
Length is the natural logarithm of the total number of words in the annual report.
This tutorial walks you through a simple regression analysis on financial data using Python. We'll use the pandas, numpy, and statsmodels libraries along with my library, jtext.
Step 1: Import Necessary Libraries
import pandas as pd # Data handling
import numpy as np # Numerical computations
import statsmodels.api as sm # Statistical modeling
from jtext import JText # Text processing
These libraries are essential for loading, transforming, and analyzing data.
Step 2: Load the Dataset
fin_df = pd.read_csv("financials.csv", encoding="utf-16", sep="\t")
fin_df.columns # Check the available columns
Load a CSV file containing financial data.
The file uses UTF-16 encoding and tabs (\t) as separators.
Use fin_df.columns to inspect the column names.
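As a minimal sketch of what the two read_csv arguments do, the snippet below writes a tiny tab-separated file in UTF-16 and reads it back. The file name, the temporary directory, and the sample values are made up for illustration; only the encoding and sep arguments mirror the step above.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row sample; the real financials.csv has more columns.
sample = pd.DataFrame({"docID": ["S0001", "S0002"], "資産": [1000, 2500]})

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample_financials.csv")

# Write with the same encoding and separator that the tutorial's file uses.
sample.to_csv(path, encoding="utf-16", sep="\t", index=False)

# Reading with matching arguments recovers the Japanese column name intact.
loaded = pd.read_csv(path, encoding="utf-16", sep="\t")
print(list(loaded.columns))
```

If the encoding or separator does not match the file, pandas typically raises a decoding error or returns a single fused column, so checking the column list right after loading is a cheap sanity test.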
Step 3: Filter Relevant Data
fin_df = fin_df[["docID", "annual_csv", "資産", "経常利益又は経常損失(△)"]]
fin_df.head() # View the first few rows
Select the columns relevant to the analysis:
docID: Document identifier.
annual_csv: Text data for annual reports.
資産 (Assets): Total assets.
経常利益又は経常損失(△) (Ordinary income or ordinary loss; △ marks a loss): Key financial performance metric.
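Selecting a column that does not exist raises a KeyError, which is easy to trigger with long Japanese names and full-width parentheses. The sketch below checks for missing columns before selecting; the demo_df frame and its values are hypothetical, and only the column names come from the step above.

```python
import pandas as pd

# Column names copied from the selection step above.
required = ["docID", "annual_csv", "資産", "経常利益又は経常損失(△)"]

# Hypothetical frame that lacks the ordinary-income column.
demo_df = pd.DataFrame({"docID": ["S0001"], "annual_csv": ["a.csv"], "資産": [1000]})

# List every required column that is absent instead of failing on the first one.
missing = [col for col in required if col not in demo_df.columns]
print(missing)
```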
Step 4: Build Variables
fin_df["Earnings"] = fin_df["経常利益又は経常損失(△)"] / fin_df["資産"]
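Create a new column, Earnings: ordinary income (or loss) divided by total assets. Here ordinary income stands in for the operating earnings that Li (2008) scales by the book value of assets.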
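The scaling itself is an element-wise division. The sketch below runs it on invented numbers with shortened, hypothetical column names, and shows one way to handle a zero denominator, which pandas turns into inf rather than an error.

```python
import numpy as np
import pandas as pd

# Hypothetical values; both columns are assumed to be in the same units.
toy = pd.DataFrame({
    "ordinary_income": [50.0, -20.0, 10.0],  # a negative value is a loss
    "assets": [1000.0, 400.0, 0.0],          # the zero is deliberate
})

# Element-wise division; a zero denominator yields inf, not an exception.
toy["Earnings"] = toy["ordinary_income"] / toy["assets"]

# Convert inf to NaN so that a later dropna() removes such rows.
toy["Earnings"] = toy["Earnings"].replace([np.inf, -np.inf], np.nan)
print(toy["Earnings"].tolist())
```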
Step 5: Process Text Data
len_list = [] # Initialize a list to store log-transformed lengths
for f in fin_df["annual_csv"]:
    text = JText(f) # Create a JText object
    length = text.get_length() # Get the length of the text
    log_len = np.log(length) # Compute the natural log of the length
    len_list.append(log_len) # Append to the list
fin_df["length"] = len_list # Add the list as a new column
For each entry in annual_csv:
Measure the length of the text.
Take the natural logarithm of that length, which compresses very large values and matches the definition of Length above.
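The same transformation can be sketched without jtext. The char_length function below is a hypothetical stand-in that counts characters; it is not what JText.get_length() necessarily measures, and Li (2008) counts words. The sample strings are invented.

```python
import numpy as np
import pandas as pd


def char_length(text: str) -> int:
    """Hypothetical stand-in for a text-length measure: number of characters."""
    return len(text)


docs = pd.Series(["短い報告書", "もう少し長い年次報告書の本文", ""])

lengths = docs.map(char_length)

# log(0) is -inf, so mask empty texts as NaN before taking the logarithm.
log_len = np.log(lengths.where(lengths > 0))
print(log_len.round(3).tolist())
```

Masking zero lengths first keeps -inf out of the regression input; rows with NaN can then be dropped together with other missing values.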
Step 6: Clean the Data
fin_df = fin_df.dropna()
Remove rows with a missing value in any column so that the regression runs on complete observations only.
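By default dropna() discards a row if any column is missing. The sketch below contrasts that with the subset argument, which restricts the check to the regression variables; the frame and its note column are hypothetical.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Earnings": [0.05, np.nan, 0.02],
    "length": [9.1, 8.7, 9.4],
    "note": [None, "ok", "ok"],  # hypothetical column not used in the regression
})

# Default: a missing value anywhere removes the row (rows 0 and 1 go).
all_cols = demo.dropna()

# Restricting the check to the model variables removes only row 1.
model_cols = demo.dropna(subset=["Earnings", "length"])
print(len(all_cols), len(model_cols))
```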
Step 7: Prepare Data for Regression
X = sm.add_constant(fin_df["length"]) # Add a constant for the intercept
y = fin_df["Earnings"] # Target variable
Add a constant column to include an intercept term in the regression model.
Define the dependent variable y as Earnings.
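To make the intercept term concrete, the sketch below builds by hand the kind of design matrix that sm.add_constant produces by default: a leading column of ones named const next to the regressor. The length values are invented.

```python
import pandas as pd

# Hypothetical log-lengths for three reports.
length = pd.Series([8.5, 9.0, 9.5], name="length")

# Manual equivalent of adding an intercept: a column of ones placed first.
X_manual = pd.DataFrame({"const": 1.0, "length": length})
print(X_manual)
```

With this matrix the fitted model is Earnings = b0 * const + b1 * length, so b0 is the intercept and b1 is the slope on length.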
Step 8: Fit the Regression Model
model = sm.OLS(y, X).fit()
print(model.summary())
Fit an Ordinary Least Squares (OLS) regression model.
Print a summary of the regression results, including coefficients, R-squared values, and p-values.
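For readers who want to see what the coefficients and R-squared in the summary mean, the sketch below reproduces them with plain numpy on synthetic data. The data, the "true" intercept of 0.30, and the slope of -0.02 are invented for illustration and say nothing about the real dataset or about the results in Li (2008).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with an invented linear relationship plus noise.
length = rng.uniform(8.0, 11.0, size=200)
earnings = 0.30 - 0.02 * length + rng.normal(0.0, 0.01, size=200)

# Design matrix with an intercept column, then the least-squares solution.
X = np.column_stack([np.ones_like(length), length])
beta, *_ = np.linalg.lstsq(X, earnings, rcond=None)

# R-squared = 1 - (residual sum of squares) / (total sum of squares).
residuals = earnings - X @ beta
r_squared = 1.0 - (residuals @ residuals) / np.sum((earnings - earnings.mean()) ** 2)
print(beta, r_squared)
```

The first element of beta is the intercept and the second is the slope on length; on the real data the corresponding numbers appear in the coef column of the statsmodels summary.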