Tuesday, June 18, 2024

Effective Prompt Engineering for Document Processing

Abdellatif Abdelfattah

Prompt engineering, the practice of crafting effective instructions for language models, matters especially when working with documents. Well-designed prompts can substantially improve how accurately and reliably AI systems extract, analyze, and summarize information from text. This guide covers techniques for writing more effective prompts for document processing tasks.

Understanding the Document Processing Challenge

Processing documents with LLMs presents several unique challenges:

  1. Length constraints: Documents are often longer than an LLM's context window
  2. Structure preservation: Important formatting and structure must be maintained
  3. Factual accuracy: Outputs must accurately reflect document content
  4. Consistency: Processing must be reliable across similar documents

Core Principles for Document Prompts

1. Be Specific About Format and Structure

When extracting information, clearly specify the output format you want:

Less effective:

Extract the key information from this document.

More effective:

Extract the following fields from this document:
- Document Title: [The title of the document]
- Date: [Publication or creation date in YYYY-MM-DD format]
- Author(s): [Names of authors]
- Key Points: [Bullet list of 3-5 main points]
- Conclusion: [1-2 sentence summary of conclusions]

Format your response as a JSON object with these exact field names.
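
Asking for exact field names makes the response easy to check in code. The sketch below is one possible check, not a required step: `EXPECTED_FIELDS` and `check_fields` are hypothetical names, and the field list is copied from the prompt above on the assumption that the model uses those labels verbatim as JSON keys.

```python
import json

# Field names copied from the prompt above; keep the two in sync.
EXPECTED_FIELDS = ["Document Title", "Date", "Author(s)", "Key Points", "Conclusion"]


def check_fields(raw_response: str) -> list[str]:
    """Return a list of problems found in the model's JSON response."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]
    # Report fields the model dropped, then fields it invented.
    problems = [f"missing field: {name}" for name in EXPECTED_FIELDS if name not in data]
    problems += [f"unexpected field: {name}" for name in data if name not in EXPECTED_FIELDS]
    return problems
```

An empty list means the response has exactly the requested fields; anything else can be logged or used to trigger a retry.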

2. Provide Clear Processing Instructions

For analysis tasks, be explicit about the steps the model should take:

Less effective:

Analyze this legal contract.

More effective:

Analyze this legal contract by following these steps:
1. Identify all parties mentioned and their roles
2. List all key obligations for each party
3. Note any deadlines or important dates
4. Highlight unusual or potentially problematic clauses
5. Summarize termination conditions

For each step, cite the specific section number where the information appears.

3. Include Examples for Complex Tasks

For more complex extraction or formatting tasks, include examples:

Less effective:

Extract all financial figures from this quarterly report.

More effective:

Extract all financial figures from this quarterly report using the following format:

Example:
"The company reported revenue of $5.2M" -> { "type": "revenue", "amount": 5200000, "currency": "USD", "period": "quarterly" }
"Our operating margin improved to 23%" -> { "type": "operating_margin", "value": 23, "unit": "percent" }

Extract ALL financial figures mentioned, including revenue, profit, margins, growth rates, and any other numerical financial indicators.
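
If the examples live in code rather than in a hand-written string, they are easier to extend and reuse. The following is a minimal sketch of that idea; `EXAMPLES` and `build_few_shot_prompt` are hypothetical names, and the `REPORT:` label at the end is an assumption about how the report text is attached.

```python
import json

# Example pairs taken from the prompt above.
EXAMPLES = [
    ("The company reported revenue of $5.2M",
     {"type": "revenue", "amount": 5200000, "currency": "USD", "period": "quarterly"}),
    ("Our operating margin improved to 23%",
     {"type": "operating_margin", "value": 23, "unit": "percent"}),
]


def build_few_shot_prompt(report_text: str, examples=None) -> str:
    """Assemble the extraction prompt with one line per worked example."""
    examples = EXAMPLES if examples is None else examples
    lines = [
        "Extract all financial figures from this quarterly report using the following format:",
        "",
        "Example:",
    ]
    for sentence, record in examples:
        # json.dumps keeps the example output valid JSON.
        lines.append(f'"{sentence}" -> {json.dumps(record)}')
    lines += [
        "",
        "Extract ALL financial figures mentioned, including revenue, profit, margins, "
        "growth rates, and any other numerical financial indicators.",
        "",
        "REPORT:",
        report_text,
    ]
    return "\n".join(lines)
```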

Chunking and Context Management

For longer documents that exceed context limits, use these techniques:

1. Chunk with Overlap

When processing a document in chunks, include overlap and context:

CONTEXT: You are analyzing a legal contract between Acme Corp and XYZ Inc. This is part 3 of 5 
from the contract. The previous section covered the initial service terms.

CONTENT: [Paste chunk of text here]

Based ONLY on the content provided in this section, identify and list all:
1. Obligations for Acme Corp
2. Obligations for XYZ Inc
3. Important dates or deadlines mentioned

If a point seems to reference something not fully explained in this section, note that it 
may require context from another section.
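
The prompt above supplies the context; the overlap itself has to come from the code that splits the document. Here is a minimal sketch, assuming character-based windows for simplicity (a token-based split would need a tokenizer). Both function names and the default sizes are hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into windows; each window repeats the last `overlap`
    characters of the previous one so boundary sentences are not lost."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


def build_chunk_prompt(chunk: str, part: int, total: int,
                       description: str, previous: str, task: str) -> str:
    """Wrap one chunk in the CONTEXT/CONTENT framing used in the prompt above."""
    return (
        f"CONTEXT: You are analyzing {description}. "
        f"This is part {part} of {total}. {previous}\n\n"
        f"CONTENT: {chunk}\n\n"
        f"{task}"
    )
```

Splitting on paragraph or section boundaries instead of fixed character counts usually gives cleaner chunks, at the cost of uneven chunk sizes.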

2. Hierarchical Processing

For very long documents, use a multi-stage approach:

# First pass per section
Summarize this section of the document in 3-5 bullet points, focusing on the main facts and arguments presented.

# Second pass with section summaries
Below are summaries of each section of a research paper. Create a comprehensive summary of the entire paper by synthesizing these section summaries. Structure your response with these headings:
- Research Question
- Methodology 
- Key Findings
- Limitations
- Conclusions
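
The two passes can be wired together with a small function. This sketch takes the model call as a parameter, so it makes no assumption about which client or model is used; `summarize_hierarchically`, `call_llm`, and the two template names are hypothetical.

```python
from typing import Callable

SECTION_PROMPT = (
    "Summarize this section of the document in 3-5 bullet points, "
    "focusing on the main facts and arguments presented.\n\nSECTION:\n{section}"
)
SYNTHESIS_PROMPT = (
    "Below are summaries of each section of a research paper. Create a "
    "comprehensive summary of the entire paper by synthesizing these section "
    "summaries. Structure your response with these headings:\n"
    "- Research Question\n- Methodology\n- Key Findings\n- Limitations\n- Conclusions\n\n"
    "SECTION SUMMARIES:\n{summaries}"
)


def summarize_hierarchically(sections: list[str], call_llm: Callable[[str], str]) -> str:
    """First pass: summarize each section. Second pass: synthesize the summaries."""
    section_summaries = [call_llm(SECTION_PROMPT.format(section=s)) for s in sections]
    # Number the summaries so the second pass can see the original order.
    numbered = "\n\n".join(
        f"Section {i}:\n{summary}" for i, summary in enumerate(section_summaries, start=1)
    )
    return call_llm(SYNTHESIS_PROMPT.format(summaries=numbered))
```

For a document with N sections this makes N + 1 model calls, and the first-pass calls are independent of each other, so they could be run in parallel.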

Extraction Tasks

Structured Data Extraction

For extracting specific fields from semi-structured documents:

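import json

import openai  # assumes openai>=1.0 with OPENAI_API_KEY set in the environment


def extract_invoice_data(invoice_text):
    """Ask the model for the invoice fields as JSON and return them as a dict."""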
    prompt = f"""
    Extract the following information from this invoice as JSON:
    - Invoice Number
    - Date Issued (in YYYY-MM-DD format)
    - Due Date (in YYYY-MM-DD format)
    - Vendor Name
    - Vendor Address
    - Customer Name
    - Customer Address
    - Line Items (as an array of objects with description, quantity, unit price, and total)
    - Subtotal
    - Tax Amount
    - Total Amount
    
    If any field is not found, output null for that field.
    
    INVOICE:
    {invoice_text}
    """
    
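    # JSON mode (response_format) needs a model that supports it, such as
    # gpt-4-turbo or gpt-4o; the base gpt-4 model does not accept this parameter.
    response = openai.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )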
    
    return json.loads(response.choices[0].message.content)
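
Valid JSON is not the same as correct data, so it can help to sanity-check the parsed result before using it. The sketch below is one way to do that. It assumes the model used the field labels from the prompt as JSON keys ("Date Issued", "Subtotal", and so on) and returned amounts as plain numbers; neither is guaranteed, and `validate_invoice` is a hypothetical helper.

```python
import re
from datetime import date
from decimal import Decimal, InvalidOperation

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def _to_decimal(value):
    """Convert a JSON number (or numeric string) to Decimal, or None if it is not numeric."""
    try:
        return Decimal(str(value))
    except InvalidOperation:
        return None


def validate_invoice(data: dict) -> list[str]:
    """Return a list of problems; null (None) fields are allowed and skipped."""
    problems = []
    for key in ("Date Issued", "Due Date"):
        value = data.get(key)
        if value is None:
            continue
        try:
            if not (isinstance(value, str) and ISO_DATE.fullmatch(value)):
                raise ValueError
            date.fromisoformat(value)  # rejects impossible dates such as 2024-02-30
        except ValueError:
            problems.append(f"{key} is not a valid YYYY-MM-DD date: {value!r}")
    raw = [data.get(k) for k in ("Subtotal", "Tax Amount", "Total Amount")]
    if None not in raw:
        subtotal, tax, total = (_to_decimal(v) for v in raw)
        if None in (subtotal, tax, total):
            problems.append("amounts are not plain numbers")
        elif subtotal + tax != total:
            problems.append("Subtotal + Tax Amount does not equal Total Amount")
    return problems
```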

Table Extraction

For extracting and processing tables:

The following text contains a table from a financial document. 
Extract and convert it to a properly formatted markdown table.

If there are any footnotes or references in the table, include them as a separate note after the table.

Make sure the columns are aligned and all data is properly categorized.
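
A markdown table returned by the model can be parsed back into rows to confirm that every row has the same number of columns as the header. This is a minimal sketch under stated assumptions: every table line starts with `|`, the second table line is the separator row, and cells contain no escaped pipes. `parse_markdown_table` is a hypothetical helper.

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Parse the first pipe-delimited table in `markdown` into a list of row dicts.

    Lines that do not start with '|' (such as footnotes after the table) are ignored.
    Raises ValueError if no table is found or a row has the wrong number of cells.
    """
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        # Drop the outer pipes, then split the remaining text into cells.
        inner = line[1:-1] if len(line) > 1 and line.endswith("|") else line[1:]
        rows.append([cell.strip() for cell in inner.split("|")])
    if len(rows) < 2:
        raise ValueError("no markdown table found")
    header, separator, body = rows[0], rows[1], rows[2:]
    if not all(cell and set(cell) <= set("-:") for cell in separator):
        raise ValueError("second table line is not a separator row")
    records = []
    for number, cells in enumerate(body, start=1):
        if len(cells) != len(header):
            raise ValueError(f"row {number} has {len(cells)} cells, expected {len(header)}")
        records.append(dict(zip(header, cells)))
    return records
```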

Analysis Tasks

Comparative Analysis

For comparing multiple documents or sections:

Compare the following two legal clauses related to data privacy policies. For each clause:

1. Identify the core obligations
2. Note any specific requirements or exceptions
3. Assess the scope (what's covered and what's not)

Then provide a direct comparison highlighting:
- Key similarities
- Important differences
- Which clause is more restrictive
- Which clause offers more protection to the end user

CLAUSE 1:
[First clause text]

CLAUSE 2:
[Second clause text]

Critical Evaluation

For evaluating arguments or claims in a document:

Analyze the following research paper abstract and evaluate its claims:

1. Identify the main claim or hypothesis
2. List the key evidence presented to support this claim
3. Identify any potential limitations, biases, or weaknesses in the methodology mentioned
4. Evaluate whether the conclusion follows logically from the evidence presented
5. Note any alternative interpretations of the findings that might be valid

Present your analysis in a balanced way, considering both strengths and weaknesses.

Summarization Tasks

Executive Summary

For creating concise, high-level summaries:

Create an executive summary of this quarterly earnings report that would be suitable for busy executives. Your summary should:

1. Be approximately 250 words
2. Start with the most significant financial results (revenue, profit, EPS)
3. Highlight year-over-year and quarter-over-quarter changes
4. Include key business developments that impacted performance
5. Note forward guidance or projections
6. Use a formal, factual tone without marketing language

Structure your summary with clear paragraphs and include a brief bullet list of 3-4 key takeaways at the end.

Multi-level Summarization

For flexible-length summaries:

Create three different summaries of this document:

1. One-sentence summary (approximately 25 words)
2. Paragraph summary (approximately 100 words)
3. Detailed summary (approximately 500 words)

Each summary should be self-contained and capture progressively more detail while maintaining accuracy to the source material.
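
Models often miss word-count targets, so it can be worth measuring the lengths they actually return. The sketch below assumes the three summaries have already been separated into a dict; the key names, `TARGETS`, `length_report`, and the 30% tolerance are all illustrative assumptions.

```python
# Target lengths from the prompt above, in words.
TARGETS = {"one_sentence": 25, "paragraph": 100, "detailed": 500}


def length_report(summaries: dict[str, str], tolerance: float = 0.3) -> dict[str, bool]:
    """Map each summary name to True if its word count is within
    `tolerance` (as a fraction) of the target length."""
    report = {}
    for name, target in TARGETS.items():
        words = len(summaries.get(name, "").split())
        report[name] = abs(words - target) <= tolerance * target
    return report
```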

Increasing Reliability

Chain-of-Thought Prompting

Encourage the model to reason through its process:

Read the following contract clause and determine whether it allows for unilateral termination by either party.

Think through your analysis step by step:
1. First, identify the key parts of the clause related to termination
2. Note which parties are mentioned in relation to termination rights
3. Identify any conditions that must be met for termination
4. Determine if notice is required and how much
5. Assess whether there are different conditions for different parties

After completing these steps, provide your final answer with a brief explanation.

Self-Verification

Ask the model to verify its own outputs:

Extract all dates mentioned in this legal document in ISO format (YYYY-MM-DD).

After extracting the dates, review your list and verify:
1. That each item is actually a date (not a section number or other numerical value)
2. That each date is formatted correctly in YYYY-MM-DD format
3. That you've included all dates from the document

If you find any errors in your initial extraction, correct them before submitting your final list.
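
Model self-checks can be backed up with a deterministic check in code. As a minimal sketch, the function below separates well-formed ISO dates from everything else in an already-parsed list of strings; `split_valid_dates` is a hypothetical helper, and it cannot tell whether a date was missed, only whether the returned items are real dates.

```python
import re
from datetime import date

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def split_valid_dates(candidates: list[str]) -> tuple[list[str], list[str]]:
    """Return (valid, invalid): items that are real YYYY-MM-DD dates, and the rest."""
    valid, invalid = [], []
    for item in candidates:
        try:
            if not ISO_DATE.fullmatch(item):
                raise ValueError("wrong shape")
            date.fromisoformat(item)  # rejects impossible dates such as 2024-02-30
            valid.append(item)
        except ValueError:
            invalid.append(item)
    return valid, invalid
```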

Practical Examples

Financial Document Analysis

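import openai  # assumes openai>=1.0 with OPENAI_API_KEY set in the environment


def analyze_financial_report(report_text):
    """Return the model's analysis of a quarterly earnings report as plain text."""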
    prompt = f"""
    You are a financial analyst reviewing a quarterly earnings report. Analyze the following report and provide:
    
    1. KEY METRICS (as a JSON object):
       - Revenue (with % change YoY)
       - Net Income (with % change YoY)
       - EPS (with % change YoY)
       - Operating Margin
       - Any other prominently highlighted KPIs
    
    2. PERFORMANCE ANALYSIS (in 200-300 words):
       - Main drivers of performance
       - Areas of strength
       - Areas of concern
       - Management's forward guidance
    
    3. KEY RISKS (list 3-5 bullet points):
       - Identify explicit risks mentioned in the report
    
    Report content:
    {report_text}
    """
    
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.choices[0].message.content

Academic Paper Summarization

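import openai  # assumes openai>=1.0 with OPENAI_API_KEY set in the environment


def summarize_research_paper(paper_text):
    """Return a structured summary of a research paper as plain text."""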
    prompt = f"""
    Summarize the following research paper using this structure:
    
    RESEARCH QUESTION:
    [1-2 sentences stating the primary question or objective of the research]
    
    METHODOLOGY:
    [2-3 sentences describing the approach, data sources, and analytical methods]
    
    KEY FINDINGS:
    - [List 3-5 bullet points of the most important results]
    
    IMPLICATIONS:
    [2-3 sentences on why these findings matter and how they contribute to the field]
    
    LIMITATIONS:
    [1-2 sentences on stated limitations of the research]
    
    Base your summary STRICTLY on the content of the paper and avoid introducing external information or opinions.
    
    Paper content:
    {paper_text}
    """
    
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.choices[0].message.content

Best Practices for Document Processing

  1. Start specific, then refine: Begin with very specific prompts and adjust based on results
  2. Maintain context: Always provide relevant context when processing document chunks
  3. Verify outputs: Include steps for the model to verify its outputs against the source
  4. Use appropriate formatting: Request outputs in formats that match your downstream needs (JSON, markdown, etc.)
  5. Consider chunking strategies: Develop thoughtful approaches to breaking up long documents
  6. Use examples: Provide examples for complex or nuanced tasks
  7. Test with variations: Test your prompts across multiple similar documents to ensure consistency
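
Testing across similar documents is easier with a small harness. The sketch below is one possible shape for it: `process` stands for any function that sends a document through a prompt and returns the output, and `check` for any validator that returns a list of problems. Both are supplied by the caller, and all names here are hypothetical.

```python
from typing import Callable


def run_across_documents(
    documents: dict[str, str],
    process: Callable[[str], str],
    check: Callable[[str], list[str]],
) -> dict[str, list[str]]:
    """Run `process` on every document and collect the problems `check` reports.

    Only documents with at least one problem appear in the result,
    so an empty dict means every document passed.
    """
    failures = {}
    for name, text in documents.items():
        problems = check(process(text))
        if problems:
            failures[name] = problems
    return failures
```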

Conclusion

Effective prompt engineering for document processing requires a thoughtful approach that considers the unique challenges of working with longer texts. By being specific about formats, providing clear instructions, managing context effectively, and encouraging verification, you can significantly improve the accuracy and reliability of LLM-based document processing.

Remember that prompt engineering is an iterative process. Continuously refine your prompts based on the results you observe, and maintain a library of effective prompts for recurring document processing tasks.