Python PDF to JSON Conversion for Efficient Data Pre-processing

Converting PDF files to JSON is a simple task in Python, and it is helpful in many data pre-processing situations.

In this example, the root folder is defined in a bash script. The script traverses all documents below that folder and calls the Python program for each one. The Python program reads the PDF file and writes a JSON file.
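
The traversal in this post is done in bash. As a rough Python equivalent, here is a minimal sketch of how the PDF files below a root folder could be collected. The function names `find_pdf_files` and `json_path_for`, and the rule of writing the JSON file next to the PDF, are assumptions for illustration and are not taken from the repository.

```python
from pathlib import Path


def find_pdf_files(root_folder):
    """Return all PDF files below root_folder, sorted for a stable order."""
    root = Path(root_folder)
    # rglob walks the folder recursively; the suffix check ignores case
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


def json_path_for(pdf_path):
    """Assumed naming rule: file.pdf becomes file.json in the same folder."""
    return Path(pdf_path).with_suffix(".json")
```

Each path returned by `find_pdf_files` could then be handed to the extraction function, with the result written to the path from `json_path_for`.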

Each PDF file is converted into one JSON file. The JSON file contains the path of the source file and a JSON array, with each page of the original PDF file represented as an entry in the array.

My GitHub repository Convert PDF to JSON contains example source code. The example below shows the resulting format.

{ 
  "file": "/path/to/pdf_file/file.pdf",
  "pdf_pages" : 
    [{"page":"1","content":"xxx"},
     {"page":"2","content":"yyy"}]
}
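
To make the format concrete, here is a minimal sketch that builds this structure from a list of page texts and writes it to a file with the standard `json` module. The helper names `build_pdf_document` and `save_as_json` are hypothetical and only serve as an illustration.

```python
import json


def build_pdf_document(pdf_path, page_texts):
    """Build the dictionary in the format shown above from a list of page texts."""
    pages = [
        # Page numbers start at 1 and are stored as strings, as in the example
        {"page": str(number), "content": text}
        for number, text in enumerate(page_texts, start=1)
    ]
    return {"file": pdf_path, "pdf_pages": pages}


def save_as_json(document, json_path):
    """Write the dictionary as UTF-8 encoded, indented JSON."""
    with open(json_path, "w", encoding="utf-8") as json_file:
        json.dump(document, json_file, ensure_ascii=False, indent=2)
```

Setting `ensure_ascii=False` keeps non-ASCII characters from the PDF text readable in the JSON file instead of escaping them.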

This small function extracts the text of each page.

import PyPDF2


def extract_pages_from_pdf(pdf_path):
    # Collects one dictionary per page
    pages = []

    with open(pdf_path, "rb") as file:
        pdf_reader = PyPDF2.PdfReader(file)
        num_pages = len(pdf_reader.pages)

        for page_num in range(num_pages):
            # The reader counts pages from 0, the output counts from 1
            print(f"***** {page_num + 1} / {num_pages} ****")
            page = pdf_reader.pages[page_num]
            text = page.extract_text()
            print(f"*****\n {text} \n****\n")
            value = {"page": str(page_num + 1), "content": text}
            pages.append(value)

    # Same keys as in the JSON format shown above
    pdf_information = {"file": pdf_path, "pdf_pages": pages}
    return pdf_information

That’s all for this blog post.

I hope this was useful to you, and let’s see what’s next.
Greetings,
Thomas

#python, #pdf, #json, #preprocessing
