`pdfplumber` is a Python library for extracting text, tables, and metadata from PDF files

2026/6/16 1:33:39

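`pdfplumber` is a Python library for extracting text, tables, and metadata from PDF files. It is particularly good at PDFs with complex layouts, such as multiple columns, merged cells, and irregular tables. It is built on `pdfminer.six` but offers a friendlier, more intuitive API.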

Installation:

    pip install pdfplumber

Basic usage (extracting text):

    import pdfplumber

    with pdfplumber.open("example.pdf") as pdf:
        full_text = ""
        for page in pdf.pages:
            # extract_text() may return None, so fall back to an empty string
            full_text += page.extract_text() or ""
    print(full_text)

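Extracting tables:

    import pdfplumber

    with pdfplumber.open("example.pdf") as pdf:
        for page in pdf.pages:
            # extract_tables() returns a list; each element is one table as a 2-D list of rows
            tables = page.extract_tables()
            for table in tables:
                print(table)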
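Because each table comes back as plain nested lists, a small helper can turn it into records. The sketch below assumes the first row is a header row and that empty cells arrive as `None`; `table_to_records` is a hypothetical name and is not part of pdfplumber.

```python
from typing import Dict, List, Optional

Table = List[List[Optional[str]]]


def table_to_records(table: Table) -> List[Dict[str, str]]:
    """Convert one extracted table into dicts keyed by the header row.

    Assumption: the first row holds the column names. None cells become "".
    """
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


# Sample data shaped like one element of page.extract_tables()
sample = [["Name", "Qty"], ["Widget", "3"], ["Gadget", None]]
print(table_to_records(sample))
```

Tables with merged or multi-row headers would need their own handling; this only covers the simple one-header-row case.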

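Notes:

Chinese text extracts correctly only when the PDF embeds its fonts with a usable character mapping. Garbled output usually points to the file itself: `layout=True` only changes how spacing is reconstructed, and `pdfplumber.open(..., password="xxx")` is needed only for encrypted files, so neither one repairs a broken encoding.
Parsing happens page by page and can be slow; for large files, process only the pages you need.
It cannot edit or generate PDFs; it is an extraction tool only.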
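The advice about large files can be sketched as a generator that touches only the requested pages. `iter_page_text` and `print_selected_pages` are hypothetical helper names, and page numbers are assumed to be 1-based for the caller's convenience (`pdf.pages` itself is a 0-based list).

```python
from typing import Iterable, Iterator, Tuple


def iter_page_text(pdf, page_numbers: Iterable[int]) -> Iterator[Tuple[int, str]]:
    """Yield (page_number, text) for the requested 1-based page numbers only."""
    for number in page_numbers:
        page = pdf.pages[number - 1]  # pdf.pages is 0-based
        yield number, page.extract_text() or ""


def print_selected_pages(path: str, page_numbers: Iterable[int]) -> None:
    """Open the PDF at `path` and print the text length of the selected pages."""
    import pdfplumber  # imported here so the helper above stays dependency-free

    with pdfplumber.open(path) as pdf:
        for number, text in iter_page_text(pdf, page_numbers):
            print(f"page {number}: {len(text)} characters")
```

Calling `print_selected_pages("example.pdf", [1, 3])` would then parse only the first and third pages instead of the whole file.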

In `pdfplumber`, `page.crop(bbox)` crops the page to a rectangular region (the bounding box). Calling `extract_text()` or `extract_words()` on the cropped page then returns only the text inside those coordinates.

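✅ Coordinate system (important):
pdfplumber does not use the native PDF coordinate system, whose origin is at the bottom-left. It measures from the top-left corner of the page:

The origin `(0, 0)` is at the top-left corner;
`x` increases to the right and `y` increases downward;
`bbox = (x0, top, x1, bottom)` describes the rectangle, where:
- `x0`, `top`: the left edge and the top edge (distance from the top of the page);
- `x1`, `bottom`: the right edge and the bottom edge (distance from the top of the page);
- `x0 < x1` and `top < bottom` are required.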
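Coordinates taken from other tools often follow the native PDF convention (origin at the bottom-left, `y` increasing upward), while pdfplumber's bounding boxes are measured from the top of the page. A minimal conversion sketch follows; `pdf_to_plumber_bbox` is a hypothetical helper, and the page height of 792 points assumes a US Letter page.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def pdf_to_plumber_bbox(x0: float, y0: float, x1: float, y1: float,
                        page_height: float) -> BBox:
    """Convert a bottom-left-origin box (x0, y0, x1, y1) to (x0, top, x1, bottom)."""
    top = page_height - y1      # the upper edge is the larger native y value
    bottom = page_height - y0   # the lower edge is the smaller native y value
    return (x0, top, x1, bottom)


# A box 50 to 100 points below the top edge of a 792-point-high page
print(pdf_to_plumber_bbox(50, 692, 200, 742, page_height=792))  # (50, 50, 200, 100)
```

With a real page, `page.height` supplies the `page_height` argument.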

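📌 Steps:

Open the PDF and locate the target page;
Call `page.crop((x0, top, x1, bottom))` to get a cropped `CroppedPage` object;
Call `extract_text()` on the cropped page (it accepts parameters such as `layout=True/False` and `keep_blank_chars`);
(Optional) Use `page.to_image().debug_tablefinder({})` to draw the detected table regions as a locating aid; `page.debug_tablefinder({})` on its own returns the finder object without drawing anything.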

🔧 Example code:

    import pdfplumber

    with pdfplumber.open("report.pdf") as pdf:
        page = pdf.pages[0]  # first page
        # Example: a 150 x 50 region near the top-left corner.
        # y is measured from the top of the page, so no page.height arithmetic is needed.
        bbox = (50, 50, 200, 100)  # (x0, top, x1, bottom)
        cropped = page.crop(bbox)
        text = cropped.extract_text()
        print("Text in region:", text or "[no text]")
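A box that extends past the page edge is a common stumbling block: in the pdfplumber versions I am aware of, `page.crop()` is strict by default and raises a `ValueError` for a box that is not fully inside the page, but that behavior is worth checking against your installed version. The sketch below trims a box to the page bounds first; `clamp_bbox` is a hypothetical helper, not a pdfplumber function.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def clamp_bbox(bbox: BBox, page_bbox: BBox) -> BBox:
    """Trim bbox = (x0, top, x1, bottom) so it lies within page_bbox."""
    x0, top, x1, bottom = bbox
    px0, ptop, px1, pbottom = page_bbox
    clamped = (max(x0, px0), max(top, ptop), min(x1, px1), min(bottom, pbottom))
    # If nothing is left after trimming, the box never overlapped the page
    if clamped[0] >= clamped[2] or clamped[1] >= clamped[3]:
        raise ValueError(f"bbox {bbox} does not overlap page {page_bbox}")
    return clamped


# A box that spills over the left and bottom edges of a Letter-sized page
print(clamp_bbox((-10, 700, 300, 900), (0, 0, 612, 792)))  # (0, 700, 300, 792)
```

With a real page this would be used as `page.crop(clamp_bbox(bbox, page.bbox))`.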

💡 Tips:

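If you are unsure of the coordinates, `page.to_image().draw_rect(bbox).save("debug.png")` draws the box on a rendered page so you can check it visually. Rendering needs an image backend: recent pdfplumber releases use `pypdfium2` and `Pillow`, which are installed along with the library, while older releases relied on ImageMagick and Ghostscript. `opencv-python` is not required;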
Calls can be chained: `page.crop(...).extract_text()`;
`crop()` also works with `extract_tables()`, `extract_words()`, and similar methods, which can improve both accuracy and speed.
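Another way to find coordinates is to locate a known word with `extract_words()` and build a box around it. Each word is a dict with `text`, `x0`, `x1`, `top`, and `bottom` keys. The sketch below assumes an exact text match and a small padding value; `bbox_around_word` is a hypothetical helper, and the sample data is made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

BBox = Tuple[float, float, float, float]


def bbox_around_word(words: List[Dict], keyword: str,
                     pad: float = 2.0) -> Optional[BBox]:
    """Return a padded (x0, top, x1, bottom) box around the first exact match."""
    for word in words:
        if word["text"] == keyword:
            return (word["x0"] - pad, word["top"] - pad,
                    word["x1"] + pad, word["bottom"] + pad)
    return None  # keyword not found on this page


# Made-up sample shaped like the output of page.extract_words()
words = [
    {"text": "Invoice", "x0": 50.0, "x1": 92.0, "top": 40.0, "bottom": 52.0},
    {"text": "Total", "x0": 400.0, "x1": 430.0, "top": 700.0, "bottom": 712.0},
]
print(bbox_around_word(words, "Total"))  # (398.0, 698.0, 432.0, 714.0)
```

The returned box can be passed to `page.crop()`; for words close to the page edge, the padding may push the box outside the page, so it may need trimming first.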