word文档解析python_word图片和文字文混排内容怎么用python读取写入

㈠ python读取word每一行

Python学习笔记(28) - Python读取word文本 - 程序员大阳的博客...

1. 简介 Python可以利用python-docx模块处理word文档,处理方式是面向对象的。也就是说python-docx模块会把word文档,文档中的段落、文本、字体等都看做对象,
2. 相关概念如果需要读取

㈡求助大神：如何用Python docx解析一个Word文档，在某些字段处插入文本或表格，更换页眉页脚等急~

from docx import Document
from docx.shared import Inches

document = Document()

document.add_heading('Document Title', 0)

p = document.add_paragraph('A plain paragraph having some ')
p.add_run('bold').bold = True
p.add_run(' and some ')
p.add_run('italic.').italic = True

document.add_heading('Heading, level 1', level=1)
document.add_paragraph('Intense quote', style='IntenseQuote')

document.add_paragraph(
'first item in unordered list', style='ListBullet'
)
document.add_paragraph(
'first item in ordered list', style='ListNumber'
)

document.add_picture('monty-truth.png', width=Inches(1.25))

table = document.add_table(rows=1, cols=3)
hdr_cells = table.rows[0].cells
hdr_cells[0].text = 'Qty'
hdr_cells[1].text = 'Id'
hdr_cells[2].text = 'Desc'
for item in recordset:
row_cells = table.add_row().cells
row_cells[0].text = str(item.qty)
row_cells[1].text = str(item.id)
row_cells[2].text = item.desc

document.add_page_break()

document.save('demo.docx')
这是一个demo for docx 你可以试试

㈢ word图片和文字文混排内容怎么用python读取写入

Python可以利用python-docx模块处理word文档，处理方式是面向对象的。也就是说python-docx模块会把word文档，文档中的段落、文本、字体等都看做对象，对对象进行处理就是对word文档的内容处理。

二，相关概念
如果需要读取word文档中的文字（一般来说，程序也只需要认识word文档中的文字信息），需要先了解python-docx模块的几个概念。

1，Document对象，表示一个word文档。
2，Paragraph对象，表示word文档中的一个段落
3，Paragraph对象的text属性，表示段落中的文本内容。
三，模块的安装和导入
需要注意，python-docx模块安装需要在cmd命令行中输入pip install python-docx，如下图表示安装成功（最后那句英文Successfully installed，成功地安装完成，十分考验英文水平。）

注意在导入模块时，用的是import docx。

也真是奇了怪了，怎么安装和导入模块时，很多都不用一个名字，看来是很有必要出一个python版本的模块管理程序python-maven了，本段纯属PS。

四，读取word文本
在了解了上面的信息之后，就很简单了，下面先创建一个D:\temp\word.docx文件，并在其中输入如下内容。

然后写一段程序，代码及输出结果如下：

#读取docx中的文本代码示例
import docx
#获取文档对象
file=docx.Document("D:\\temp\\word.docx")
print("段落数:"+str(len(file.paragraphs)))#段落数为13，每个回车隔离一段

#输出每一段的内容
for para in file.paragraphs:
print(para.text)

#输出段落编号及段落内容
for i in range(len(file.paragraphs)):
print("第"+str(i)+"段的内容是："+file.paragraphs[i].text)
运行结果：

================ RESTART: F:/360data/重要数据/桌面/学习笔记/readWord.py ================
段落数:13
啊

我看见一座山

雄伟的大山

真高啊

啊

这座山是！

真的很高！
第0段的内容是：啊
第1段的内容是：
第2段的内容是：我看见一座山
第3段的内容是：
第4段的内容是：雄伟的大山
第5段的内容是：
第6段的内容是：真高啊
第7段的内容是：
第8段的内容是：啊
第9段的内容是：
第10段的内容是：这座山是！
第11段的内容是：
第12段的内容是：真的很高！
>>>
总结
以上就是本文关于Python读取word文本操作详解的全部内容，希望对大家有所帮助。感兴趣的朋友可以继续参阅本站其他相关专题，如有不足之处，欢迎留言指出。感谢朋友们对本站的支持！

㈣如何在 Linux 上使用 Python 读取 word 文件信息

第一步：获取doc文件的xml组成文件

import zipfiledef get_word_xml(docx_filename):
with open(docx_filename) as f:
zip = zipfile.ZipFile(f)
xml_content = zip.read('word/document.xml')
return xml_content

第二步：解析xml为树形数据结构
from lxml import etreedef get_xml_tree(xml_string):
return etree.fromstring(xml_string)

第三步：读取word内容：
def _itertext(self, my_etree):
"""Iterator to go through xml tree's text nodes"""
for node in my_etree.iter(tag=etree.Element):
if self._check_element_is(node, 't'):
yield (node, node.text)def _check_element_is(self, element, type_char):
word_schema = '99999'
return element.tag == '{%s}%s' % (word_schema,type_char)

㈤如何在 Linux 上使用 Python 读取 word 文件信息

首先下载安装win32com
from win32com import client as wc
word = wc.Dispatch('Word.Application')
doc = word.Documents.Open('c:/test')
doc.SaveAs('c:/test.text', 2)
doc.Close()
word.Quit()

这种方式产生的text文档，不能用python用普通的r方式读取，为了让python可以用r方式读取，应当写成

doc.SaveAs('c:/test', 4)

注意：系统执行完成后，会自动产生文件后缀txt（虽然没有指明后缀）。
在xp系统下面，应当
open(r'c:\text','r')
wdFormatDocument = 0
wdFormatDocument97 = 0
wdFormatDocumentDefault = 16
wdFormatDOSText = 4
wdFormatDOSTextLineBreaks = 5
wdFormatEncodedText = 7
wdFormatFilteredHTML = 10
wdFormatFlatXML = 19
wdFormatFlatXMLMacroEnabled = 20
wdFormatFlatXMLTemplate = 21
= 22
wdFormatHTML = 8
wdFormatPDF = 17
wdFormatRTF = 6
wdFormatTemplate = 1
wdFormatTemplate97 = 1
wdFormatText = 2
wdFormatTextLineBreaks = 3
wdFormatUnicodeText = 7
wdFormatWebArchive = 9
wdFormatXML = 11
wdFormatXMLDocument = 12
= 13
wdFormatXMLTemplate = 14
= 15
wdFormatXPS = 18

照着字面意思应该能对应到相应的文件格式，如果你是office 2003可能支持不了这么多格式。word文件转html有两种格式可选wdFormatHTML、wdFormatFilteredHTML（对应数字 8、10），区别是如果是wdFormatHTML格式的话，word文件里面的公式等ole对象将会存储成wmf格式，而选用 wdFormatFilteredHTML的话公式图片将存储为gif格式，而且目测可以看出用wdFormatFilteredHTML生成的HTML 明显比wdFormatHTML要干净许多。
当然你也可以用任意一种语言通过com来调用office API，比如PHP.
from win32com import client as wc
word = wc.Dispatch('Word.Application')
doc = word.Documents.Open(r'c:/test1.doc')
doc.SaveAs('c:/test1.text', 4)
doc.Close()
import re
strings=open(r'c:\test1.text','r').read()
result=re.findall('\(\s*[A-D]\s*\)|\(\xa1*[A-D]\xa1*\)|\（\s*[A-D]\s*\）|\（\xa1*[A-D]\xa1*\）',strings)
chan=re.sub('\(\s*[A-D]\s*\)|\(\xa1*[A-D]\xa1*\)|\（\s*[A-D]\s*\）|\（\xa1*[A-D]\xa1*\）','()',strings)
question=open(r'c:\question','a+')
question.write(chan)
question.close()
answer=open(r'c:\answeronly','a+')
for i,a in enumerate(result):
m=re.search('[A-D]',a)
answer.write(str(i+1)+' '+m.group()+'\n')
answer.close()
chan=re.sub(r'\xa3\xa8\s*[A-D]\s*\xa3\xa9','()',strings)
#不要()，容易引起歧义。

㈥如何用python读取word

使用Python的内部方法open()读取文本文件

try:
f=open('/file','r')
print(f.read())
finally:
iff:
f.close()

如果读取word文档推荐使用第三方插件，python-docx 可以在官网上下载

使用方式

#-*-coding:cp936-*-
importdocx
document=docx.Document(文件路径)
docText='

'.join([
paragraph.text.encode('utf-8')forparagraphindocument.paragraphs
])
printdocText

㈦ Python批量读取加密Word文档转存txt文本实现

# -*- coding:utf-8 -*-

from win32com import client as wc

import os

key = '文档密码'

def Translate(input, output):

# 转换

wordapp = wc.Dispatch('Word.Application')

try:

doc = wordapp.Documents.Open(input, False, False, False,key)

doc.SaveAs(FileName=output, FileFormat=4, Encoding="gb2312")

doc.Close()

print(input, "完成")

os.remove(input)

# 为了让python可以在后续操作中r方式读取txt和不产生乱码，参数为4

except:

print(input,"密码错误")

if __name__ == '__main__':

#docx文档物理路径

path = r"C:Usersdocx"

key = '文档密码'

j=0

for file in os.listdir(path):

if '.doc' in file:

name = file.split(".docx")[0]

#输入文档物理路径

input_file = r"C:Usersdocx"+""+file

#输出文档物理路径

output_file=r"C:Users xt"+""+name+".txt"

Translate(input_file, output_file)

j=j+1

print(j)

else:continue

㈧ python如何获取word文件中某个关键字之后的表格

最好是全部都读取到程序中，在程序中进行判断。

本文实例讲述了Python实现批量读取word中表格信息的方法。分享给大家供大家参考。具体如下：
单位收集了很多word格式的调查表，领导需要收集表单里的信息，我就把所有调查表放一个文件里，写了个python小程序把所需的信息打印出来
#coding:utf-8
import os
import win32com
from win32com.client import Dispatch, constants
from docx import Document
def parse_doc(f):
"""读取doc，返回姓名和行业
"""
doc = w.Documents.Open( FileName = f )
t = doc.Tables[0] # 根据文件中的图表选择信息
name = t.Rows[0].Cells[1].Range.Text
situation = t.Rows[0].Cells[5].Range.Text
people = t.Rows[1].Cells[1].Range.Text
title = t.Rows[1].Cells[3].Range.Text
print name, situation, people,title
doc.Close()
def parse_docx(f):
"""读取docx，返回姓名和行业
"""
d = Document(f)
t = d.tables[0]
name = t.cell(0,1).text
situation = t.cell(0,8).text
people = t.cell(1,2).text
title = t.cell(1,8).text
print name, situation, people,title
if __name__ == "__main__":
w = win32com.client.Dispatch('Word.Application')
# 遍历文件
PATH = "H:\work\\aaa" # windows文件路径
doc_files = os.listdir(PATH)
for doc in doc_files:
if os.path.splitext(doc)[1] == '.docx':
try:
parse_docx(PATH+'\\'+doc)
except Exception as e:
print e
elif os.path.splitext(doc)[1] == '.doc':
try:
parse_doc(PATH+'\\'+doc)
except Exception as e:
print e

希望本文所述对大家的Python程序设计有所帮助。

导航:首页 > 编程语言 > word文档解析python

word文档解析python

与word文档解析python相关的资料