Working with Word from Python: manipulating Word document tables with Python

Ⅰ How to read and write Word documents that mix images and text with Python

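Python can process Word documents with the python-docx module, which takes an object-oriented approach: the document itself and its paragraphs, text, fonts and so on are all treated as objects, and working on those objects is how you work on the document's content.

2. Key concepts
To read the text in a Word document (usually the text is all a program needs to understand), you first need a few python-docx concepts.

1) The Document object represents a Word document.
2) A Paragraph object represents one paragraph in the document.
3) The text attribute of a Paragraph object holds the text of that paragraph.
3. Installing and importing the module
Install python-docx by running pip install python-docx at the cmd command line. The installation has succeeded when the output ends with "Successfully installed" (a real test of one's English).

Note that the module is imported with import docx.

Oddly enough, many modules are installed under one name and imported under another. Perhaps Python needs a module manager of its own, a "python-maven". This paragraph is just an aside.

4. Reading the text of a Word document
With the above in mind, the rest is simple. First create the file D:\temp\word.docx and type some content into it (the lines that appear in the output below).

Then write a short program. The code and its output are as follows: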

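# Example: read the text of a .docx file
import docx

# Open the document
file = docx.Document("D:\\temp\\word.docx")
# 13 paragraphs here: every press of Enter starts a new paragraph
print("Number of paragraphs: " + str(len(file.paragraphs)))

# Print the text of every paragraph
for para in file.paragraphs:
    print(para.text)

# Print each paragraph together with its index
for i, para in enumerate(file.paragraphs):
    print("Paragraph " + str(i) + ": " + para.text)
Output:

================ RESTART: F:/360data/重要数据/桌面/学习笔记/readWord.py ================
Number of paragraphs: 13
Ah

I see a mountain

A majestic mountain

So tall

Ah

This mountain is!

Really tall!
Paragraph 0: Ah
Paragraph 1: 
Paragraph 2: I see a mountain
Paragraph 3: 
Paragraph 4: A majestic mountain
Paragraph 5: 
Paragraph 6: So tall
Paragraph 7: 
Paragraph 8: Ah
Paragraph 9: 
Paragraph 10: This mountain is!
Paragraph 11: 
Paragraph 12: Really tall!
>>>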
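The run above shows that blank lines are counted as paragraphs too. Below is a minimal sketch of skipping them, assuming python-docx is installed. The helper names `non_empty` and `read_paragraphs` are made up for illustration and are not part of python-docx.

```python
def non_empty(texts):
    """Return only the paragraph texts that contain visible characters."""
    return [t for t in texts if t.strip()]


def read_paragraphs(path):
    """Read the text of every paragraph in a .docx file (needs python-docx)."""
    import docx  # imported lazily so non_empty() works without python-docx

    return [p.text for p in docx.Document(path).paragraphs]


# Usage (path taken from the example above):
#     for text in non_empty(read_paragraphs(r"D:\temp\word.docx")):
#         print(text)
```

Keeping the filter separate from the file access makes it easy to reuse on text that comes from another source, such as a COM-based reader.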

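Ⅱ Manipulating Word document tables with Python

>>> app = my.Office.Word.GetInstance()
>>> doc = app.Documents[0]
>>> print(doc.Name)
VBA工具集.doc
>>> doc.Tables.Count
2
>>> table = doc.Tables[1]
>>> table.Cell(1,1).Select()
>>> app.Selection.MoveEnd(Unit=12, Count=4)
4
>>> app.Selection.Cells.Shading.Texture = -10
>>>

1. my.Office.Word.GetInstance() uses win32com to obtain an instance of Word's Application object.

2. The sample Word file used here contains two tables; the second one is the table to be modified.

3. table.Cell(1,1).Select() selects the first cell of that table.

4. app.Selection.MoveEnd extends the selection by 4 more cells to the right; wdCell=12 tells it to move cell by cell.

5. app.Selection.Cells.Shading.Texture = -10 applies the shading; wdTextureDiagonalUp=-10 is the constant for a texture of diagonal lines rising to the right.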
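The five steps above can be wrapped in one function with named constants instead of bare numbers. This is only a sketch: `shade_leading_cells` is a hypothetical helper, and `app` and `table` are assumed to be the Word Application and Table COM objects obtained as in the session above.

```python
WD_CELL = 12                  # wdCell: extend the selection cell by cell
WD_TEXTURE_DIAGONAL_UP = -10  # wdTextureDiagonalUp: lines rising to the right


def shade_leading_cells(app, table, extra_cells=4,
                        texture=WD_TEXTURE_DIAGONAL_UP):
    """Select cell (1, 1) plus `extra_cells` more cells and shade them.

    Returns the number of cells the selection was actually extended by,
    as reported by Selection.MoveEnd.
    """
    table.Cell(1, 1).Select()
    moved = app.Selection.MoveEnd(Unit=WD_CELL, Count=extra_cells)
    app.Selection.Cells.Shading.Texture = texture
    return moved
```

Because the function only talks to the objects passed in, it can be exercised with stand-in objects on a machine without Word.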

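Ⅲ Batch-converting password-protected Word documents to txt files with Python

# -*- coding:utf-8 -*-
import os

from win32com import client as wc

key = 'document password'


def Translate(input, output):
    # Convert one Word document to a text file
    wordapp = wc.Dispatch('Word.Application')
    try:
        # Positional arguments: FileName, ConfirmConversions, ReadOnly,
        # AddToRecentFiles, PasswordDocument
        doc = wordapp.Documents.Open(input, False, False, False, key)
        # FileFormat is 4 so that Python can later read the txt file
        # in 'r' mode without garbled characters
        doc.SaveAs(FileName=output, FileFormat=4, Encoding="gb2312")
        doc.Close()
        print(input, "done")
        os.remove(input)
    except Exception:
        print(input, "failed (wrong password?)")


if __name__ == '__main__':
    # Directory that holds the Word documents
    path = r"C:\Users\docx"
    j = 0
    for file in os.listdir(path):
        if file.lower().endswith(('.doc', '.docx')):
            name = os.path.splitext(file)[0]
            # Path of the input document
            input_file = os.path.join(path, file)
            # Path of the output text file
            output_file = os.path.join(r"C:\Users\txt", name + ".txt")
            Translate(input_file, output_file)
            j = j + 1
            print(j)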
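The path handling in the script above (pick out the Word files, derive the matching .txt name) can be checked separately from Word itself. A minimal sketch, in which `plan_conversions` and the directory names are illustrative assumptions:

```python
from pathlib import PureWindowsPath


def plan_conversions(filenames, src_dir, dst_dir, exts=(".doc", ".docx")):
    """Pair each Word file name with the .txt path it should be saved as."""
    pairs = []
    for name in filenames:
        p = PureWindowsPath(name)
        if p.suffix.lower() in exts:
            src = PureWindowsPath(src_dir) / p.name
            dst = PureWindowsPath(dst_dir) / (p.stem + ".txt")
            pairs.append((str(src), str(dst)))
    return pairs
```

Comparing the real suffix avoids treating a name such as `notes.doc.bak` as a Word file, which a plain substring test for `.doc` would do.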

Ⅳ How to read a Word file with Python

>>> def PrintAllParagraphs(doc):
...     count = doc.Paragraphs.Count
...     for i in range(count - 1, -1, -1):
...         pr = doc.Paragraphs[i].Range
...         print(pr.Text)
...
>>> app = my.Office.Word.GetInstance()
>>> doc = app.Documents[0]
>>> PrintAllParagraphs(doc)

1. What is a field

Field basics

>>>

@staticmethod
def GetInstance():
    u'''Get the Application object of the Word application'''
    import win32com.client
    return win32com.client.Dispatch('Word.Application')

my.Office.Word.GetInstance is implemented as shown above; it is a wrapper around driving Word's COM interface through win32com.
The text of every Paragraph object is accessed through Paragraph.Range.Text.
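Ⅴ How to read Word file information with Python on Linux

Step 1: get the XML part that makes up the .docx file

import zipfile

def get_word_xml(docx_filename):
    # A .docx file is a zip archive, so open it in binary mode
    with open(docx_filename, 'rb') as f:
        archive = zipfile.ZipFile(f)
        xml_content = archive.read('word/document.xml')
    return xml_content

Step 2: parse the XML into a tree structure
from lxml import etree

def get_xml_tree(xml_string):
    return etree.fromstring(xml_string)

Step 3: read the Word content (both functions are methods of a class, hence self):
def _itertext(self, my_etree):
    """Iterator to go through xml tree's text nodes"""
    for node in my_etree.iter(tag=etree.Element):
        if self._check_element_is(node, 't'):
            yield (node, node.text)

def _check_element_is(self, element, type_char):
    # Namespace of the main WordprocessingML document part
    word_schema = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
    return element.tag == '{%s}%s' % (word_schema, type_char)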
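The three steps can be combined into one function. The sketch below uses only the standard library (xml.etree.ElementTree in place of lxml), and the name `docx_paragraphs` is an assumption made for illustration. It reads plain paragraph text only and ignores tables' structure, headers and footers.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the main WordprocessingML part, word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(docx_file):
    """Return the text of every paragraph in a .docx path or file object."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter("{%s}p" % W_NS):
        # The text of one paragraph is spread over one or more <w:t> nodes
        pieces = [t.text or "" for t in p.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(pieces))
    return paragraphs
```

Since nothing here depends on Word or on Windows, this approach works on Linux as the question asks.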

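Ⅵ How to read Word file information with Python on Linux

Note that every program here has #!/usr/bin/env python as its first line, which means we want the Python interpreter to run these scripts. So to make a script executable, run chmod +x your-script.py; you can then run it as ./your-script.py (you will see this form used in this article).
Exploring the platform module
The platform module is part of the standard library and has many functions that let us obtain all kinds of system information. Let's start the Python interpreter and explore some of them, beginning with platform.uname():
>>> import platform
>>> platform.uname()
('Linux', 'fedora.echorand', '3.7.4-204.fc18.x86_64', '#1 SMP Wed Jan 23 16:44:29 UTC 2013', 'x86_64')

If you know the uname command on Linux, you will recognize this function as an interface to that command. On Python 2 it returns a tuple containing the system type (or kernel version), host name, release, version, machine hardware and processor information. You can access individual fields by index, like this:
>>> platform.uname()[0]
'Linux'
On Python 3 the function returns a named tuple:
>>> platform.uname()
uname_result(system='Linux', node='fedora.echorand', release='3.7.4-204.fc18.x86_64', version='#1 SMP Wed Jan 23 16:44:29 UTC 2013', machine='x86_64', processor='x86_64')
Because the result is a named tuple, a field can simply be picked by name instead of remembering its index, like this:
>>> platform.uname().system
'Linux'
The platform module also has direct interfaces to some of the fields above, like this:
>>> platform.system()
'Linux'
>>> platform.release()
'3.7.4-204.fc18.x86_64'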
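The named-tuple fields shown above can be gathered into a plain dictionary, for example for logging. A small sketch for Python 3, where `system_summary` is a hypothetical helper name:

```python
import platform


def system_summary():
    """Collect a few platform facts into a plain dict."""
    info = platform.uname()
    return {
        "system": info.system,
        "node": info.node,
        "release": info.release,
        "machine": info.machine,
        "python": platform.python_version(),
    }
```

The values depend on the machine the code runs on, so no fixed output is shown here.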

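Ⅶ How to read Word file information with Python on Linux

First download and install win32com (part of the pywin32 package). It drives an installed copy of Word through COM, so this approach only works on Windows.
from win32com import client as wc
word = wc.Dispatch('Word.Application')
doc = word.Documents.Open('c:/test')
doc.SaveAs('c:/test.text', 2)
doc.Close()
word.Quit()

A text file produced this way cannot be read by Python in plain 'r' mode. To make it readable in 'r' mode, write this instead:

doc.SaveAs('c:/test', 4)

Note: when the call finishes, the .txt extension is added automatically even though none was specified.
On Windows XP, open the file with
open(r'c:\text','r')
The available file format values are: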
wdFormatDocument = 0
wdFormatDocument97 = 0
wdFormatDocumentDefault = 16
wdFormatDOSText = 4
wdFormatDOSTextLineBreaks = 5
wdFormatEncodedText = 7
wdFormatFilteredHTML = 10
wdFormatFlatXML = 19
wdFormatFlatXMLMacroEnabled = 20
wdFormatFlatXMLTemplate = 21
wdFormatFlatXMLTemplateMacroEnabled = 22
wdFormatHTML = 8
wdFormatPDF = 17
wdFormatRTF = 6
wdFormatTemplate = 1
wdFormatTemplate97 = 1
wdFormatText = 2
wdFormatTextLineBreaks = 3
wdFormatUnicodeText = 7
wdFormatWebArchive = 9
wdFormatXML = 11
wdFormatXMLDocument = 12
wdFormatXMLDocumentMacroEnabled = 13
wdFormatXMLTemplate = 14
wdFormatXMLTemplateMacroEnabled = 15
wdFormatXPS = 18

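The names should make it clear which file format each value corresponds to; Office 2003 may not support all of them. There are two formats to choose from when converting a Word file to HTML, wdFormatHTML and wdFormatFilteredHTML (values 8 and 10). With wdFormatHTML, formulas and other OLE objects in the Word file are stored as WMF images, whereas with wdFormatFilteredHTML the formula images are stored as GIF, and the HTML produced with wdFormatFilteredHTML is visibly much cleaner than that produced with wdFormatHTML.
You can of course also call the Office API through COM from any other language, PHP for example.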
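The numeric format codes listed above are easier to read in code when they have names. A minimal sketch using a standard-library enum; the class below is a hand-written subset for illustration and is not provided by win32com:

```python
from enum import IntEnum


class WdSaveFormat(IntEnum):
    """A few of Word's save-format codes, named instead of bare integers."""
    TEXT = 2            # wdFormatText
    DOS_TEXT = 4        # wdFormatDOSText
    RTF = 6             # wdFormatRTF
    UNICODE_TEXT = 7    # wdFormatUnicodeText
    HTML = 8            # wdFormatHTML
    FILTERED_HTML = 10  # wdFormatFilteredHTML
    XML_DOCUMENT = 12   # wdFormatXMLDocument
    PDF = 17            # wdFormatPDF
```

A call would then look like `doc.SaveAs('c:/test', int(WdSaveFormat.DOS_TEXT))`; converting with `int()` keeps the value a plain integer when it is handed to COM.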
from win32com import client as wc
import re

word = wc.Dispatch('Word.Application')
doc = word.Documents.Open(r'c:/test1.doc')
doc.SaveAs('c:/test1.text', 4)  # 4 = wdFormatDOSText
doc.Close()

# An answer letter A-D inside ASCII or full-width parentheses;
# \xa1 is the byte that the GBK full-width space is made of
pattern = r'\(\s*[A-D]\s*\)|\(\xa1*[A-D]\xa1*\)|\（\s*[A-D]\s*\）|\（\xa1*[A-D]\xa1*\）'
strings = open(r'c:\test1.text', 'r').read()
result = re.findall(pattern, strings)
# Blank out the answers to produce the question sheet
chan = re.sub(pattern, '()', strings)
question = open(r'c:\question', 'a+')
question.write(chan)
question.close()
# Write the numbered answers to a separate file
answer = open(r'c:\answeronly', 'a+')
for i, a in enumerate(result):
    m = re.search('[A-D]', a)
    answer.write(str(i + 1) + ' ' + m.group() + '\n')
answer.close()
# Alternative: match the full-width parentheses by their GBK bytes
chan = re.sub(r'\xa3\xa8\s*[A-D]\s*\xa3\xa9', '()', strings)
# Do not use () here; it easily causes ambiguity.
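On Python 3, where text is Unicode rather than GBK bytes, the same idea needs no byte escapes. The sketch below assumes the exam text has already been read into a `str`. `ANSWER` and `split_answers` are illustrative names.

```python
import re

# An answer letter A-D inside ASCII or full-width parentheses, with optional
# whitespace around it (for str patterns, \s also matches the full-width space)
ANSWER = re.compile(r"[(（]\s*([A-D])\s*[)）]")


def split_answers(text):
    """Return (question_text, answers).

    question_text is `text` with every answer blanked to "()";
    answers is the list of answer letters in order of appearance.
    """
    answers = ANSWER.findall(text)
    return ANSWER.sub("()", text), answers
```

Using one compiled pattern for both the search and the substitution keeps the two steps from drifting apart.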

Ⅷ Reading the content of Word documents with Python

import fnmatch, os, win32com.client

readpath = r'D:\123'

wordapp = win32com.client.gencache.EnsureDispatch("Word.Application")
try:
    for path, dirs, files in os.walk(readpath):
        for filename in files:
            if not fnmatch.fnmatch(filename, '*.docx'):
                continue
            doc = os.path.abspath(os.path.join(path, filename))
            print('processing %s...' % doc)
            wordapp.Documents.Open(doc)
            # Swap the trailing "docx" for "txt"
            docastext = doc[:-4] + 'txt'
            wordapp.ActiveDocument.SaveAs(docastext, FileFormat=win32com.client.constants.wdFormatText)
            wordapp.ActiveDocument.Close()
finally:
    wordapp.Quit()
print('end')

# The saved text file is decoded as GBK
f = open(r'd:\123\test.txt', 'r', encoding='gbk')
for line in f.readlines():
    print(line)
f.close()
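The directory walk in the script above can be separated from the Word automation, which makes it usable and testable on any platform. A sketch in which `find_docx` is a hypothetical helper:

```python
import fnmatch
import os


def find_docx(root):
    """Yield (docx_path, txt_path) for every *.docx file under `root`."""
    for folder, _dirs, files in os.walk(root):
        for filename in fnmatch.filter(files, "*.docx"):
            source = os.path.abspath(os.path.join(folder, filename))
            # splitext drops the extension whatever its length
            yield source, os.path.splitext(source)[0] + ".txt"
```

Deriving the output name with `os.path.splitext` avoids the fixed `[:-4]` slice, which only works for a four-letter extension.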

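Ⅸ Setting the formatting of a Word document with Python

import docx
doc = docx.Document()
# Level 0 gives the heading the Title style, which is used at the top of the document.
# Levels 1 to 9 are the heading levels: 1 is the main heading, 9 the lowest subheading.
doc.add_heading('Heading 0', 0)
doc.add_heading('Heading 1', 1)
doc.add_heading('Heading 2', 2)
doc.add_heading('Heading 3', 3)
doc.add_heading('Heading 4', 4)
doc.add_heading('Heading 5', 5)
doc.save('example3.docx')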

import docx
from docx.shared import Inches, Pt, RGBColor

doc_ = docx.Document()

# Add content
paragraph = doc_.add_paragraph()
run_ = paragraph.add_run("Python blog")

# Get the font object
font_ = run_.font

# Underline
font_.underline = True

# Bold
font_.bold = True

# Font color
font_.color.rgb = RGBColor(0xFF,0x00,0x00)

# Font size
font_.size = Pt(20)

# Get the paragraph format
paragraph_format = paragraph.paragraph_format

# First-line indent
paragraph_format.first_line_indent = Inches(0.2)

# Spacing after the paragraph, in points
paragraph_format.space_after = Pt(10)

# Spacing before the paragraph, in points
paragraph_format.space_before = Pt(5)

# Add a table
table_ = doc_.add_table(rows=2, cols=2, style="Medium Grid 1 Accent 1")

# Fill in row 1, column 1
table_.cell(0,0).text =""

# Fill in row 1, column 2
table_.cell(0,1).text =""

# Fill in row 2, column 1
table_.cell(1,0).text ="Description"

# Fill in row 2, column 2
table_.cell(1,1).text =""

# Add a picture; the width argument sets its size
doc_.add_picture(r"/usr/load/download/test.png", width=Inches(4.25))

# Save the document
doc_.save('Python--Word content formatting.docx')
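The Pt and Inches values used above are both lengths. python-docx stores lengths internally as integers in English Metric Units (EMU), with 914400 EMU to the inch and 72 points to the inch. The sketch below reproduces that arithmetic with two hypothetical helpers so the relationship between the units is visible. It does not import python-docx.

```python
EMU_PER_INCH = 914400  # English Metric Units in one inch
EMU_PER_PT = 12700     # one point is 1/72 of an inch


def pt_to_emu(points):
    """Convert a length in points to EMU."""
    return int(points * EMU_PER_PT)


def inches_to_emu(inches):
    """Convert a length in inches to EMU."""
    return int(inches * EMU_PER_INCH)
```

So a 20-point font size and a 4.25-inch picture width end up as the same kind of number.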

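Ⅹ How to read Word file information with Python on Linux

It has to be said: unlike Illustrator, InDesign, CorelDRAW, OpenOffice Draw, Inkscape and similar tools, Word paginates by flowing the content, and the file itself does not store the pagination result. Where each page breaks and how many pages there are in the end can only be determined after all the text and graphics have been rendered on the spot.
(In short: a Word file contains only lines of text plus the page size given in the page setup. Each time Word opens the file it lays the text out line by line and starts a new page whenever one is full. The real Word rendering engine certainly behaves in more complicated ways.)
Reading the page count directly from a .doc/.docx file is therefore a false problem in itself. Do not look for a solution in the "read the page count directly" direction: poor coding technique can be corrected, but a wrong approach is fatal.
You need a set of tools that can really render the content of the Word file (and that supports scripting). Only after the whole content has been rendered into viewable graphics can the total page count be known accurately. On Linux LibreOffice can probably do this, and on Windows it is of course Word itself.
Note that Word's pagination result is not guaranteed. Missing fonts, different glyph shapes, different software environments and other causes can all make the same Word file show a different page count on different computers, and servers are no exception. Treat any page count you obtain as a reference only and do not rely on it 100%.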
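As a sketch of the rendering route on Linux, the function below converts a document to PDF with LibreOffice in headless mode. It assumes LibreOffice is installed and that its `soffice` executable is on the PATH. The function names are made up for illustration.

```python
import subprocess


def build_pdf_command(doc_path, out_dir, soffice="soffice"):
    """Build the LibreOffice command line that renders a document to PDF."""
    return [soffice, "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(doc_path)]


def render_to_pdf(doc_path, out_dir, soffice="soffice"):
    """Run the conversion; raises CalledProcessError if LibreOffice fails."""
    subprocess.run(build_pdf_command(doc_path, out_dir, soffice), check=True)
```

The page count can then be read from the resulting PDF with a PDF library of your choice. For the reasons given above, it reflects LibreOffice's layout on that machine and may differ from what Word shows.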
