python协整分析_如何利用python进行数据分析

1. python中时间序列数据的一些处理方式

datetime.timedelta对象代表两个时间之间的时间差，两个date或datetime对象相减就可以返回一个timedelta对象。
利用以下数据进行说明：

如果我们发现时间相关内容的变量为int,float,str等类型，不方便后面的分析，就需要使用该函数转化为常用的时间变量格式：pandas.to_datetime

转换得到的时间单位如下：

如果时间序列格式不统一，pd.to_datetime()的处理方式：

当然，正确的转换是这样的：

第一步：to_datetime()
第二步：astype(datetime64[D]),astype(datetime64[M])

本例中：

order_dt_diff必须是Timedelta(Ɔ days 00:00:00')格式，可能是序列使用了diff()
或者pct_change()。

前者往往要通过'/np.timedelta'去掉单位days。后者其实没有单位。

假如我们要统计某共享单车一天内不同时间点的用户使用数据，例如

还有其他维度的提取，年、月、日、周，参见：
Datetime properties

注意：.dt的对象必须为pandas.Series,而不可以是Series中的单个元素

2. 可以让你快速用Python进行数据分析的10个小技巧

一些小提示和小技巧可能是非常有用的，特别是在编程领域。有时候使用一点点黑客技术，既可以节省时间，还可能挽救“生命”。

一个小小的快捷方式或附加组件有时真是天赐之物，并且可以成为真正的生产力助推器。所以，这里有一些小提示和小技巧，有些可能是新的，但我相信在下一个数据分析项目中会让你非常方便。

Pandas中数据框数据的Profiling过程

Profiling（分析器）是一个帮助我们理解数据的过程，而Pandas Profiling是一个Python包，它可以简单快速地对Pandas 的数据框数据进行探索性数据分析。

Pandas中df.describe()和df.info()函数可以实现EDA过程第一步。但是，它们只提供了对数据非常基本的概述，对于大型数据集没有太大帮助。而Pandas中的Profiling功能简单通过一行代码就能显示大量信息，且在交互式HTML报告中也是如此。

对于给定的数据集，Pandas中的profiling包计算了以下统计信息：

由Pandas Profiling包计算出的统计信息包括直方图、众数、相关系数、分位数、描述统计量、其他信息——类型、单一变量值、缺失值等。

安装

用pip安装或者用conda安装

pip install pandas-profiling

conda install -c anaconda pandas-profiling

用法

下面代码是用很久以前的泰坦尼克数据集来演示多功能Python分析器的结果。

#importing the necessary packages

import pandas as pd

import pandas_profiling

df = pd.read_csv('titanic/train.csv')

pandas_profiling.ProfileReport(df)

一行代码就能实现在Jupyter Notebook中显示完整的数据分析报告，该报告非常详细，且包含了必要的图表信息。

还可以使用以下代码将报告导出到交互式HTML文件中。

profile = pandas_profiling.ProfileReport(df)

profile.to_file(outputfile="Titanic data profiling.html")

Pandas实现交互式作图

Pandas有一个内置的.plot（）函数作为DataFrame类的一部分。但是，使用此功能呈现的可视化不是交互式的，这使得它没那么吸引人。同样，使用pandas.DataFrame.plot（）函数绘制图表也不能实现交互。如果我们需要在不对代码进行重大修改的情况下用Pandas绘制交互式图表怎么办呢？这个时候就可以用Cufflinks库来实现。

Cufflinks库可以将有强大功能的plotly和拥有灵活性的pandas结合在一起，非常便于绘图。下面就来看在pandas中如何安装和使用Cufflinks库。

安装

pip install plotly

# Plotly is a pre-requisite before installing cufflinks

pip install cufflinks

用法

#importing Pandas

import pandas as pd

#importing plotly and cufflinks in offline mode

import cufflinks as cf

import plotly.offline

cf.go_offline()

cf.set_config_file(offline=False, world_readable=True)

是时候展示泰坦尼克号数据集的魔力了。

df.iplot()

df.iplot() vs df.plot()

右侧的可视化显示了静态图表，而左侧图表是交互式的，更详细，并且所有这些在语法上都没有任何重大更改。

Magic命令

Magic命令是Jupyter notebook中的一组便捷功能，旨在解决标准数据分析中的一些常见问题。使用命令％lsmagic可以看到所有的可用命令。

所有可用的Magic命令列表

Magic命令有两种：行magic命令（line magics），以单个％字符为前缀，在单行输入操作；单元magic命令（cell magics），以双%%字符为前缀，可以在多行输入操作。如果设置为1，则不用键入%即可调用Magic函数。

接下来看一些在常见数据分析任务中可能用到的命令：

% pastebin

％pastebin将代码上传到Pastebin并返回url。Pastebin是一个在线内容托管服务，可以存储纯文本，如源代码片段，然后通过url可以与其他人共享。事实上，Github gist也类似于pastebin，只是有版本控制。

在file.py文件中写一个包含以下内容的python脚本，并试着运行看看结果。

#file.py

def foo(x):

return x

在Jupyter Notebook中使用％pastebin生成一个pastebin url。

%matplotlib notebook

函数用于在Jupyter notebook中呈现静态matplotlib图。用notebook替换inline，可以轻松获得可缩放和可调整大小的绘图。但记得这个函数要在导入matplotlib库之前调用。

%run

用％run函数在notebook中运行一个python脚本试试。

%run file.py

%%writefile

%% writefile是将单元格内容写入文件中。以下代码将脚本写入名为foo.py的文件并保存在当前目录中。

%%latex

%%latex函数将单元格内容以LaTeX形式呈现。此函数对于在单元格中编写数学公式和方程很有用。

查找并解决错误

交互式调试器也是一个神奇的功能，我把它单独定义了一类。如果在运行代码单元时出现异常，请在新行中键入％debug并运行它。这将打开一个交互式调试环境，它能直接定位到发生异常的位置。还可以检查程序中分配的变量值，并在此处执行操作。退出调试器单击q即可。

Printing也有小技巧

如果您想生成美观的数据结构，pprint是首选。它在打印字典数据或JSON数据时特别有用。接下来看一个使用print和pprint来显示输出的示例。

让你的笔记脱颖而出

我们可以在您的Jupyter notebook中使用警示框/注释框来突出显示重要内容或其他需要突出的内容。注释的颜色取决于指定的警报类型。只需在需要突出显示的单元格中添加以下任一代码或所有代码即可。

蓝色警示框：信息提示

Tip: Use blue boxes (alert-info) for tips and notes.

If it’s a note, you don’t have to include the word “Note”.

黄色警示框：警告

Example: Yellow Boxes are generally used to include additional examples or mathematical formulas.

绿色警示框：成功

Use green box only when necessary like to display links to related content.

红色警示框：高危

It is good to avoid red boxes but can be used to alert users to not delete some important part of code etc.

打印单元格所有代码的输出结果

假如有一个Jupyter Notebook的单元格，其中包含以下代码行：

In [1]: 10+5

11+6

Out [1]: 17

单元格的正常属性是只打印最后一个输出，而对于其他输出，我们需要添加print()函数。然而通过在notebook顶部添加以下代码段可以一次打印所有输出。

添加代码后所有的输出结果就会一个接一个地打印出来。

In [1]: 10+5

11+6

12+7

Out [1]: 15

Out [1]: 17

Out [1]: 19

恢复原始设置：

InteractiveShell.ast_node_interactivity = "last_expr"

使用'i'选项运行python脚本

从命令行运行python脚本的典型方法是：python hello.py。但是，如果在运行相同的脚本时添加-i，例如python -i hello.py，就能提供更多优势。接下来看看结果如何。

首先，即使程序结束，python也不会退出解释器。因此，我们可以检查变量的值和程序中定义的函数的正确性。

其次，我们可以轻松地调用python调试器，因为我们仍然在解释器中：

import pdb

pdb.pm()

这能定位异常发生的位置，然后我们可以处理异常代码。

自动评论代码

Ctrl / Cmd + /自动注释单元格中的选定行，再次命中组合将取消注释相同的代码行。

删除容易恢复难

你有没有意外删除过Jupyter notebook中的单元格？如果答案是肯定的，那么可以掌握这个撤消删除操作的快捷方式。

如果您删除了单元格的内容，可以通过按CTRL / CMD + Z轻松恢复它。

如果需要恢复整个已删除的单元格，请按ESC + Z或EDIT>撤消删除单元格。

结论

在本文中，我列出了使用Python和Jupyter notebook时收集的一些小提示。我相信它们会对你有用，能让你有所收获，从而实现轻松编码！

3. python数据分析时间序列如何提取一个月的数据

python做数据分析时下面就是提取一个月数据的教程1. datetime库

1.1 datetime.date
1） datetime.date.today() 返回今日，输出的类型为date类

import datetime
today = datetime.date.today()
print(today)
print(type(today))
–> 输出的结果为：

2020-03-04
<class 'datetime.date'>
将输出的结果转化为常见数据类型（字符串）

print(str(today))
print(type(str(today)))
date = str(today).split('-')
year,month,day = date[0],date[1],date[2]
print('今日的年份是{}年,月份是{}月,日子是{}号'.format(year,month,day))
–> 输出的结果为：(转化为字符串之后就可以直接进行操作)

2020-03-04
<class 'str'>
今日的年份是2020年,月份是03月,日子是04号
2） datetime.date(年,月,日)，获取当前的日期

date = datetime.date(2020,2,29)
print(date)
print(type(date))
–> 输出的结果为：

2020-02-29
<class 'datetime.date'>
1.2 芹喊datetime.datetime
1） datetime.datetime.now()输出当前时间，datetime类

now = datetime.datetime.now()
print(now)
print(type(now))
–> 输出的结果为：(注意秒后面有个不确定尾数)

2020-03-04 09:02:28.280783
<class 'datetime.datetime'>
可通过str()转化为字符串（和上面类似）

print(str(now))
print(type(str(now)))
–> 输出的结果为：（这里也可以跟上面的处理类似分别获得相应的数据，但是也可以使用下面更直接的方法来获取）

2020-03-04 09:04:32.271075
<class 'str'>
2）通过自带的方法获取年月日，时分秒（这里返回的是int整型数据，注意区别）

now = datetime.datetime.now()
print(now.year,type(now.year))
print(now.month,type(now.month))
print(now.day,type(now.day))
print(now.hour,type(now.hour))
print(now.minute,type(now.minute))
print(now.second,type(now.second))
print(now.date(),type(now.date()))
print(now.date().year,type(now.date().year))
–> 输出的结果为：（首先注意输出中倒数第二个还是上面的纯档datetime.date对象，这里是用来做时间对比的，同时除了这里的datetime.datetime有这种方法，datetime.date对象也有。因为此方法获取second是取的整型数据，自然最后的不确定尾数就被取整处理掉了）

2020 <class 'int'>
3 <class 'int'>
4 <class 'int'>
9 <class 'int'>
12 <class '做首乱int'>
55 <class 'int'>
2020-03-04 <class 'datetime.date'>
2020 <class 'int'>

4. 如何利用python进行数据分析

作者Wes McKinney是pandas库的主要作者，所以本书也可以作为利用Python实现数据密集型应用的科学计算实践指南。本书适合刚刚接触Python的分析人员以及刚刚接触科学计算的Python程序员。
•将IPython这个交互式Shell作为你的首要开发环境。
•学习NumPy（Numerical Python）的基础和高级知识。
•从pandas库的数据分析工具开始。
•利用高性能工具对数据进行加载、清理、转换、合并以及重塑。
•利用matplotlib创建散点图以及静态或交互式的可视化结果。
•利用pandas的groupby功能对数据集进行切片、切块和汇总操作。
•处理各种各样的时间序列数据。
•通过详细的案例学习如何解决Web分析、社会科学、金融学以及经•济学等领域的问题。

5. 如何使用python做统计分析

Shape Parameters
形态参数
While a general continuous random variable can be shifted and scaled
with the loc and scale parameters, some distributions require additional
shape parameters. For instance, the gamma distribution, with density
γ(x,a)=λ(λx)a−1Γ(a)e−λx,
requires the shape parameter a. Observe that setting λ can be obtained by setting the scale keyword to 1/λ.
虽然一个一般的连续随机变量可以被位移和伸缩通过loc和scale参数，但一些分布还需要额外的形态参数。作为例子，看到这个伽马分布，这是它的密度函数
γ(x,a)=λ(λx)a−1Γ(a)e−λx,
要求一个形态参数a。注意到λ的设置可以通过设置scale关键字为1/λ进行。
Let’s check the number and name of the shape parameters of the gamma
distribution. (We know from the above that this should be 1.)
让我们检查伽马分布的形态参数的名字的数量。（我们知道从上面知道其应该为1）
>>>
>>> from scipy.stats import gamma
>>> gamma.numargs
1
>>> gamma.shapes
'a'

Now we set the value of the shape variable to 1 to obtain the
exponential distribution, so that we compare easily whether we get the
results we expect.
现在我们设置形态变量的值为1以变成指数分布。所以我们可以容易的比较是否得到了我们所期望的结果。
>>>
>>> gamma(1, scale=2.).stats(moments="mv")
(array(2.0), array(4.0))

Notice that we can also specify shape parameters as keywords:
注意我们也可以以关键字的方式指定形态参数：
>>>
>>> gamma(a=1, scale=2.).stats(moments="mv")
(array(2.0), array(4.0))

Freezing a Distribution
冻结分布
Passing the loc and scale keywords time and again can become quite
bothersome. The concept of freezing a RV is used to solve such problems.
不断地传递loc与scale关键字最终会让人厌烦。而冻结RV的概念被用来解决这个问题。
>>>
>>> rv = gamma(1, scale=2.)

By using rv we no longer have to include the scale or the shape
parameters anymore. Thus, distributions can be used in one of two ways,
either by passing all distribution parameters to each method call (such
as we did earlier) or by freezing the parameters for the instance of the
distribution. Let us check this:
通过使用rv我们不用再更多的包含scale与形态参数在任何情况下。显然，分布可以被多种方式使用，我们可以通过传递所有分布参数给对方法的每次调用（像我们之前做的那样）或者可以对一个分布对象冻结参数。让我们看看是怎么回事：
>>>
>>> rv.mean(), rv.std()
(2.0, 2.0)

This is indeed what we should get.
这正是我们应该得到的。
Broadcasting
广播
The basic methods pdf and so on satisfy the usual numpy broadcasting
rules. For example, we can calculate the critical values for the upper
tail of the t distribution for different probabilites and degrees of
freedom.
像pdf这样的简单方法满足numpy的广播规则。作为例子，我们可以计算t分布的右尾分布的临界值对于不同的概率值以及自由度。
>>>
>>> stats.t.isf([0.1, 0.05, 0.01], [[10], [11]])
array([[ 1.37218364, 1.81246112, 2.76376946],
[ 1.36343032, 1.79588482, 2.71807918]])

Here, the first row are the critical values for 10 degrees of freedom
and the second row for 11 degrees of freedom (d.o.f.). Thus, the
broadcasting rules give the same result of calling isf twice:
这里，第一行是以10自由度的临界值，而第二行是以11为自由度的临界值。所以，广播规则与下面调用了两次isf产生的结果相同。
>>>
>>> stats.t.isf([0.1, 0.05, 0.01], 10)
array([ 1.37218364, 1.81246112, 2.76376946])
>>> stats.t.isf([0.1, 0.05, 0.01], 11)
array([ 1.36343032, 1.79588482, 2.71807918])

If the array with probabilities, i.e, [0.1, 0.05, 0.01] and the array of
degrees of freedom i.e., [10, 11, 12], have the same array shape, then
element wise matching is used. As an example, we can obtain the 10% tail
for 10 d.o.f., the 5% tail for 11 d.o.f. and the 1% tail for 12 d.o.f.
by calling
但是如果概率数组，如[0.1,0.05,0.01]与自由度数组,如[10,11,12]具有相同的数组形态，则元素对应捕捉被作用，我们可以分别得到10%，5%，1%尾的临界值对于10，11,12的自由度。
>>>
>>> stats.t.isf([0.1, 0.05, 0.01], [10, 11, 12])
array([ 1.37218364, 1.79588482, 2.68099799])

Specific Points for Discrete Distributions
离散分布的特殊之处
Discrete distribution have mostly the same basic methods as the
continuous distributions. However pdf is replaced the probability mass
function pmf, no estimation methods, such as fit, are available, and
scale is not a valid keyword parameter. The location parameter, keyword
loc can still be used to shift the distribution.
离散分布的简单方法大多数与连续分布很类似。当然像pdf被更换为密度函数pmf，没有估计方法，像fit是可用的。而scale不是一个合法的关键字参数。Location参数，关键字loc则仍然可以使用用于位移。
The computation of the cdf requires some extra attention. In the case of
continuous distribution the cumulative distribution function is in most
standard cases strictly monotonic increasing in the bounds (a,b) and
has therefore a unique inverse. The cdf of a discrete distribution,
however, is a step function, hence the inverse cdf, i.e., the percent
point function, requires a different definition:
ppf(q) = min{x : cdf(x) >= q, x integer}

Cdf的计算要求一些额外的关注。在连续分布的情况下，累积分布函数在大多数标准情况下是严格递增的，所以有唯一的逆。而cdf在离散分布，无论如何，是阶跃函数，所以cdf的逆，分位点函数，要求一个不同的定义：
ppf(q) = min{x : cdf(x) >= q, x integer}
For further info, see the docs here.
为了更多信息可以看这里。
We can look at the hypergeometric distribution as an example
>>>
>>> from scipy.stats import hypergeom
>>> [M, n, N] = [20, 7, 12]

我们可以看这个超几何分布的例子
>>>
>>> from scipy.stats import hypergeom
>>> [M, n, N] = [20, 7, 12]

If we use the cdf at some integer points and then evaluate the ppf at
those cdf values, we get the initial integers back, for example
如果我们使用在一些整数点使用cdf，它们的cdf值再作用ppf会回到开始的值。
>>>
>>> x = np.arange(4)*2
>>> x
array([0, 2, 4, 6])
>>> prb = hypergeom.cdf(x, M, n, N)
>>> prb
array([ 0.0001031991744066, 0.0521155830753351, 0.6083591331269301,
0.9897832817337386])
>>> hypergeom.ppf(prb, M, n, N)
array([ 0., 2., 4., 6.])

If we use values that are not at the kinks of the cdf step function, we get the next higher integer back:
如果我们使用的值不是cdf的函数值，则我们得到一个更高的值。
>>>
>>> hypergeom.ppf(prb + 1e-8, M, n, N)
array([ 1., 3., 5., 7.])
>>> hypergeom.ppf(prb - 1e-8, M, n, N)
array([ 0., 2., 4., 6.])

6. python 时间序列分析收敛性问题

Python与R相比速度要快。Python可以直接处理上G的数据；R不行，R分析数据时需要先通过数据库把大数据转化为小数据（通过groupby）才能交给R做分析，因此R不可能直接分析行为详单，只能分析统计结果。所以有人说：Python=R+SQL/Hive，并不是没有道理的。

导航:首页 > 编程语言 > python协整分析

python协整分析

与python协整分析相关的资料