FormatImporter

1. FormatImporter Overview

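FormatImporter imports external data in common formats into Sensors Analytics. It currently supports CSV files, nginx logs, data stored in MySQL, data stored in Oracle, and JSON logs that follow the Sensors Analytics data format.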

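Download the script with the commands below. The download is a tgz archive, and running the script requires Python 3.4 or later. If you use the MySQL or Oracle importers, make sure the corresponding client packages are installed on the machine.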

wget 'http://update.sensorsdata.cn:8021/release/format_importer/format_importer.tgz'
tar xzf format_importer.tgz

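2. Typical Usage

2.1 Getting the Data Receiving URL

First, get the data receiving URL and, for the Cloud edition, the token from the Sensors Analytics home page.

If you use the Sensors Analytics Cloud service, you need the following:

Data receiving URL without a port number (recommended): http://{$service_name}.datasink.sensorsdata.cn/sa?project={$project_name}&token={$project_token}
Data receiving URL with a port number: http://{$service_name}.cloud.sensorsdata.cn:8106/sa?project={$project_name}&token={$project_token}

If you use a standalone private deployment of Sensors Analytics, the default is:

Data receiving URL: http://{$host_name}:8106/sa?project={$project_name}
(Note: in Sensors Analytics 1.7 and earlier, the default port for a standalone private deployment is 8006.)

If you use a cluster private deployment of Sensors Analytics, the default is:

Data receiving URL: http://{$host_name}:8106/sa?project={$project_name}

Here {$host_name} can be any machine in the cluster.

If the default Nginx configuration was changed during the private deployment, or Sensors Analytics is accessed through a CDN or similar, ask the people responsible for the deployment for the correct settings.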
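The URL patterns above can be assembled programmatically. The following is a minimal sketch, not part of FormatImporter; the function name `build_receiver_url` and the sample host and token values are hypothetical and only illustrate the pattern.

```python
from urllib.parse import urlencode


def build_receiver_url(host, project, token=None, port=8106):
    """Assemble a data receiving URL following the patterns in this section.

    Pass port=None for the port-less form recommended for the Cloud service.
    """
    netloc = host if port is None else "{}:{}".format(host, port)
    params = {"project": project}
    if token is not None:
        params["token"] = token  # only the Cloud edition uses a token
    return "http://{}/sa?{}".format(netloc, urlencode(params))


# Private deployment on the default port.
print(build_receiver_url("localhost", "default"))
# Cloud-style URL; "demo" and "abc" are placeholder values.
print(build_receiver_url("demo.datasink.sensorsdata.cn", "default", token="abc", port=None))
```

For versions 1.7 and earlier of a standalone deployment, pass `port=8006`, as noted above.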

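2.2 Example Scenario

Suppose an e-commerce website needs to collect the following user events:

View events, including the name (item_name) and ID (item_id) of the viewed item
Purchase events, including the name (item_name) and ID (item_id) of the purchased item

Sample events:

| User | Event | Time | Item ID | Item name |
| --- | --- | --- | --- | --- |
| bug29 | View | 2018-05-12 13:01:11 | 13245 | 男士护耳保暖鸭舌皮帽平顶八角帽头层牛皮帽子时尚休闲 |
| bug29 | Purchase | 2018-05-12 13:05:03 | 13245 | 男士护耳保暖鸭舌皮帽平顶八角帽头层牛皮帽子时尚休闲 |
| 小武 | View | 2018-05-13 10:20:32 | 23421 | New Order Technique 2CD豪华版欧版行货全新未拆 |
| 菠菜 | View | 2018-05-13 20:42:53 | 3442 | NUK安抚奶嘴宝宝防胀气安慰奶嘴乳胶迪士尼安睡型 |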

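It also collects the following user properties:

The user's gender (gender), male or female
Whether the user is a member (is_member)
The user's membership points (score)

Sample user properties:

| User name | Gender | Member | Points |
| --- | --- | --- | --- |
| bug29 | Male | Yes | 131 |
| 小武 | Female | No | (no points) |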

2.3 Importing CSV Data

2.3.1 Importing Events

Suppose the following CSV file describes the sample user events above (see examples/events.csv in the code package).

user_id,action,time,item_id,item_name,item_cate
bug29,view,2018-05-12 13:01:11,13245,男士护耳保暖鸭舌皮帽平顶八角帽头层牛皮帽子时尚休闲,男装
bug29,buy,2018-05-12 13:05:03,13245,男士护耳保暖鸭舌皮帽平顶八角帽头层牛皮帽子时尚休闲,男装
小武,view,2018-05-13 10:20:32,23421,New Order Technique 2CD豪华版 欧版行货 全新未拆,音像
菠菜,view,2018-05-13 20:42:53,3442,NUK安抚奶嘴宝宝防胀气安慰奶嘴乳胶迪士尼安睡型,母婴

Import this data into a local private deployment, using user_id as the user ID, time as the event time, and action as the event name, and importing only item_id and item_name as event properties.

python3 format_importer.py csv_event \
  --url 'http://localhost:8106/sa' \
  --distinct_id_from 'user_id' \
  --timestamp_from 'time' \
  --event_from 'action' \
  --filename './examples/events.csv' \
  --property_list 'item_id,item_name' \
  --debug # remove --debug for the real import

You can also put these parameters in a configuration file (see conf/csv_event.conf in the code package).

url: http://localhost:8106/sa
distinct_id_from: user_id
event_from: action
timestamp_from: time
filename: ./examples/events.csv
property_list: item_id,item_name
debug

Then run:

python3 format_importer.py csv_event @./conf/csv_event.conf

For a detailed explanation of the parameters, see 4.5 Other Parameters for Importing CSV Data.
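To make the mapping concrete, the sketch below shows how one CSV row could be turned into a track record shaped like the JSON samples in section 2.6. It is an illustration under stated assumptions, not the tool's actual code: the function `row_to_event` is hypothetical, timestamps are assumed to be in UTC+8 (which reproduces the millisecond values in the 2.6 sample), and property values are left as strings, whereas FormatImporter also performs automatic type detection.

```python
import csv
import io
import json
from datetime import datetime, timedelta, timezone

# Assumption: CSV timestamps are in UTC+8; this matches the "time" values
# shown in the JSON sample of section 2.6.
CST = timezone(timedelta(hours=8))


def row_to_event(row, distinct_id_from, timestamp_from, event_from, property_list):
    """Build one track record from a csv.DictReader row (hypothetical helper)."""
    when = datetime.strptime(row[timestamp_from], "%Y-%m-%d %H:%M:%S").replace(tzinfo=CST)
    return {
        "type": "track",
        "time": int(when.timestamp() * 1000),  # epoch milliseconds
        "distinct_id": row[distinct_id_from],
        "event": row[event_from],
        # Only the selected columns become properties; values stay strings here.
        "properties": {name: row[name] for name in property_list},
    }


sample = io.StringIO(
    "user_id,action,time,item_id,item_name,item_cate\n"
    "bug29,view,2018-05-12 13:01:11,13245,男士护耳保暖鸭舌皮帽平顶八角帽头层牛皮帽子时尚休闲,男装\n"
)
records = [
    row_to_event(row, "user_id", "time", "action", ["item_id", "item_name"])
    for row in csv.DictReader(sample)
]
print(json.dumps(records[0], ensure_ascii=False))
```

The item_cate column is dropped because it is not listed in the property list, mirroring the effect of --property_list 'item_id,item_name'.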

2.3.2 Importing User Properties

Suppose the following CSV file describes the sample user properties above (see examples/profiles.csv in the code package).

user_id,gender,is_member,score
bug29,男,true,131
小武,女,false,

Import this data into a local private deployment, using user_id as the user ID.

python3 format_importer.py csv_profile \
  --url 'http://localhost:8106/sa' \
  --distinct_id_from 'user_id' \
  --filename './examples/profiles.csv' \
  --debug # remove --debug for the real import

You can also put these parameters in a configuration file (see conf/csv_profile.conf in the code package).

url: http://localhost:8106/sa
distinct_id_from: user_id
filename: ./examples/profiles.csv
debug

Then run:

python3 format_importer.py csv_profile @./conf/csv_profile.conf

For a detailed explanation of the parameters, see 4.5 Other Parameters for Importing CSV Data.

2.4 Importing nginx Logs

2.4.1 Importing Events

Suppose the following nginx log describes the sample user events above (see examples/events.log in the code package).

123.4.5.6 - [12/May/2018:13:01:11 +0800] "GET /item?id=13245&action=view&cate=%e7%94%b7%e8%a3%85" 200 1127 "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36" "http://fake_web.com/login.html" "男士护耳保暖鸭舌皮帽平顶八角帽头层牛皮帽子时尚休闲" "bug29"
123.4.5.6 - [12/May/2018:13:05:03 +0800] "GET /item?id=13245&action=buy&cate=%e7%94%b7%e8%a3%85" 200 1127 "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36" "http://fake_web.com/login.html" "男士护耳保暖鸭舌皮帽平顶八角帽头层牛皮帽子时尚休闲" "bug29"
123.4.5.7 - [13/May/2018:10:20:32 +0800] "GET /item?id=23421&action=view&cate=%e9%9f%b3%e5%83%8f" 200 1127 "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36" "http://www.baidu.com?q=abc" "New Order Technique 2CD豪华版 欧版行货 全新未拆" "小武"
123.8.5.7 - [13/May/2018:20:42:53 +0800] "GET /item?id=&action=view&cate=%e6%af%8d%e5%a9%b4" 200 1127 "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36" "http://www.baidu.com?q=abc" "NUK安抚奶嘴宝宝防胀气安慰奶嘴乳胶迪士尼安睡型" "菠菜"

The corresponding nginx configuration is:

log_format compression '$remote_addr [$time_local] "$request" $status $bytes_sent "$http_user_agent" "$http_referer" "$title" "$user_id"';
access_log /data/nginx_log/access.log compression;

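Import this data into a local private deployment, using $user_id as the user ID and $time_local as the event time. The value of the action parameter parsed from $request is the event name. Only two event properties are imported: the id parameter parsed from $request as item_id, and the custom variable $title as item_name.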

python3 format_importer.py nginx_event \
  --url 'http://localhost:8106/sa' \
  --distinct_id_from 'user_id' \
  --timestamp_from 'time_local' \
  --timestamp_format '%d/%b/%Y:%H:%M:%S %z' \
  --event_from '__request_param_action' \
  --filename './examples/events.log' \
  --log_format '$remote_addr [$time_local] "$request" $status $bytes_sent "$http_user_agent" "$http_referer" "$title" "$user_id"' \
  --property_list '__request_param_id,title' \
  --property_list_cnames 'item_id,item_name' \
  --debug # remove --debug for the real import

You can also put these parameters in a configuration file (see conf/nginx_event.conf in the code package).

url: http://localhost:8106/sa
distinct_id_from: user_id
event_from: __request_param_action
timestamp_from: time_local
timestamp_format: %d/%b/%Y:%H:%M:%S %z
filename: ./examples/events.log
log_format: $remote_addr [$time_local] "$request" $status $bytes_sent "$http_user_agent" "$http_referer" "$title" "$user_id"
property_list: __request_param_id,title
property_list_cnames: item_id,item_name
debug

Then run:

python3 format_importer.py nginx_event @./conf/nginx_event.conf

For a detailed explanation of the parameters, see 4.6 Other Parameters for Importing nginx Logs.
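One way to picture what happens to each log line is to turn the log_format string into a regular expression and then split the query string of $request into `__request_param_<name>` fields. The sketch below is an assumption about the general technique, not FormatImporter's implementation; both helper functions are hypothetical, and the format and log line are shortened versions written to match each other exactly.

```python
import re
from urllib.parse import parse_qsl, urlsplit


def log_format_to_regex(log_format):
    """Turn an nginx log_format string into a regex with one named group per variable."""
    pattern = ""
    pos = 0
    for match in re.finditer(r"\$(\w+)", log_format):
        pattern += re.escape(log_format[pos:match.start()])
        following = log_format[match.end():match.end() + 1]
        if following == '"':
            body = r'[^"]*'      # quoted variable: read up to the closing quote
        elif following == "]":
            body = r"[^\]]*"     # bracketed variable such as [$time_local]
        else:
            body = r"\S*"        # bare variable: read up to the next whitespace
        pattern += "(?P<{}>{})".format(match.group(1), body)
        pos = match.end()
    pattern += re.escape(log_format[pos:])
    return re.compile("^" + pattern + "$")


def expand_request(fields):
    """Add __request_param_<name> entries parsed from the $request field."""
    parts = fields["request"].split(" ")
    target = parts[1] if len(parts) > 1 else ""
    expanded = dict(fields)
    for key, value in parse_qsl(urlsplit(target).query, keep_blank_values=True):
        expanded["__request_param_" + key] = value
    return expanded


# Shortened, hypothetical format and a matching line.
LOG_FORMAT = '$remote_addr [$time_local] "$request" $status "$title" "$user_id"'
LINE = '123.4.5.6 [12/May/2018:13:01:11 +0800] "GET /item?id=13245&action=view" 200 "hat" "bug29"'

fields = expand_request(log_format_to_regex(LOG_FORMAT).match(LINE).groupdict())
print(fields["user_id"], fields["__request_param_action"], fields["__request_param_id"])
```

With fields named this way, `--event_from '__request_param_action'` and `--property_list '__request_param_id,title'` simply select keys from the parsed line.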

2.4.2 Importing User Properties

Suppose the following nginx log describes the sample user properties above (see examples/profiles.log in the code package).

123.4.5.6 - [12/May/2018:13:01:11 +0800] "POST /profile?user=bug29&is_member=true" 200 1127 "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36" "http://fake_web.com/login.html" "男" "131"
123.4.5.7 - [13/May/2018:10:20:32 +0800] "POST /profile?user=%e5%b0%8f%e6%ad%a6&is_member=false" 200 1127 "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36" "http://www.baidu.com?q=abc" "女" ""

The corresponding nginx configuration is:

log_format compression '$remote_addr [$time_local] "$request" $status $bytes_sent "$http_user_agent" "$http_referer" "$gender" "$score"';
access_log /data/nginx_log/access.log compression;

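Import this data into a local private deployment, using the user parameter parsed from $request as the user ID. Three user properties are imported: the custom variables $gender and $score, and the is_member parameter parsed from $request.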

python3 format_importer.py nginx_profile \
  --url 'http://localhost:8106/sa' \
  --distinct_id_from '__request_param_user' \
  --filename './examples/profiles.log' \
  --log_format '$remote_addr [$time_local] "$request" $status $bytes_sent "$http_user_agent" "$http_referer" "$gender" "$score"' \
  --property_list 'gender,score,__request_param_is_member' \
  --property_list_cnames 'gender,score,is_member' \
  --debug # remove --debug for the real import

You can also put these parameters in a configuration file (see conf/nginx_profile.conf in the code package).

url: http://localhost:8106/sa
distinct_id_from: __request_param_user
filename: ./examples/profiles.log
log_format: $remote_addr [$time_local] "$request" $status $bytes_sent "$http_user_agent" "$http_referer" "$gender" "$score"
property_list: gender,score,__request_param_is_member
property_list_cnames: gender,score,is_member
debug

Then run:

python3 format_importer.py nginx_profile @./conf/nginx_profile.conf

For a detailed explanation of the parameters, see 4.6 Other Parameters for Importing nginx Logs.

2.5 Importing MySQL Data

Importing from MySQL requires an extra library. Run the command below to install PyMySQL.

python3 -m pip install PyMySQL --upgrade

2.5.1 Importing Events

Suppose the following MySQL table describes the sample user events above (see examples/events.sql in the code package).

drop table if exists events;
create table events (
    user_id varchar(100),
    action varchar(100),
    time timestamp,
    item_id int,
    item_name text,
    item_cate varchar(100));
insert into events values('bug29', 'view', '2018-05-12 13:01:11', 13245, '男士护耳保暖鸭舌皮帽平顶八角帽头层牛皮帽子时尚休闲', '男装');
insert into events values('bug29', 'buy', '2018-05-12 13:05:03', 13245, '男士护耳保暖鸭舌皮帽平顶八角帽头层牛皮帽子时尚休闲', '男装');
insert into events values('小武', 'view', '2018-05-13 10:20:32', 23421, 'New Order Technique 2CD豪华版 欧版行货 全新未拆', '音像');
insert into events values('菠菜', 'view', '2018-05-13 20:42:53', 3442, 'NUK安抚奶嘴宝宝防胀气安慰奶嘴乳胶迪士尼安睡型', '母婴');

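Import this data into a local private deployment, using user_id as the user ID, time as the event time, and action as the event name, and importing only item_id and item_name as event properties. The ORDER BY at the end ensures that repeated queries return rows in the same order, so that if an import fails partway, you can use the --skip_cnt parameter to skip the rows that were already imported.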

python3 format_importer.py mysql_event \
  --url 'http://localhost:8106/sa' \
  --distinct_id_from 'user_id' \
  --timestamp_from 'time' \
  --event_from 'action' \
  --user 'root' \
  --password 'pass' \
  --host 'localhost' \
  --port 3307 \
  --db 'test_db' \
  --sql 'select user_id, action, time, item_id, item_name from events order by time;' \
  --debug # remove --debug for the real import

You can also put these parameters in a configuration file (see conf/mysql_event.conf in the code package).

url: http://localhost:8106/sa
distinct_id_from: user_id
event_from: action
timestamp_from: time
user: root
password: pass
host: localhost
port: 3307
db: test_db
sql: select user_id, action, time, item_id, item_name from events order by time;
debug

Then run:

python3 format_importer.py mysql_event @./conf/mysql_event.conf

For a detailed explanation of the parameters, see 4.7 Other Parameters for Importing MySQL Data.

2.5.2 Importing User Properties

Suppose the following MySQL table describes the sample user properties above (see examples/profiles.sql in the code package).

drop table if exists profiles;
create table profiles (
    user_id varchar(100),
    gender varchar(20),
    is_member bool,
    score int);
insert into profiles values('bug29', '男', true, 131);
insert into profiles values('小武', '女', false, null);

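Import this data into a local private deployment, using user_id as the user ID and importing all user properties. The ORDER BY at the end ensures that repeated queries return rows in the same order, so that if an import fails partway, you can use the --skip_cnt parameter to skip the rows that were already imported.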

python3 format_importer.py mysql_profile \
  --url 'http://localhost:8106/sa' \
  --distinct_id_from 'user_id' \
  --user 'root' \
  --password 'pass' \
  --host 'localhost' \
  --port 3307 \
  --db 'test_db' \
  --sql 'select user_id, gender, is_member, score from profiles order by user_id;' \
  --bool_property_list 'is_member' \
  --debug # remove --debug for the real import

You can also put these parameters in a configuration file (see conf/mysql_profile.conf in the code package).

url: http://localhost:8106/sa
distinct_id_from: user_id
user: root
password: pass
host: localhost
port: 3307
db: test_db
sql: select user_id, gender, is_member, score from profiles order by user_id;
bool_property_list: is_member
debug

Then run:

python3 format_importer.py mysql_profile @./conf/mysql_profile.conf

For a detailed explanation of the parameters, see 4.7 Other Parameters for Importing MySQL Data.

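2.6 Importing JSON Logs

You can also write logs to a file in which each line is a JSON object, in the format Sensors Analytics requires, representing an event or a user property update. Suppose the following log describes the sample events and user properties above (see examples/events_and_profiles.json in the code package).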

{"type":"track","time":1526101271000,"distinct_id":"bug29","properties":{"item_id":13245.0,"item_name":"\u7537\u58eb\u62a4\u8033\u4fdd\u6696\u9e2d\u820c\u76ae\u5e3d\u5e73\u9876\u516b\u89d2\u5e3d\u5934\u5c42\u725b\u76ae\u5e3d\u5b50\u65f6\u5c1a\u4f11\u95f2"},"event":"view","time_free":true}
{"type":"track","time":1526101503000,"distinct_id":"bug29","properties":{"item_id":13245.0,"item_name":"\u7537\u58eb\u62a4\u8033\u4fdd\u6696\u9e2d\u820c\u76ae\u5e3d\u5e73\u9876\u516b\u89d2\u5e3d\u5934\u5c42\u725b\u76ae\u5e3d\u5b50\u65f6\u5c1a\u4f11\u95f2"},"event":"buy","time_free":true}
{"type":"track","time":1526178032000,"distinct_id":"\u5c0f\u6b66","properties":{"item_id":23421.0,"item_name":"New Order Technique 2CD\u8c6a\u534e\u7248 \u6b27\u7248\u884c\u8d27 \u5168\u65b0\u672a\u62c6"}, "event":"view","time_free":true}
{"type":"track","time":1526215373000,"distinct_id":"\u83e0\u83dc","properties":{"item_id":3442.0,"item_name":"NUK\u5b89\u629a\u5976\u5634\u5b9d\u5b9d\u9632\u80c0\u6c14\u5b89\u6170\u5976\u5634\u4e73\u80f6\u8fea\u58eb\u5c3c\u5b89\u7761\u578b"},"event":"view","time_free":true}
{"type":"profile_set","time":1526263297951,"distinct_id":"bug29","properties":{"gender":"\u7537","is_member":true,"score":131.0},"time_free":true}
{"type":"profile_set","time":1526263297951,"distinct_id":"\u5c0f\u6b66","properties":{"gender":"\u5973","is_member":false},"time_free":true}

To import this data into a local private deployment, run:

python3 format_importer.py json \
  --url 'http://localhost:8106/sa' \
  --path './examples/events_and_profiles.json' \
  --debug # remove --debug for the real import

You can also put these parameters in json.conf:

url: http://localhost:8106/sa
path: ./examples/events_and_profiles.json
debug

Then run:

python3 format_importer.py json @./conf/json.conf

For a detailed explanation of the parameters, see 4.8 Other Parameters for Importing JSON Logs.
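Before a large import it can help to check that every line of the file parses as JSON and carries the fields seen in the sample above. The sketch below is a rough pre-check only: the list of required fields is inferred from the sample records in this section, not taken from the official data format specification, and `check_line` is a hypothetical helper.

```python
import json

# Assumption: inferred from the sample records above, not from the official spec.
REQUIRED = ("type", "distinct_id", "properties")


def check_line(line):
    """Return a list of problems found in one JSON log line (empty if none)."""
    try:
        record = json.loads(line)
    except ValueError as exc:
        return ["not valid JSON: {}".format(exc)]
    if not isinstance(record, dict):
        return ["top-level value is not an object"]
    problems = ["missing field: " + key for key in REQUIRED if key not in record]
    if record.get("type") == "track" and not record.get("event"):
        problems.append("track record without an event name")
    return problems


good = '{"type":"track","time":1526101271000,"distinct_id":"bug29","properties":{"item_id":13245.0},"event":"view","time_free":true}'
print(check_line(good))
print(check_line('{"type":"track","distinct_id":"bug29","properties":{}}'))
```

Running the importer with --debug remains the authoritative check, since it applies the tool's own validation.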

2.7 Importing Oracle Data

Importing from Oracle requires an extra library. Run the command below to install cx_Oracle, and make sure the Oracle client packages are installed on the machine.

python3 -m pip install cx_Oracle --upgrade

2.7.1 Importing Events

Suppose the following Oracle table describes the sample user events above (see examples/events.plsql in the code package).

drop table if exists events;
create table events (
    user_id varchar(100),
    action varchar(100),
    time timestamp,
    item_id int,
    item_name text,
    item_cate varchar(100));
insert into events values('bug29', 'view', '2018-05-12 13:01:11', 13245, '男士护耳保暖鸭舌皮帽平顶八角帽头层牛皮帽子时尚休闲', '男装');
insert into events values('bug29', 'buy', '2018-05-12 13:05:03', 13245, '男士护耳保暖鸭舌皮帽平顶八角帽头层牛皮帽子时尚休闲', '男装');
insert into events values('小武', 'view', '2018-05-13 10:20:32', 23421, 'New Order Technique 2CD豪华版 欧版行货 全新未拆', '音像');
insert into events values('菠菜', 'view', '2018-05-13 20:42:53', 3442, 'NUK安抚奶嘴宝宝防胀气安慰奶嘴乳胶迪士尼安睡型', '母婴');

python3 format_importer.py oracle_event \
  --url 'http://localhost:8106/sa' \
  --distinct_id_from 'user_id' \
  --timestamp_from 'time' \
  --event_from 'action' \
  --user 'root' \
  --password 'pass' \
  --dsn '127.0.0.1/orcl' \
  --sql 'select user_id, action, time, item_id, item_name from events order by time' \
  --debug # remove --debug for the real import

You can also put these parameters in a configuration file (see conf/oracle_event.conf in the code package).

url: http://localhost:8106/sa
distinct_id_from: user_id
event_from: action
timestamp_from: time
user: root
password: pass
dsn: 127.0.0.1/orcl
sql: select user_id, action, time, item_id, item_name from events order by time
debug

Then run:

python3 format_importer.py oracle_event @./conf/oracle_event.conf

For a detailed explanation of the parameters, see 4.9 Other Parameters for Importing Oracle Data.

2.7.2 Importing User Properties

Suppose the following Oracle table describes the sample user properties above (see examples/profiles.plsql in the code package).

drop table if exists profiles;
create table profiles (
    user_id varchar(100),
    gender varchar(20),
    is_member bool,
    score int);
insert into profiles values('bug29', '男', true, 131);
insert into profiles values('小武', '女', false, null);

python3 format_importer.py oracle_profile \
  --url 'http://localhost:8106/sa' \
  --distinct_id_from 'user_id' \
  --user 'root' \
  --password 'pass' \
  --dsn '127.0.0.1/orcl' \
  --sql 'select user_id, gender, is_member, score from profiles order by user_id' \
  --bool_property_list 'is_member' \
  --debug # remove --debug for the real import

You can also put these parameters in a configuration file (see conf/oracle_profile.conf in the code package).

url: http://localhost:8106/sa
distinct_id_from: user_id
user: root
password: pass
dsn: 127.0.0.1/orcl
sql: select user_id, gender, is_member, score from profiles order by user_id
bool_property_list: is_member
debug

Then run:

python3 format_importer.py oracle_profile @./conf/oracle_profile.conf

For a detailed explanation of the parameters, see 4.9 Other Parameters for Importing Oracle Data.

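3. Usage Recommendations

Test with a subset of the data and the --debug option first, then run the real import. With --debug the tool runs in debug mode: for each record it prints the data to be sent to standard output on success, or an error message on failure. When it finishes, it prints how many records were read and how many failed.
- When importing CSV files or nginx logs, write part of the data to a test file with head -1000 file_name > test_file, then import the test file.
- When importing MySQL data, append LIMIT 1000 to the query for a test import.
While running, the tool writes a log named format_importer.log in the format_importer directory, with more complete debugging information than the console output. If --debug prints too much to the screen, look in the log for errors and debugging information.
Because the parameters are fairly complex, pass them through a configuration file. Sample configuration files are in the conf directory of the unpacked archive.
For CSV imports, make sure the file is valid CSV. Read 5.7 How to Escape CSV first.
Because of the nginx log format, imported property names such as __request_param_action may not be very readable. Using property_list_cnames to map them to readable property names is strongly recommended.
For MySQL imports, a long SQL statement is prone to shell escaping errors when passed on the command line. If something goes wrong, check the log in the format_importer directory: the first log entry after startup records the parameters that were received, so compare it carefully with the SQL you intended to pass. For long SQL statements, write the statement to a file and pass it with --filename.
For MySQL imports where the SQL joins two tables, specify column names or aliases, not <table>.<column>. See 5.4 for details.
For faster imports, first export to a log file with --output_file, then load that file into the cluster with BatchImporter or HDFSImporter.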

4. Detailed Usage

4.1 Subcommands

The subcommand is the first argument after the script name. For example, in the following command:

python3 format_importer.py csv_event \
  --url 'http://localhost:8106/sa' \
  --event_default 'UserBuy' \
  --distinct_id_from 'user' \
  --timestamp_from 'buy_time' \
  --filename 'buy.csv'

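The subcommand used here is csv_event, which imports CSV data as events. The following 9 subcommands are currently supported.

| Subcommand | Description |
| --- | --- |
| csv_profile | Import a CSV file as profiles |
| csv_event | Import a CSV file as events |
| mysql_profile | Import MySQL data as profiles, using the SQL you provide |
| mysql_event | Import MySQL data as events, using the SQL you provide |
| nginx_profile | Import an nginx log as profiles |
| nginx_event | Import an nginx log as events |
| json | Import a JSON log; the log may mix events and profiles |
| oracle_profile | Import Oracle data as profiles, using the SQL you provide |
| oracle_event | Import Oracle data as events, using the SQL you provide |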

To see which parameters a subcommand supports, add -h after the subcommand to list all parameters with their descriptions, for example:

python3 format_importer.py csv_event -h
python3 format_importer.py json -h

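4.2 Importing from a Configuration File

format_importer can read parameters from a configuration file: add @<path to configuration file> after the subcommand. The downloaded source package includes several default configurations in the conf directory, which you can edit and use.

For example, to import CSV events, edit conf/csv_event.conf and then run: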

python3 format_importer.py csv_event @./conf/csv_event.conf
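The `@<file>` convention resembles the `fromfile_prefix_chars` feature of Python's argparse module. Whether format_importer.py is implemented this way is not stated here, so the following is only a sketch of how `key: value` lines and bare flags such as `debug` could be mapped onto command-line options. The class `ConfParser` and the reduced option set are hypothetical.

```python
import argparse
import os
import tempfile


class ConfParser(argparse.ArgumentParser):
    """Reads '@file' arguments whose lines look like 'key: value' or a bare flag."""

    def convert_arg_line_to_args(self, arg_line):
        line = arg_line.strip()
        if not line:
            return []
        key, sep, value = line.partition(":")  # split on the first colon only
        if not sep:
            return ["--" + key]                # bare flag, e.g. "debug"
        return ["--" + key.strip(), value.strip()]


def make_parser():
    parser = ConfParser(fromfile_prefix_chars="@")
    parser.add_argument("--url")
    parser.add_argument("--filename")
    parser.add_argument("--property_list")
    parser.add_argument("--debug", action="store_true")
    return parser


def parse_conf_text(text):
    """Write the text to a temporary file and parse it through the '@' mechanism."""
    with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as handle:
        handle.write(text)
    try:
        return make_parser().parse_args(["@" + handle.name])
    finally:
        os.remove(handle.name)


args = parse_conf_text(
    "url: http://localhost:8106/sa\n"
    "filename: ./examples/events.csv\n"
    "property_list: item_id,item_name\n"
    "debug\n"
)
print(args.url, args.debug)
```

Splitting on the first colon only keeps values such as URLs and time formats intact, which is why `url: http://localhost:8106/sa` yields the full URL.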

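4.3 Common Parameters

The common parameters are:

| Parameter | Alias | Required | Description |
| --- | --- | --- | --- |
| --url | -l | Required unless --output_file is given | URL to send data to. Section 2.1 describes how to obtain it. |
| --output_file | -O | Required unless --url is given | Output file name. Each output line is one JSON record in the required format. |
| --project | -j | No | Project name. Defaults to default. |
| --skip_cnt | -c | No | Ignore on the first run. If a run fails, this is the number of rows to skip because they were already processed successfully. |
| --debug | -D | No | Enables debug mode: no data is imported and the data is only printed to stdout. See the description of debug mode in section 3. |
| --quit_on_error | -Q | No | If set, the tool exits on the first error. |

In addition, when importing CSV files, nginx logs, or MySQL/Oracle data, you must choose whether to import events or profiles, and each has its own common parameters. When importing JSON logs, only the common parameters above can be set.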

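4.4 Common Parameters for event/profile

For profiles, you must specify which column to use as distinct_id.

For events, in addition to distinct_id you must specify the event name and the timestamp. The event name can be given in two ways: treat all data as the same event, or specify which column holds the event name. Likewise, the timestamp can be given in two ways: use the same timestamp for all data, or specify which column holds the timestamp.

| Parameter | Alias | When importing profiles | When importing events | Description |
| --- | --- | --- | --- | --- |
| --distinct_id_from | -df | Required | Required | Field to use as distinct_id. Each record is attributed to the user identified by this field. |
| --is_login | | Optional | Optional | Whether distinct_id is a login ID; defaults to no. Note: not supported when importing JSON data; add the "$is_login_id" property to the JSON data instead. |
| --event_from | -ef | Not allowed | Required unless --event_default is given | Field to use as the event name. The event name of each record is the value of this field. |
| --event_default | -ed | Not allowed | Required unless --event_from is given | Default event name. All data is treated as this event. |
| --timestamp_from | -tf | Not allowed | Optional | Field to use as the time. The time of each record is the value of this field. |
| --timestamp_default | -td | Not allowed | Optional | Default timestamp. All data is treated as events at this time. |
| --timestamp_format | -tf | Not allowed | Optional | Used with timestamp_from. If the field named by timestamp_from is a string, this specifies its time format. Defaults to %Y-%m-%d %H:%M:%S. |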
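The format strings used with --timestamp_format look like Python strptime directives. Assuming that is how they are interpreted, the following sketch shows how a time string and its format yield epoch milliseconds. The helper `to_epoch_millis` is hypothetical.

```python
from datetime import datetime


def to_epoch_millis(text, timestamp_format="%Y-%m-%d %H:%M:%S"):
    """Parse text with a strptime-style format and return epoch milliseconds.

    If the format has no %z, the result depends on the local time zone of the
    machine running the code.
    """
    parsed = datetime.strptime(text, timestamp_format)
    return int(parsed.timestamp() * 1000)


# The nginx $time_local example from section 2.4.1, with an explicit UTC offset.
millis = to_epoch_millis("12/May/2018:13:01:11 +0800", "%d/%b/%Y:%H:%M:%S %z")
print(millis)
```

With the +0800 offset this evaluates to 1526101271000, the same value as the `time` field of the first record in the JSON sample in section 2.6.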

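4.5 Other Parameters for Importing CSV Data

| Parameter | Alias | Required | Description |
| --- | --- | --- | --- |
| --filename | -f | Yes | Path of the CSV file. |
| --property_list | -pl | No | Comma-separated list of properties to import. For example, --property_list name,time imports the name and time columns as properties. If omitted, all columns are imported as properties. |
| --skip_identify | -i | No | Columns that skip automatic type detection. For example, --skip_identify name,id imports name and id as strings without type detection. If omitted, all selected columns go through automatic type detection. |
| --ignore_value | | No | Values to treat as empty. For example, --ignore_value null treats every null as an empty value. |
| --add_cname | -ac | No | Whether to add Chinese display names; applies to events only. If the CSV header is in Chinese, the tool converts each property name to its pinyin. With add_cname set, the tool also uploads the mapping between the converted names and the Chinese names at the end of the run, so the properties appear in Chinese in the UI. |
| --web_url | -w | Required if add_cname is set | URL of the web UI, for example http://localhost:8007 for the standalone edition or http://xxx.cloud.sensorsdata.cn for the Cloud edition. |
| --username | -u | Required if add_cname is set | Web login user name. |
| --password | -p | Required if add_cname is set | Web login password. |
| --csv_delimiter | | No | Column delimiter of the CSV file. Defaults to ','. Accepts a single character, or a backslash followed by an ASCII code, for example \9 for \t (tab). |
| --csv_quotechar | | No | Quote character of the CSV file, which marks the start and end of a string. Defaults to '"'. Accepts a single character, or a backslash followed by an ASCII code, for example \9 for \t (tab). |
| --csv_prefetch_lines | | No | Number of lines to pre-read from the CSV file in order to detect column types. Defaults to -1, which pre-reads the whole file. Do not set this if the data is unevenly distributed (for example, a field is missing in the first rows but present later). |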

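4.6 Other Parameters for Importing nginx Logs

| Parameter | Alias | Required | Description |
| --- | --- | --- | --- |
| --filename | -f | Yes | Path of the nginx log file. |
| --log_format | -F | Yes | nginx log format, for example '"$remote_addr" "$time_local" "$http_referer" "$status"'. |
| --property_list | -pl | Yes | Comma-separated list of properties to import. For example, --property_list http_referer,status imports the http_referer and status fields as properties. |
| --skip_identify | -i | No | Fields that skip automatic type detection. For example, --skip_identify request_user,status imports request_user and status as strings without type detection. If omitted, all selected fields go through automatic type detection. |
| --url_fields | -uf | No | Comma-separated list of fields to parse as URLs. Parsing produces properties named __<field name>_<part>, where the part is netloc, path, query, or param_<parameter name>. For example, a $my_url field with the value http://www.abc.com/path/to/mine?k1=v1&k2=2 is parsed into {"__my_url_netloc": "www.abc.com","__my_url_path": "/path/to/mine", "__my_url_query":"k1=v1&k2=2", "__my_url_param_k1": "v1","__my_url_param_k2":2}. These generated names can be used in property_list. Defaults to "http_referer". |
| --filter_path | -fp | No | Paths to filter out; may be given multiple times. The path is taken from $request. Regular expressions are supported. For example, --filter_path '.*\.gif' --filter_path '/index\.html' filters out requests for gif files and for index.html. |
| --ip_from | -if | No | Events only. Field to use as the IP address of each record. Defaults to $remote_addr. |
| --ignore_value | | No | Values to treat as empty. For example, --ignore_value null treats every null as an empty value. |
| --property_list_cnames | | No | Comma-separated display names for the properties; must correspond one-to-one with --property_list. |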
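The expansion described for --url_fields can be reproduced with Python's urllib.parse. This is a sketch of the naming scheme only, with a hypothetical helper `expand_url_field`; it keeps every value as a string, whereas the tool's automatic type detection would turn a value like 2 into a number.

```python
from urllib.parse import parse_qsl, urlsplit


def expand_url_field(name, value):
    """Expand one URL-valued field into __<name>_<part> properties."""
    parts = urlsplit(value)
    prefix = "__{}_".format(name)
    expanded = {
        prefix + "netloc": parts.netloc,
        prefix + "path": parts.path,
        prefix + "query": parts.query,
    }
    # One extra property per query parameter.
    for key, val in parse_qsl(parts.query, keep_blank_values=True):
        expanded[prefix + "param_" + key] = val
    return expanded


result = expand_url_field("my_url", "http://www.abc.com/path/to/mine?k1=v1&k2=2")
for key in sorted(result):
    print(key, "=", result[key])
```

The same scheme explains names such as `__request_param_action` used in section 2.4: the field name is request and the part is param_action.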

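4.7 Other Parameters for Importing MySQL Data

| Parameter | Alias | Required | Description |
| --- | --- | --- | --- |
| --user | -u | Yes | MySQL user name. |
| --password | -p | Yes | MySQL password. |
| --host | -i | Yes | MySQL host address. |
| --port | -P | Yes | MySQL port. |
| --db | -d | Yes | MySQL database name. |
| --sql | -q | Required unless --filename is given | Query statement. Add ORDER BY or similar so that repeated queries return rows in the same order. |
| --filename | -f | Required unless --sql is given | Path of a file containing the query statement. Add ORDER BY or similar so that repeated queries return rows in the same order. |
| --bool_property_list | -bp | No | Comma-separated list of boolean properties. A value of 1 is converted to true and 0 to false. |
| --case_sensitive | -cs | No | Whether imported property names are case sensitive. If not, all names are converted to upper case. Defaults to true. |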

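4.8 Other Parameters for Importing JSON Logs

| Parameter | Alias | Required | Description |
| --- | --- | --- | --- |
| --path | -p | Yes | Path of the log file or directory. |

If a directory is given, the tool imports all files directly under that directory in alphabetical order. Nested directories are not supported.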

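4.9 Other Parameters for Importing Oracle Data

| Parameter | Alias | Required | Description |
| --- | --- | --- | --- |
| --user | -u | Yes | Oracle user name. |
| --password | -p | Yes | Oracle password. |
| --dsn | -dsn | Yes | Oracle DSN. |
| --sql | -q | Required unless --filename is given | Query statement. Add ORDER BY or similar so that repeated queries return rows in the same order. |
| --filename | -f | Required unless --sql is given | Path of a file containing the query statement. Add ORDER BY or similar so that repeated queries return rows in the same order. |
| --bool_property_list | -bp | No | Comma-separated list of boolean properties. A value of 1 is converted to true and 0 to false. |
| --case_sensitive | -cs | No | Whether imported property names are case sensitive. If not, all names are converted to upper case. Defaults to false. |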

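5. FAQ

5.1 Are Chinese CSV headers supported?

As described in the data format documentation, property names cannot contain Chinese characters, but a property can be displayed under a Chinese name in the UI. Setting --add_cname does this automatically. Continuing the example above, buy.csv looks like this: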

用户名,购买时间,商品id,商品名称,商品类别
小明,2015-01-20 10:35:22,13579,真皮帽子 男士护耳保暖鸭舌皮帽平顶八角帽头层牛皮帽子时尚休闲,男装
小芳,2015-07-13 23:12:03,24680,官方正品ZINO 3D透亮无瑕BB霜SPF30PA++ 防晒遮瑕美白 小样 3ml,护肤
小武,2015-04-03 20:30:01,31415,New Order Technique 2CD豪华版 欧版行货 全新未拆,音像

The import command is:

python3 format_importer.py csv_event \
  --url 'http://localhost:8006/sa' \
  --event_default 'UserBuy' \
  --distinct_id_from '用户名' \
  --timestamp_from '购买时间' \
  --filename 'buy.csv' \
  --add_cname \
  --web_url 'http://localhost:8007' \
  --username admin \
  --password password

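Note that encoding requirements differ between platforms. Make sure the default encoding matches the file encoding. See 5.9 Using FormatImporter on Windows.

5.2 How do I filter out static files when importing nginx logs?

Suppose the nginx log contains requests for gif, css, and js files that you want to exclude. Use --filter_path to filter them out.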

python3 format_importer.py nginx_event \
  --filter_path '.*\.gif' \
  --filter_path '.*\.css' \
  --filter_path '.*\.js' \
  # other parameters...
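To check which paths a set of patterns would drop before running an import, the patterns can be tried out in Python. This sketch assumes that a pattern must match the whole path (re.fullmatch); how FormatImporter anchors the patterns is not stated here, so treat the result as an approximation. `is_filtered` is a hypothetical helper.

```python
import re

FILTERS = [re.compile(p) for p in (r".*\.gif", r".*\.css", r".*\.js")]


def is_filtered(path):
    # Assumption: a pattern must match the entire path, not just a prefix.
    return any(pattern.fullmatch(path) for pattern in FILTERS)


for path in ("/static/app.js", "/img/logo.gif", "/item", "/data.json"):
    print(path, is_filtered(path))
```

Under prefix matching instead of whole-path matching, '.*\.js' would also drop /data.json, so ending patterns with '$' is a cautious choice if the anchoring behaviour is unknown.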

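5.3 What if the import fails partway?

By default, when a record fails to parse, the tool prints the cause and the line number, discards the record, and continues. The log looks like this: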

2015-10-28 14:58:52,020 808 WARNING failed to parse line 12
2015-10-28 14:58:52,021 809 WARNING Traceback (most recent call last):
  File "format_importer.py", line 804, in main
    sa.track(distinct_id, event, properties)
  File "/Users/padme/git/sa2/tools/format_importer/sensorsanalytics/sdk.py", line 118, in track
    data = self._normalize_data(data)
  File "/Users/padme/git/sa2/tools/format_importer/sensorsanalytics/sdk.py", line 149, in _normalize_data
    raise SensorsAnalyticsIllegalDataException("property [distinct_id] must not be empty")
sensorsanalytics.sdk.SensorsAnalyticsIllegalDataException: property [distinct_id] must not be empty

When the run ends, the tool prints how many lines were read (total_read), written (total_write), failed (error), and skipped (skip), like this:

2015-10-28 14:58:52,023 618 INFO end import nginx
2015-10-28 14:58:52,024 838 INFO counter = {'error': 3, 'skip': 0, 'total': 300, 'total_read': 100, 'total_write': 97}.

To stop at the first error instead, add the --quit_on_error option. The log then looks like this:

2015-10-28 14:58:29,499 808 WARNING failed to parse line 12
2015-10-28 14:58:29,502 809 WARNING Traceback (most recent call last):
  File "format_importer.py", line 804, in main
    sa.track(distinct_id, event, properties)
  File "/Users/padme/git/sa2/tools/format_importer/sensorsanalytics/sdk.py", line 118, in track
    data = self._normalize_data(data)
  File "/Users/padme/git/sa2/tools/format_importer/sensorsanalytics/sdk.py", line 149, in _normalize_data
    raise SensorsAnalyticsIllegalDataException("property [distinct_id] must not be empty")
sensorsanalytics.sdk.SensorsAnalyticsIllegalDataException: property [distinct_id] must not be empty

2015-10-28 14:58:29,502 618 INFO end import nginx
2015-10-28 14:58:29,502 835 ERROR failed to import, please fix it and run with[--skip_cnt 11] again!

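Note the hint at the bottom: 11 lines were imported successfully. Fix the data on line 12, then rerun the previous command with --skip_cnt 11 added.

For MySQL in particular, to avoid unrecoverable data errors, make sure the query returns the same result every time it is run:

No new data is being written, for example by adding a WHERE clause so that only historical data is imported.
The result is sorted so that the order is stable, for example by adding ORDER BY id.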
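The reason a stable order matters can be shown with a small simulation. The sketch below uses an in-memory SQLite table as a stand-in for MySQL and a hypothetical `import_rows` function that stops at the first bad row, in the spirit of --quit_on_error; it is not the tool's code.

```python
import sqlite3


def open_demo_db():
    db = sqlite3.connect(":memory:")
    db.execute("create table events (id integer primary key, user_id text)")
    db.executemany(
        "insert into events (id, user_id) values (?, ?)",
        [(1, "bug29"), (2, ""), (3, "bug29")],  # row 2 has an empty user_id
    )
    return db


def import_rows(db, skip_cnt=0):
    """Return (imported ids, number of rows completed before a failure or None)."""
    imported = []
    cursor = db.execute("select id, user_id from events order by id")
    for index, (row_id, user_id) in enumerate(cursor):
        if index < skip_cnt:
            continue                  # already imported in an earlier run
        if not user_id:
            return imported, index    # rerun later with skip_cnt = index
        imported.append(row_id)
    return imported, None


db = open_demo_db()
first_run = import_rows(db)
db.execute("update events set user_id = 'fixed' where id = 2")
second_run = import_rows(db, skip_cnt=first_run[1])
print(first_run, second_run)
```

Skipping by count only lines up with the rows already imported because both runs use the same ORDER BY; without it, the database may return rows in a different order and the skip would drop or duplicate data.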

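5.4 I joined two MySQL tables for the import and the data is there. Why am I told that distinct_id / timestamp / event is empty?

When joining two tables, each column name must either be unique or be given an alias. These column names are what you pass to distinct_id_from, timestamp_from, and event_from, and they also become the property names after import.

For example, with the following SQL: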

SELECT a.uid, a.event, a.time, b.property1, b.property2
FROM a JOIN b ON a.action_id = b.action_id

The parameters must then be specified as:

--distinct_id_from 'uid' \
--timestamp_from 'time' \
--event_from 'event'

The imported properties are named (property1, property2), not (b.property1, b.property2).

If the column names are not unique, the alternative is to use aliases. For example, with the following SQL:

SELECT a.u AS uid, a.e AS event, a.t AS time, b.property AS property1, a.property AS property2
FROM a JOIN b ON a.action_id = b.action_id

The parameters are then specified as:

--distinct_id_from 'uid' \
--timestamp_from 'time' \
--event_from 'event'

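The imported properties are again named (property1, property2).

5.5 When importing from MySQL, how do I convert numbers to text or text to numbers?

The MySQL CAST function supports type conversion.

For example, if a table has a column property1 of type int and a column property2 of type varchar(10), you can use this SQL: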

SELECT CAST(property1 AS CHAR(10)) AS property1, CAST(property2 AS SIGNED) AS property2 FROM test_table;

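This converts the value of property1 to text of up to 10 characters and the value of property2 to a number.

5.6 How do I import into another project?

If no project is specified explicitly, data goes to the default project (default).

You can specify the project either by including project=<project_name> in --url or with the --project parameter. If both are given, --project takes precedence.

Note: if the --url value was obtained via "Account" (top right) -> "Data Access" -> "Copy data receiving URL", the copied URL already contains the project parameter and nothing more is needed.

In addition, when importing JSON logs, you can add "project":"project name" to a record so that one data set can be imported into multiple projects. The project field in the log takes precedence over the parameters.

5.7 How to Escape CSV

CSV is a comma-separated file format. If the content of a column contains a comma, it must be enclosed in double quotes; otherwise FormatImporter reports an error. For example, with this CSV content: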

col1,col2,col3
a,b,c,d

Running the import then fails with:

csv error near line 1: content has 1 more fields than header

That is, the content has one more column than the header. The correct form adds double quotes:

col1,col2,col3
a,"b,c",d

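This is parsed as:

| col1 | col2 | col3 |
| --- | --- | --- |
| a | b,c | d |

Note that a quoted field can span lines. For example, with this CSV content: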

col1,col2,col3
a,"b,c
d,e",f

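This is parsed as:

| col1 | col2 | col3 |
| --- | --- | --- |
| a | b,c\nd,e | f |

So if a column starts with a double quote, the CSV parser reads on until the next double quote. If there are too many characters between the two quotes, it reports that the field is too long:

_csv.Error: field larger than field limit (131072)

There are two ways to fix this. One is to replace the default quote character with --csv_quotechar, choosing a character that never appears in the data. The other is to escape the double quote: turn each double quote into two, and enclose the whole column in double quotes. For example, with this CSV content: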

col1,col2,col3
a,"""b",c

This is parsed as:

| col1 | col2 | col3 |
| --- | --- | --- |
| a | "b | c |
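The three quoting cases above can be checked with Python's standard csv module, which the `_csv.Error` message suggests is what the tool relies on. This is a small sketch with a hypothetical helper `parse_csv`, using the default delimiter and quote character.

```python
import csv
import io


def parse_csv(text):
    """Parse CSV text with the default delimiter (,) and quote character (")."""
    return list(csv.reader(io.StringIO(text, newline="")))


# A comma inside a quoted field stays within that field.
quoted = parse_csv('col1,col2,col3\na,"b,c",d\n')
# A quoted field may span lines.
multiline = parse_csv('col1,col2,col3\na,"b,c\nd,e",f\n')
# A doubled quote inside a quoted field is one literal quote.
escaped = parse_csv('col1,col2,col3\na,"""b",c\n')

print(quoted[1])
print(multiline[1])
print(escaped[1])
```

Parsing a suspicious file this way before importing makes it easy to spot an unbalanced quote, since the affected row swallows the lines that follow it.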

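5.8 For JSON logs, how does importing with FormatImporter differ from importing with LogAgent or BatchImporter?

LogAgent imports log content as a real-time stream. If the log is continuously updated, use LogAgent. FormatImporter and BatchImporter both perform one-off imports.
BatchImporter is the most efficient of the three, but it can only run on machines where Sensors Analytics is deployed. FormatImporter and LogAgent can run on any machine.

5.9 Using FormatImporter on Windows

Because the terminal syntax and the default encoding differ on Windows, keep the following in mind when using FormatImporter there:

Pass parameters through a configuration file rather than the command line whenever possible, to avoid escaping problems. The example commands in this document are written for Unix systems; running them unchanged on Windows may cause syntax or escaping errors.
The sample configuration files in the conf directory and the sample data files in the examples directory are all UTF-8 encoded, while the default Windows encoding is GBK, so using them directly on Windows fails. Write a new configuration file each time, and generate data files directly on Windows. For example, to test with CSV, export the file from Excel.
Accessing MySQL or Oracle from Windows requires special handling of encodings. Contact Sensors Analytics technical support for details.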
