Tika metadata from email misses date

I have two email test files:

(1) A file created by using "Save As" in Mac Mail (this creates a .txt file)
(2) A file created by dragging an email from Mac Mail to the Desktop (this creates an .eml file)

If I submit each file with

curl -T filename http://localhost:9998/detect/stream

I get the response "message/rfc822" for both files.

If I run

curl -T filename http://localhost:9998/meta

I get the metadata, but for file (1) no date is extracted, while for file (2) it is.
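
To compare the two responses side by side, the same request can be scripted. The following is a minimal Python sketch, assuming a Tika server at http://localhost:9998 that returns JSON from /meta when asked via the Accept header. The helper names (build_meta_request, date_like_fields, fetch_metadata) are made up for illustration. The exact metadata keys that carry the date are not assumed here, so the filter simply keeps any key that mentions "date" or "created".

```python
import json
import urllib.request


def build_meta_request(data, base_url="http://localhost:9998"):
    """Build the PUT request that `curl -T file <base_url>/meta` would send."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/meta",
        data=data,
        method="PUT",
        # Assumption: the server answers with JSON when asked for it.
        headers={"Accept": "application/json"},
    )


def date_like_fields(metadata):
    """Keep only the entries whose key mentions 'date' or 'created'."""
    return {
        key: value
        for key, value in metadata.items()
        if "date" in key.lower() or "created" in key.lower()
    }


def fetch_metadata(path, base_url="http://localhost:9998"):
    """Send one file to the server and return its metadata as a dict."""
    with open(path, "rb") as handle:
        request = build_meta_request(handle.read(), base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, calling date_like_fields(fetch_metadata(path)) once for the .txt file and once for the .eml file should make the missing date in case (1) easy to spot.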

I understand, of course, that the .eml file includes the full raw header, while the .txt file includes only a very abbreviated header. However, even the abbreviated header includes a "Date" field, so I think Tika should extract it. Is this a bug or intentional? If it is intentional, is there anything I can do to get Tika to extract the date in case (1)?

I am running Tika-server 1.14.

Best answer

Thank you for opening TIKA-1970; the underlying Apache James mime4j library isn't able to parse a date in the format "16 May 2016 at 09:30:32 GMT+1". We'll add extra date parsing code at the Tika level to catch the date formats that mime4j doesn't recognize.

Again, thank you for noticing and for opening an issue on our JIRA.
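
Until such a fix is available, the date can be recovered on the client side. The sketch below is not the code added to Tika; it is a hypothetical Python helper (parse_mac_mail_date is an invented name) that handles only the one format quoted above, assuming English month names and a whole-hour offset after "GMT" or "UTC". Other variants that Mac Mail may produce are not covered.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches only the format quoted above, e.g. "16 May 2016 at 09:30:32 GMT+1".
# Assumption: the offset, if present, is a whole number of hours.
_MAC_MAIL_DATE = re.compile(
    r"^\s*(\d{1,2}) (\w+) (\d{4}) at (\d{2}:\d{2}:\d{2}) (?:GMT|UTC)([+-]\d{1,2})?\s*$"
)


def parse_mac_mail_date(value):
    """Return a timezone-aware datetime, or None if the value does not match."""
    match = _MAC_MAIL_DATE.match(value)
    if match is None:
        return None
    day, month, year, clock, offset = match.groups()
    try:
        # %B expects a full English month name under the default locale.
        naive = datetime.strptime(
            f"{day} {month} {year} {clock}", "%d %B %Y %H:%M:%S"
        )
    except ValueError:
        return None
    tz = timezone(timedelta(hours=int(offset or 0)))
    return naive.replace(tzinfo=tz)
```

Returning None for anything unrecognized lets a caller try this helper only after a standard RFC 822 date parser has failed, which mirrors the fallback idea described in the answer.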
