
regex – sed – Removing quotes inside quotes in a large CSV file

Published: 2020-12-24 13:52:45 | Category: Linux | Source: compiled from the web

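I am using the stream editor sed to convert a large set of text files (400 MB) into CSV format.

I am very close to finishing, but the outstanding problem is quotes inside quoted fields, for data like this: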

1,word1,"description for word1","another text",""text contains "double quotes" some more text"
2,word2,"description for word2","text may not contain double quotes,but may contain commas,"
3,word3,"description for "word3"","more text and more"

The desired output is:

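1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","text may not contain double quotes,but may contain commas,"
3,word3,"description for word3","more text and more"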

I have been searching for help but have not gotten close to a solution. I tried the following sed commands with these regex patterns:

sed -i 's/(?<!^\s*|,)""(?!,""|\s*$)//g' *.txt
sed -i 's/(?<=[^,])"(?=[^,])//g' *.txt

These come from the questions below, but they do not seem to work with sed:

Related question for perl

Related question for SISS

The original files are *.txt, and I am trying to edit them in place with sed.
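Editorial note: sed's regex dialects (BRE and ERE) have no lookbehind or lookahead, which is why the patterns above fail there. As a rough illustration only, the second pattern does behave as intended in an engine that supports lookarounds. The sketch below uses Python's `re` module; the function name `strip_inner_quotes` is made up for this example, and it assumes each line is passed without its trailing newline (otherwise the closing quote of the last field would also match).

```python
import re

# A double quote that has a non-comma character on both sides.
# Quotes at the very start or end of the line never match, because
# each lookaround requires an actual character to be present.
INNER_QUOTE = re.compile(r'(?<=[^,])"(?=[^,])')


def strip_inner_quotes(line: str) -> str:
    """Remove quotes that are not adjacent to a comma or a line boundary.

    Assumes `line` has no trailing newline.
    """
    return INNER_QUOTE.sub("", line)


if __name__ == "__main__":
    sample = '3,word3,"description for "word3"","more text and more"'
    print(strip_inner_quotes(sample))
```

This heuristic breaks when an inner quote sits directly next to a comma inside a field, so it is not a general CSV repair.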

Solution

Here is one way using GNU awk and the FPAT variable:
gawk 'BEGIN { FPAT="([^,]+)|(\"[^\"]+\")"; OFS=","; N="\"" } { for (i=1;i<=NF;i++) if ($i ~ /^".*"$/) { gsub(/"/,"",$i); $i=N $i N } }1' file

Result:

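1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","text may not contain double quotes,but may contain commas,"
3,word3,"description for word3","more text and more"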

Explanation:

Using FPAT, a field is defined as either "anything that is not a comma" or "a double quote, anything that is not a double quote, and a closing double quote". Then, on every line of input, loop through each field, and if the field starts and ends with a double quote, remove all quotes from the field. Finally, add double quotes surrounding the field.
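For readers without gawk, the same idea can be sketched in Python. This is an illustration of the approach, not the answer's own code: the names `split_fields` and `clean_line` are made up here, and the "take the longest alternative at each position" step is an explicit stand-in for the leftmost-longest matching that awk regular expressions use (Python's `re` picks the first alternative that matches instead).

```python
import re

# The two alternatives from the FPAT pattern, kept separate so the
# longer match can be chosen by hand at each position.
UNQUOTED = re.compile(r'[^,]+')
QUOTED = re.compile(r'"[^"]+"')


def split_fields(line: str) -> list[str]:
    """Split a line into fields by content, roughly as FPAT does."""
    fields = []
    pos = 0
    while pos < len(line):
        matches = [m for m in (UNQUOTED.match(line, pos),
                               QUOTED.match(line, pos)) if m]
        if not matches:
            pos += 1  # a separator comma: skip it
            continue
        best = max(matches, key=lambda m: m.end())  # longest match wins
        fields.append(best.group())
        pos = best.end()
    return fields


def clean_line(line: str) -> str:
    """Strip all quotes from quoted fields, then re-wrap them in quotes."""
    cleaned = []
    for field in split_fields(line):
        if len(field) >= 2 and field.startswith('"') and field.endswith('"'):
            field = '"' + field.replace('"', '') + '"'
        cleaned.append(field)
    return ",".join(cleaned)


if __name__ == "__main__":
    print(clean_line('3,word3,"description for "word3"","more text and more"'))
```

As with the awk version, a field that contains both embedded quotes and commas can still be split in the wrong place, and empty fields are dropped, so this suits data shaped like the sample above.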
