Follow Techotopia on Twitter

On-line Guides
All Guides
eBook Store
iOS / Android
Linux for Beginners
Office Productivity
Linux Installation
Linux Security
Linux Utilities
Linux Virtualization
Linux Kernel
System/Network Admin
Programming
Scripting Languages
Development Tools
Web Development
GUI Toolkits/Desktop
Databases
Mail Systems
openSolaris
Eclipse Documentation
Techotopia.com
Virtuatopia.com
Answertopia.com

How To Guides
Virtualization
General System Admin
Linux Security
Linux Filesystems
Web Servers
Graphics & Desktop
PC Hardware
Windows
Problem Solutions
Privacy Policy

  




 

 

The sed FAQ
Prev Home Next

4.12. How do I parse a comma-delimited (CSV) data file?

Comma-delimited data files can come in several forms, requiring increasing levels of complexity in parsing and handling. They are often referred to as CSV files (for "comma separated values") and occasionally as SDF files (for "standard data format"). Note that some vendors use "SDF" to refer to variable-length records with comma-separated fields which are "double-quoted" if they contain character values, while other vendors use "SDF" to designate fixed-length records with fixed-length, nonquoted fields! (For help with fixed-length fields, see question 4.23)

The term "CSV" became a de-facto standard when Microsoft Excel used it as an optional output file format.

Here are 4 different forms you may encounter in comma-delimited data:

(a) No quotes, no internal commas

       1001,John Smith,PO Box 123,Chicago,IL,60699
       1002,Mary Jones,320 Main,Denver,CO,84100,

(b) Like (a), with quotes around each field

       "1003","John Smith","PO Box 123","Chicago","IL","60699"
       "1004","Mary Jones","320 Main","Denver","CO","84100"

(c) Like (b), with embedded commas

       "1005","Tom Hall, Jr.","61 Ash Ct.","Niles","OH","44446"
       "1006","Bob Davis","429 Pine, Apt. 5","Boston","MA","02128"

(d) Like (c), with embedded commas and quotes

       "1007","Sue "Red" Smith","19 Main","Troy","MI","48055"
       "1008","Joe "Hey, guy!" Hall","POB 44","Reno","NV","89504"

In each example above, we have 7 fields and 6 commas which function as field separators. Case (c) is a very typical form of these data files, with double quotes used to enclose each field and to protect internal commas (such as "Tom Hall, Jr.") from interpretation as field separators. However, many times the data may include both embedded quotation marks as well as embedded commas, as seen by case (d), above.

Case (d) is the closest to Microsoft CSV format. However, the Microsoft CSV format allows embedded newlines within a double-quoted field. If embedded newlines within fields are a possibility for your data, you should consider using something other than sed to work with the data file.

Before handling a comma-delimited data file, make sure that you fully understand its format and check the integrity of the data. Does each line contain the same number of fields? Should certain fields be composed only of numbers or of two-letter state abbreviations in all caps? Sed (or awk or perl) should be used to validate the integrity of the data file before you attempt to alter it or extract particular fields from the file.

After ensuring that each line has a valid number of fields, use sed to locate and modify individual fields, using the \(...\) grouping command where needed.

In case (a):

     sed 's/^[^,]*,[^,]*,[^,]*,[^,]*,/.../'
^ ^ ^ | | |_ 3rd field | |_______ 2nd field |_____________ 1st field
     # Unix script to delete the second field for case (a)
     sed 's/^\([^,]*\),[^,]*,/\1,,/' file
     # Unix script to change field 1 to 9999 for case (a)
     sed 's/^[^,]*,/9999,/' file

In cases (b) and (c):

     sed 's/^"[^"]*","[^"]*","[^"]*","[^"]*",/.../'
1st-- 2nd-- 3rd-- 4th--
     # Unix script to delete the second field for case (c)
     sed 's/^\("[^"]*"\),"[^"]*",/\1,"",/' file
     # Unix script to change field 1 to 9999 for case (c)
     sed 's/^"[^"]*",/"9999",/' file

In case (d):

One way to parse such files is to replace the 3-character field separator "," with an unused character like the tab or vertical bar. (Technically, the field separator is only the comma while the fields are surrounded by "double quotes", but the net effect is that fields are separated by quote-comma-quote, with quote characters added to the beginning and end of each record.) Search your datafile first to make sure that your character appears nowhere in it!

     sed -n '/|/p' file        # search for any instance of '|'
     # if it's not found, we can use the '|' to separate fields

Then replace the 3-character field separator and parse as before:

     # sed script to delete the second field for case (d)
     s/","/|/g;                  # global change of "," to bar
     s/^\([^|]*\)|[^|]|/\1||/;   # delete 2nd field
     s/|/","/g;                  # global change of bar back to ","
     #---end of script---
     # sed script to change field 1 to 9999 for case (d)
     # Remember to accommodate leading and trailing quote marks
     s/","/|/g;
     s/^[^|]*|/"9999|/;
     s/|/","/g;
     #---end of script---

Note that this technique works only if each and every field is surrounded with double quotes, including empty fields.

The following solution is for more complex examples of (d), such as: not all fields contain "double-quote" marks, or the presence of embedded "double-quote" marks within fields, or extraneous whitespace around field delimiters. (Thanks to Greg Ubben for this script!)

     # sed script to convert case (d) to bar-delimited records
     s/^ *\(.*[^ ]\) *$/|\1|/;
     s/" *, */"|/g;
     : loop
     s/| *\([^",|][^,|]*\) *, */|\1|/g;
     s/| *, */|\1|/g;
     t loop
     s/  *|/|/g;
     s/|  */|/g;
     s/^|\(.*\)|$/\1/;
     #---end of script---

For example, it turns this (which is badly-formed but legal):

   first,"",unquoted ,""this" is, quoted " ,, sub "quote" inside, f", lone  " empty:

into this:

   first|""|unquoted|""this" is, quoted "||sub "quote" inside|f"|lone  "   empty:

Note that the script preserves the "double-quote" marks, but changes only the commas where they are used as field separators. I have used the vertical bar "|" because it's easier to read, but you may change this to another field separator if you wish.

If your CSV datafile is more complex, it would probably not be worth the effort to write it in sed. For such a case, you should use Perl with a dedicated CSV module (there are at least two recent CSV parsers available from CPAN).

The sed FAQ
Prev Home Next

 
 
   Reprinted courtesy of Eric Pement. Also available at https://sed.sourceforge.net/sedfaq.html Design by Interspire