问题描述:

I have two files:

adjective,adverb,participle,verb

0,2,3,5,

1,2,5,6

and

adjective,adjunct,adverbial,participle,verb

0,2,3,5,4

1,2,5,6,5

1,2,5,6,5

I want to get output like this:

adjective,adjunct,adverb,adverbial,participle,verb

0,2,0,3,5,4

1,2,0,5,6,5

1,2,0,5,6,5

So that the columns were merged based on the headers and sorted in the alphabetic order. I do not care about preserving the numbers from the second files in added columns, they can be filled with 0. The important part is to add the columns that are missing and sort them in the alphabetic order.

Join does not help as it joins only by one column. Any ideas?

网友答案:

I don't understand why join isn't an option:

join -t, -a 1 -o 0,2.2,1.2,2.3,1.3,2.5 file1 file2 
adjective,adjunct,adverb,adverbial,participle,verb
0,2,2,3,3,4
1,2,2,5,5,5
1,2,2,5,5,5

-a specified the join field for each file, and -o specifies the output format (which fields from which file)


I may come back to this later. In the meantime, you can extract the merged column headers like this:

paste -d , file1 file2 | sed 1q | tr , '\n' | sed 's/  *$//' | sort -u | paste -d, -s 
adjective,adjunct,adverb,adverbial,participle,verb

OK, a GNU awk-only answer:

  • this reads the headers of file1 and the headers of file2 to get a unique set of headers.
  • uses the PROCINFO["sorted_in"] feature of gawk to traverse an associative array by lexically sorted order of the indices
gawk -F, '
    NR == 1 {
        n = split($0, f1cols, /,/)
        for (i=1; i<=n; i++) 
            allcols[f1cols[i]] = 1 
    }
    NR == FNR {next} # because you do not care about the values
    FNR == 1 {
        n = split($0, f2cols, /,/)
        for (i=1; i<=n; i++) {
            allcols[f2cols[i]] = 1
            f2colidx[f2cols[i]] = i
        }
        PROCINFO["sorted_in"] = "@ind_str_asc"
        sep = ""
        for (head in allcols) {
            printf "%s%s", sep, head
            sep = FS
        }
        print ""
        next
    }
    {
        sep = ""
        for (col in allcols) {
            val = (col in f2colidx) ? $(f2colidx[col]) : 0
            printf "%s%s", sep, val
            sep = FS
        }
        print ""
    }
' file1 file2
adjective,adjunct,adverb,adverbial,participle,verb
0,2,0,3,5,4
1,2,0,5,6,5
1,2,0,5,6,5
网友答案:

I used a solution similar to this. Somehow I managed to apply awk to it and it seems to do what I want.

head -1 -q annotation1.csv  annotation2.csv | tr , "\n" | sort | uniq > header.txt
header="header.txt"

awk  -F, -v colsFile="$header" -v OFS=','  'BEGIN {
        j=1
        while ((getline < colsFile) > 0) col[j++] = $1
        n=j-1;
        close(colsFile)
        for (i=1; i<=n; i++) {
           s[col[i]]=i 
           printf(col[i])"," 
        }
        print""
     }
     NR==1 {
        for (f=1; f<=NF; f++) c[s[$f]]=f
        next
     }
     { 
       for (f=1; f<=n; f++) 
           if (c[f] == "") {printf 0","} else printf $(c[f])","
       print ""
   }' $1 
相关阅读:
Top